Prism: mapping interpretable concepts and features in a latent space of language | thesephist.com
In the future, these "latent direction" operations could become more intuitive and flexible editing features in writing support and knowledge exploration. For example, complex stylistic elements and topical tendencies in a text might be controlled through knob-like GUI controls, letting the user directly tighten or loosen "context" and "meaning." The same idea of information manipulation could also be extended to embedding spaces in other modalities (images, audio, video), enabling creative media editing and advanced document management.
In short, this research opens new doors for humans to directly grasp and control the representations inside AI models, and could be the foundation for the next generation of interactive information manipulation and generation tools.
What methods and results has this paper achieved so far?
Using sentence embeddings as input, the authors developed a method to extract a large number of features (directions) from the embedding space as a sparse representation. This method yields tens of thousands of human-interpretable features from an embedding space of roughly 512 to 2048 dimensions.
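The core machinery here is a sparse autoencoder over sentence embeddings. As a minimal sketch (toy dimensions, random weights, training loop omitted; the real model would be trained so that each learned direction is interpretable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: real embeddings are ~512-2048-d with tens of
# thousands of learned features; scaled down here.
d_embed, n_features = 64, 512

# Randomly initialized SAE weights (stand-ins for trained ones).
W_enc = rng.normal(0, 0.1, (n_features, d_embed))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0, 0.1, (d_embed, n_features))
b_dec = np.zeros(d_embed)

def sae_encode(x):
    """Sparse feature activations: ReLU zeroes out most directions."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def sae_decode(f):
    """Reconstruct the embedding from the sparse feature vector."""
    return W_dec @ f + b_dec

x = rng.normal(size=d_embed)   # stand-in for one sentence embedding
f = sae_encode(x)              # overcomplete, sparse representation
x_hat = sae_decode(f)          # approximate reconstruction
```

Each row of `W_enc` (equivalently, column of `W_dec`) is one candidate feature direction; sparsity is what makes individual directions attributable to individual concepts.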
They show how to automatically describe and label the extracted feature directions using GPT-4, based on text examples, and how to automatically score how accurately each description represents its feature. This enables reliability evaluation across a large number of features without human intervention.
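One common way to score such descriptions (a sketch of the general idea, not necessarily the paper's exact procedure) is to ask a model to predict, from the description alone, how strongly the feature should fire on each example, then correlate predictions with the true activations. Here `predict_activation` is a hypothetical stand-in for the LLM call, replaced by a trivial keyword heuristic so the sketch runs:

```python
import numpy as np

def predict_activation(description, text):
    # Hypothetical stand-in for a GPT-4 call that guesses, from the
    # description alone, how strongly the feature fires on `text`.
    # A trivial keyword-overlap heuristic keeps this runnable.
    return float(any(w in text.lower() for w in description.lower().split()))

def score_description(description, examples, true_acts):
    """Score a description by correlating predicted activations
    with the feature's actual activations over text examples."""
    preds = np.array([predict_activation(description, t) for t in examples])
    true = np.array(true_acts, dtype=float)
    if preds.std() == 0.0 or true.std() == 0.0:
        return 0.0
    return float(np.corrcoef(preds, true)[0, 1])
```

A score near 1.0 means the description lets a reader anticipate exactly where the feature activates; a score near 0 means the label is uninformative.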
Semantic editing is achieved by using the obtained feature directions to control the style and content of a text through simple vector operations in the embedding space (e.g., adding a feature direction to the embedding) and then decoding the result back into text. A "feature gradients" method further improves editing accuracy.
Overall, this research establishes a foundational method for advanced visualization, interpretation, and editing of a model's internal representations by combining sparse feature decomposition with automatic explanation and evaluation using large language models.
Suppose we are given about 10,000 short free-text responses. What would applying this method make possible?
By vectorizing the responses with an embedding model and analyzing them with a sparse autoencoder (SAE), a large number of human-understandable features (topics, writing style, expression patterns, tendencies toward specific keywords, etc.) can be automatically extracted and visualized.
In the past, clustering could only surface vaguely "somewhat similar" responses; with interpretable feature directions available, it becomes possible to explain why responses are similar to each other and which elements characterize them, enabling more precise categorization and extraction of important features.
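Concretely, once each response has a sparse feature-activation vector, "why is this response similar to its neighbours" reduces to reading off its strongest features. A minimal sketch with random stand-in activations (real ones would come from the trained SAE):

```python
import numpy as np

rng = np.random.default_rng(2)
n_texts, n_features = 10_000, 2000  # fewer features than the real model

# Stand-in sparse activation matrix: one row per response.
acts = np.maximum(0.0, rng.normal(-1.5, 1.0, (n_texts, n_features)))

def top_features(i, k=5):
    """Indices of the k strongest features for response i --
    a human-readable explanation of what characterizes it."""
    return np.argsort(acts[i])[::-1][:k]

def responses_sharing(feature_id, threshold=0.5):
    """All responses on which a given feature fires above a threshold:
    an interpretable 'cluster' defined by a single named concept."""
    return np.nonzero(acts[:, feature_id] > threshold)[0]
```

Because each feature carries an auto-generated label, `top_features` answers "which elements characterize this response" and `responses_sharing` answers "which responses share this element", replacing opaque cluster IDs with named concepts.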
Using the extracted feature directions, it will also be possible to make specific semantic modifications to the response texts, such as nudging them in a "slightly more formal" direction or giving them a "question-like nuance." This is a new mode of operation that allows fine-grained adjustment and unification of semantic units while keeping an overview of the full set of responses.
In summary, applying this method to the 10,000 open-ended responses would allow the response set to be understood and analyzed more precisely, surfacing latent conceptual and stylistic features while using them to semantically organize and edit the texts.
In such a situation, where the data is mostly homogeneous and only some responses (about 10%) form distinct clusters, the following perspectives apply.
In general, a uniform data distribution may still look uniform after applying this method. But that is not a failure to resolve structure; rather, it can serve to reconfirm, from the perspective of feature directions, that the data really does have that homogeneity. And for the regions that are even partially specialized into clusters, the features act as a clue to a deeper understanding than before.
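One simple diagnostic for this split (a sketch under assumed synthetic data, not a procedure from the paper) is each feature's firing rate across the corpus: near-universal features describe the homogeneous bulk, while features firing on roughly 10% of responses point at the small distinct clusters.

```python
import numpy as np

rng = np.random.default_rng(3)
n_texts, n_features = 1000, 200

# Synthetic SAE activations: mostly sparse noise...
acts = np.maximum(0.0, rng.normal(-2.0, 1.0, (n_texts, n_features)))
# ...plus one feature shared by the homogeneous bulk (feature 0)
acts[:, 0] = 1.0
# ...and one feature marking a distinct ~10% cluster (feature 1)
acts[:, 1] = 0.0
acts[:100, 1] = 1.0

# Fraction of responses each feature fires on.
firing_rate = (acts > 0).mean(axis=0)

# Near-universal features confirm the homogeneity of the bulk;
# features firing on ~5-15% of responses flag the small clusters.
bulk_features = np.nonzero(firing_rate > 0.8)[0]
cluster_features = np.nonzero((firing_rate > 0.05) & (firing_rate < 0.15))[0]
```

Reading the auto-generated labels of `bulk_features` states *what* the corpus is homogeneous about, while the labels of `cluster_features` name what distinguishes the ~10% minority.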
This page is auto-translated from [/nishio/Prism: mapping interpretable concepts and features in a latent space of language](https://scrapbox.io/nishio/Prism: mapping interpretable concepts and features in a latent space of language) using DeepL. If you see something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.