nishio #gleninjapan In the lightning talk, I didn't delve into the mathematical details due to limited time. For a more detailed explanation, I created diagrams by embedding the meanings of words into a vector space using an LLM.
nishio This is a two-dimensional visualization of word meanings embedded in a high-dimensional space using OpenAI's text embedding API. In simple terms, it demonstrates how an AI recognizes similarity in meaning between words.
nishio Plotting two languages on a single chart is not a straightforward task. In this chart, the first principal component from PCA, which captures the difference between the two languages, has been removed before plotting.
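The step above can be sketched as follows. This is a minimal illustration, not the exact script behind the chart: it assumes the embeddings have already been fetched (in practice they would come from OpenAI's embedding API), so synthetic vectors stand in for real ones. Two "languages" are simulated by offsetting the same set of meaning vectors along one direction; PCA via SVD then identifies that offset as the first principal component, which is dropped.

```python
import numpy as np

def remove_first_pc(X):
    """Center the embeddings, find principal components via SVD,
    and return 2-D coordinates from PC2 and PC3, skipping PC1
    (which here captures the between-language difference)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt.T      # scores on each principal component
    return coords[:, 1:3]   # drop PC1, keep PC2 and PC3 for plotting

# Synthetic stand-in for API embeddings: six "meanings" in 8 dimensions,
# duplicated with a large constant offset to mimic a second language.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 8))
lang_offset = np.zeros(8)
lang_offset[0] = 5.0
X = np.vstack([base, base + lang_offset])  # rows 0-5: lang A, 6-11: lang B

xy = remove_first_pc(X)  # translation pairs now land near each other
```

After removing PC1, each word and its counterpart in the other language sit much closer together than in the raw embedding space, which is what makes a single bilingual chart readable.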
nishio Here is the annotated version. The plotted words are a combination of ones I chose myself and ones GPT-4 suggested as similar. It shows that GPT-4 cannot find an English word close to the Japanese 納得 (nattoku); "understanding" and "agreement" are the main explanations given in dictionaries.
nishio One word can bridge multiple concepts. In this example, the Japanese word "納得" (nattoku) serves as a bridge connecting concepts like "understanding", "agreement", and "satisfaction". Similarly, in Mandarin, "數位" (shùwèi) connects concepts like "digital" and "plural".
nishio In the mapping from a high-dimensional space (H) to a low-dimensional space (L), objects that are close in H generally remain close in L. However, there is no guarantee that objects far apart in H will also be far apart in L.
nishio You can think of it like the shadow of a three-dimensional object. Therefore, it is separation, not proximity, in the low-dimensional space that is informative: points far apart in L must also be far apart in H.
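The shadow analogy can be checked numerically. The toy points below are my own illustration: projecting onto the xy-plane (dropping z) can collapse two distant 3-D points onto the same spot, while an orthogonal projection never increases distance, so separation in the shadow guarantees separation in 3-D.

```python
import numpy as np

def shadow(p):
    """Orthogonal projection onto the xy-plane: drop the z coordinate."""
    return p[:2]

# Two points far apart in 3-D (H) along the z-axis...
a = np.array([0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 10.0])

dist_H = np.linalg.norm(a - b)                  # far apart in 3-D
dist_L = np.linalg.norm(shadow(a) - shadow(b))  # their shadows coincide

# The reverse direction is reliable: a projection never increases
# distance, so points far apart in the shadow are far apart in 3-D.
c = np.array([7.0, 0.0, 1.0])
assert np.linalg.norm(shadow(a) - shadow(c)) <= np.linalg.norm(a - c)
```

This is why overlapping clusters in the 2-D chart should be read cautiously, while clearly separated clusters reflect genuine distance in the embedding space.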