Vector searches do not always hit other languages with the same meaning.
31806 :Yasukazu Nishio2023/4/3(Mon.) 14:50 I don't have enough time at all, but a note on what I'd like to try when I have more time. The fact that the English and Japanese embedding vectors are far apart. There is a possibility that there is a 1-3 dimensional component that expresses language differences, and if we can destroy it, we can determine similarity regardless of language. Then you can answer a question in Japanese based on Wikipedia in English. If this "similar content determination regardless of language" can be done, the cost of multilingual deployment of help should be drastically reduced in the Cybozu context.
31828 :Yasukazu Nishio2023/4/6(Thu) 0:10 All English translations of my Scrapbox and differential updates in Github Actions are finally working. This would make a 31806 experiment. We have created embed vectors for all the Japanese articles. 1: Create an embed vector of all English articles 2: Produce the percentage of each article's nearest article that is a "translated article" and the percentage that is an "article in the same language". 3: Try to get the difference between the bilingual vectors and average them (is it also useful to see how much they vary? Is the median better than the mean? Is it better to get the top 3 dimensions of the contribution ratio?) 4: Do 2 after crushing the dimension of 3 Something like this? To be continued tomorrow
For the 3 bilingual vectors, I'd need to multiply the PCA to get the vectors in order of contribution rate, and also to check how many dimensions there are.
31836 :Yasukazu Nishio2023/4/6(Thu) 21:09 1: I thought I could just use the same Japanese vectors I used before, but that would have complicated the mapping to English, so I decided to redo them all together.
This page is auto-translated from /nishio/実験2023-04-06 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.