NISHIO Hirokazu

pVectorSearch

2024-04-02

Vector search for /plurality-japanese

prev

reading

$ git clone https://github.com/nishio/omoikane-embed-core plurality-japanese-embed

$ pip install -r requirements.txt
ModuleNotFoundError: No module named 'distutils'

Ensure distutils is Installed: distutils was deprecated in Python 3.10 (PEP 632) and removed from the standard library in Python 3.12. On Python 3.12 or later, consider using setuptools instead for package management and distribution.

Ah, I see. It was 3.10 when I made this, but now it's 3.12.
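
The situation can be checked from Python itself. The following is a minimal sketch, not part of the original repository; the helper name distutils_in_stdlib is made up for illustration. It relies only on the fact that distutils was removed from the standard library in Python 3.12.

```python
import importlib.util
import sys


def distutils_in_stdlib(version_info=sys.version_info):
    """Return True if this Python version still ships distutils in the stdlib.

    distutils was deprecated in 3.10 (PEP 632) and removed in 3.12.
    """
    return tuple(version_info[:2]) < (3, 12)


if __name__ == "__main__":
    # find_spec returns None when no module named distutils can be found.
    # Note: an installed setuptools can provide a distutils shim, so this
    # may report True even on 3.12 or later.
    importable = importlib.util.find_spec("distutils") is not None
    print("shipped with stdlib:", distutils_in_stdlib())
    print("importable right now:", importable)
```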

After various fixes, it ran.

:

% python make_vecs_from_json/main.py
processing 769 pages
100%|██████████████████████████| 769/769 [00:05<00:00, 139.02it/s]
total tasks: 7470,  0.0% was cached
processing 7470 tasks in 150 batches
100%|██████████████████████████| 150/150 [06:29<00:00,  2.60s/it]
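
The log suggests that the script first skips tasks whose embeddings are already cached and then sends the rest in fixed-size batches (7470 tasks in 150 batches is consistent with 50 per batch). A minimal sketch of that pattern follows; the function names, the dict-based cache, and the batch size of 50 are assumptions inferred from the log, not the actual code in omoikane-embed-core.

```python
from typing import Dict, Iterator, List


def uncached(tasks: List[str], cache: Dict[str, List[float]]) -> List[str]:
    """Keep only the texts that have no embedding in the cache yet."""
    return [text for text in tasks if text not in cache]


def batched(items: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    tasks = [f"chunk-{i}" for i in range(7470)]  # placeholder texts
    cache: Dict[str, List[float]] = {}           # empty, i.e. 0.0% cached
    todo = uncached(tasks, cache)
    batches = list(batched(todo))
    print(f"processing {len(todo)} tasks in {len(batches)} batches")
```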

image

upload :

% python upload_vecs/main.py 
uploading plurality-japanese.pickle
100%|██████████████████████████| 74/74 [00:24<00:00,  3.06it/s]
OK

before/after: image image

Experiment with blocksize=100

  • While waiting for the results, worked on the view in parallel
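
As an illustration of what changing blocksize means, here is a minimal sketch that splits an already-tokenized page into chunks of at most block_size tokens. The function name and the use of a plain token list are assumptions; the real script's tokenizer and splitting rules may differ (for example, it may respect line boundaries).

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], block_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most block_size tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


if __name__ == "__main__":
    page = list(range(1200))  # stand-in for the token IDs of one page
    for size in (500, 100):
        chunks = split_into_chunks(page, size)
        print(size, len(chunks))
```

For a page of fixed length, block_size=100 yields roughly five times as many chunks as block_size=500, so embedding it takes correspondingly more requests.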

Result :

% python make_vecs_from_json/main.py
processing 769 pages
100%|██████████████████████████| 769/769 [00:03<00:00, 224.82it/s]
total tasks: 19866,  13.4% was cached
processing 17205 tasks in 345 batches
100%|██████████████████████████| 345/345 [12:19<00:00,  2.14s/it]
% python upload_vecs/main.py        
uploading plurality-japanese.pickle
100%|██████████████████████████| 239/239 [01:18<00:00,  3.05it/s]
OK
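
The numbers in this log are mutually consistent, which can be checked with a little arithmetic. The batch size of 50 is an assumption inferred from the log rather than read from the code.

```python
import math

total_tasks = 19866      # "total tasks" in the log
processed_tasks = 17205  # tasks actually sent for embedding
batch_size = 50          # assumption inferred from the log

cached_tasks = total_tasks - processed_tasks
cached_percent = round(100 * cached_tasks / total_tasks, 1)
batches = math.ceil(processed_tasks / batch_size)
# 12:19 elapsed in the progress bar, converted to seconds per batch.
seconds_per_batch = round((12 * 60 + 19) / batches, 2)

print(cached_tasks, cached_percent, batches, seconds_per_batch)
```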

image

image

The run with the smaller chunks cost about $0.36.

view

% git clone https://github.com/nishio/omoikane-vecsearch plurality-vecsearch-ja

% npm install

  • Ran npm audit fix --force and sent the changes back to omoikane-vecsearch
  • % npm run dev
    • Confirmed that search works properly locally

% git remote rename origin upstream
% git remote add origin https://github.com/nishio/plurality-vecsearch-ja.git
% git branch -M main
% git push -u origin main

Open the Vercel dashboard

image

Build and deploy succeeded, but the project to search isn't configured yet.

before / after: image image

after: image

Hmm, well, improving this can wait.

Released!

What I did today, from /plurality-japanese/ベクトル検索の改善 (improving vector search)

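  • ✅ Create a separate service that contains only Japanese
  • Chunk improvements
    • ✅ In addition to the existing 500-token chunks, also include 100-token chunks
    • ✅ Make sure at most one chunk per page appears in the hits
  • About adding data
    • ✅ 1: First, this Scrapbox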
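
The "at most one chunk per page" rule can be sketched as a post-processing step over ranked hits. This is a minimal sketch under assumptions: the Hit fields and the function name are hypothetical, and the actual service may do this filtering elsewhere (for example in the view).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    page: str     # title of the page the chunk came from
    text: str     # chunk text
    score: float  # similarity score, higher is better


def best_hit_per_page(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the highest-scoring chunk for each page, best first."""
    seen = set()
    result = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.page not in seen:
            seen.add(hit.page)
            result.append(hit)
    return result
```

Because the 500-token and 100-token chunks cover the same text, a single page could otherwise occupy several of the top results.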

Result of a vector search for 「ベクトル検索」 ("vector search")

  • You can see that it also matches things like "Vector Search"
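
Matches like this happen because vector search ranks texts by the similarity of their embedding vectors rather than by shared characters. A minimal sketch of cosine similarity follows; the three-dimensional vectors are made-up toy values for illustration, not real embeddings, and the choice of cosine similarity as the metric is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors (made up): two texts with similar meaning, one unrelated.
    query = [0.9, 0.1, 0.0]       # stands in for the Japanese query
    same_topic = [0.8, 0.2, 0.1]  # stands in for "Vector Search"
    other = [0.0, 0.1, 0.9]       # stands in for an unrelated text
    print(cosine_similarity(query, same_topic) > cosine_similarity(query, other))
```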

2024-04-04

Fixing the problem where it wasn't running on GitHub Actions

build

Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3, actions/setup-python@v4. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.

:

The conflict is caused by:
    The user requested protobuf==5.26.1
    grpcio-tools 1.62.1 depends on protobuf<5.0dev and >=4.21.6

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
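
The conflict can be seen by comparing the pinned version against the range grpcio-tools accepts. The sketch below uses a simplified comparison on numeric release tuples; it is not pip's resolver and it ignores pre-release rules (so <5.0dev is approximated as <5.0). The helper names are hypothetical.

```python
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as '5.26.1' into (5, 26, 1)."""
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper, comparing numeric release tuples."""
    return release_tuple(lower) <= release_tuple(version) < release_tuple(upper)


if __name__ == "__main__":
    # grpcio-tools 1.62.1 wants protobuf>=4.21.6 and <5.0dev (approximated as <5.0).
    print(in_range("5.26.1", "4.21.6", "5.0"))  # the pinned version
    print(in_range("4.25.3", "4.21.6", "5.0"))  # an example 4.x version
```

Options 1 and 2 in pip's message both amount to loosening or removing the protobuf==5.26.1 pin so that a version inside this range can be selected.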

diff

Pushed to a branch with a different name

image

image image image

image


(C)NISHIO Hirokazu / Converted from Markdown (ja)
Source: [GitHub] / [Scrapbox]