CLIP

In-depth explanation of OpenAI's new image classification model CLIP from the paper! | DeepSquare

CLIP is an image-and-text model that maps text and images to a shared vector space. Source: clip-ViT-L-14 https://huggingface.co/sentence-transformers/clip-ViT-L-14
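To make "shared vector space" concrete, here is a minimal sketch of how an image embedding can be compared with text embeddings once both live in the same space. The three-dimensional vectors and the helper names `cosine_similarity` and `best_caption` are made up for illustration; they are not real CLIP outputs or part of any CLIP API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for this sketch (real CLIP vectors have hundreds of dimensions).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a painting of a cat": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}

def best_caption(image_vec, captions):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(captions, key=lambda c: cosine_similarity(image_vec, captions[c]))

print(best_caption(image_embedding, text_embeddings))
```

Because text and images share one space, the same similarity function works across the two modalities, which is what makes caption-based image classification possible.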

The vocabulary size is 49408, with BOS=49406 and EOS=49407.

>>> clip.tokenize("a painting of a cat")
tensor([[49406,   320,  3086,   539,   320,  2368, 49407,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], dtype=torch.int32)
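The layout of that output can be sketched in plain Python: BOS, then the token ids, then EOS, then zero padding up to a fixed length of 77 entries. This is only an illustration of the layout, not the library's implementation; `pad_to_context` is a hypothetical helper, and the token ids are copied from the output above.

```python
BOS, EOS, PAD = 49406, 49407, 0
CONTEXT_LENGTH = 77  # number of entries in the row shown above

def pad_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Wrap token ids in BOS/EOS and right-pad with zeros (hypothetical helper)."""
    row = [BOS] + list(token_ids) + [EOS]
    if len(row) > context_length:
        raise ValueError("input is too long for the context length")
    return row + [PAD] * (context_length - len(row))

# Ids for "a painting of a cat", taken from the output above.
cat_ids = [320, 3086, 539, 320, 2368]
row = pad_to_context(cat_ids)
print(len(row), row[:8])
```

The id 320 appears twice, which lines up with the two occurrences of "a" in the input.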

An unfamiliar word is split into subwords:

>>> clip.tokenize("bozuman")
tensor([[49406,   647,  4091,   786, 49407,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], dtype=torch.int32)
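To see the subword split in numbers, the sketch below strips BOS, EOS, and padding from a row and counts what remains. `content_tokens` is a hypothetical helper written for this note, and the rows are rebuilt by hand from the two outputs on this page.

```python
BOS, EOS = 49406, 49407

def content_tokens(row):
    """Return the ids between BOS and the first EOS (hypothetical helper)."""
    start = row.index(BOS) + 1
    end = row.index(EOS, start)
    return row[start:end]

# Rows rebuilt from the outputs on this page (77 entries each).
cat_row = [49406, 320, 3086, 539, 320, 2368, 49407] + [0] * 70
bozuman_row = [49406, 647, 4091, 786, 49407] + [0] * 72

print(len(content_tokens(cat_row)))      # five words in the input
print(len(content_tokens(bozuman_row)))  # a single word in the input
```

The five-word sentence yields five ids, one per word, while the single word "bozuman" yields three ids, so it was broken into three subword pieces.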

This page is auto-translated from /nishio/CLIP using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.

(C)NISHIO Hirokazu / Converted from Markdown (en)
Source: [GitHub] / [Scrapbox]