OpenAI-Clip

Introduction

A multi-modal foundation model for vision and language tasks such as image/text similarity and zero-shot image classification.
Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like Transformer to extract visual features and a causal language model to extract text features. Because both encoders map into a shared embedding space, the image and text features can be compared directly and used for a variety of zero-shot learning tasks.
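
The zero-shot step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual inference code: the embeddings are made-up 3-dimensional placeholders (real CLIP embeddings come from the image and text encoders and are much larger), the helper names are hypothetical, and the logit scale of 100 is an assumed value for the learned temperature.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts.

    logit_scale=100.0 is an assumption standing in for the learned temperature.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder embeddings (hypothetical values, not real encoder outputs).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

No class-specific training is involved: the candidate classes are supplied only as text prompts, and the prompt whose embedding is closest to the image embedding wins.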

Demo Video

Specifications and Download

Technical Details

Model checkpoint: ViT-B/16
Image input resolution: 224x224
Text context length: 77 tokens
Number of parameters: 150M
Model size (float): 571 MB
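
As a sanity check on the figures above, the model size can be estimated from the parameter count. The sketch below assumes 4 bytes per parameter (32-bit floats) and that "MB" is counted in units of 1024 x 1024 bytes; both are assumptions, and the helper name is hypothetical. Since 150M is a rounded count, the estimate only needs to land near the listed size.

```python
def float32_size_mib(num_params, bytes_per_param=4):
    """Estimate weight storage in MiB, assuming every parameter is one float32."""
    return num_params * bytes_per_param / (1024 ** 2)

# 150M is the rounded parameter count from the list above.
approx_params = 150_000_000
size_mib = float32_size_mib(approx_params)
print(f"Estimated size: {size_mib:.1f} MiB")
```

The estimate comes out at roughly 572 MiB, which is consistent with the listed 571 MB once the rounding of the parameter count is taken into account.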

Applications

Image Search
Content Moderation
Caption Creation

License Information

Source Model: MIT
Deployable Model: AI-HUB-MODELS-LICENSE