Add ability to use Jina CLIP V2 for semantic search (#16826)

* add wheels

* move extra index url to bottom

* config model option

* add postprocess

* fix config

* jina v2 embedding class

* use jina v2 in embeddings

* fix ov inference

* frontend

* update reference config

* revert device

* fix truncation

* return np tensors

* use correct embeddings from inference

* manual preprocess

* clean up

* docs

* lower batch size for v2 only

* docs clarity

* wording
Josh Hawkins 2025-02-26 08:58:25 -06:00 committed by GitHub
parent 447f26e1b9
commit d0e9bcbfdc
10 changed files with 380 additions and 49 deletions

@@ -54,7 +54,6 @@ pywebpush == 2.0.*
pyclipper == 1.3.*
shapely == 2.0.*
Levenshtein==0.26.*
prometheus-client == 0.21.*
# HailoRT Wheels
appdirs==1.4.*
argcomplete==2.0.*

@@ -536,6 +536,8 @@ semantic_search:
enabled: False
# Optional: Re-index embeddings database from historical tracked objects (default: shown below)
reindex: False
# Optional: Set the model used for embeddings. (default: shown below)
model: "jinav1"
# Optional: Set the model size used for embeddings. (default: shown below)
# NOTE: small model runs on CPU and large model runs on GPU
model_size: "small"

@@ -5,7 +5,7 @@ title: Semantic Search
Semantic Search in Frigate allows you to find tracked objects within your review items using either the image itself, a user-defined text description, or an automatically generated one. This feature works by creating _embeddings_ — numerical vector representations — for both the images and text descriptions of your tracked objects. By comparing these embeddings, Frigate assesses their similarities to deliver relevant search results.
Frigate uses [Jina AI's CLIP model](https://huggingface.co/jinaai/jina-clip-v1) to create and save embeddings to Frigate's database. All of this runs locally.
Frigate uses models from [Jina AI](https://huggingface.co/jinaai) to create and save embeddings to Frigate's database. All of this runs locally.
Semantic Search is accessed via the _Explore_ view in the Frigate UI.
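At its core, comparing embeddings comes down to a vector-similarity measure such as cosine similarity. The following is a minimal conceptual sketch with placeholder vectors, not Frigate's actual search code (the real comparison happens against embeddings stored in Frigate's database):

```python
import numpy as np

# Placeholder 768-dimensional vectors standing in for a thumbnail embedding
# and a text-query embedding produced by the CLIP image and text encoders.
image_embedding = np.random.rand(768).astype(np.float32)
text_embedding = np.random.rand(768).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher values mean the image and the query are semantically closer."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(image_embedding, text_embedding))
```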
@@ -35,23 +35,47 @@ If you are enabling Semantic Search for the first time, be advised that Frigate
:::
### Jina AI CLIP
### Jina AI CLIP (version 1)
The vision model is able to embed both images and text into the same vector space, which allows `image -> image` and `text -> image` similarity searches. Frigate uses this model on tracked objects to encode the thumbnail image and store it in the database. When searching for tracked objects via text in the search box, Frigate will perform a `text -> image` similarity search against this embedding. When clicking "Find Similar" in the tracked object detail pane, Frigate will perform an `image -> image` similarity search to retrieve the closest matching thumbnails.
The [V1 model from Jina](https://huggingface.co/jinaai/jina-clip-v1) has a vision model which is able to embed both images and text into the same vector space, which allows `image -> image` and `text -> image` similarity searches. Frigate uses this model on tracked objects to encode the thumbnail image and store it in the database. When searching for tracked objects via text in the search box, Frigate will perform a `text -> image` similarity search against this embedding. When clicking "Find Similar" in the tracked object detail pane, Frigate will perform an `image -> image` similarity search to retrieve the closest matching thumbnails.
The text model is used to embed tracked object descriptions and perform searches against them. Descriptions can be created, viewed, and modified on the Explore page when clicking on the thumbnail of a tracked object. See [the Generative AI docs](/configuration/genai.md) for more information on how to automatically generate tracked object descriptions.
The V1 text model is used to embed tracked object descriptions and perform searches against them. Descriptions can be created, viewed, and modified on the Explore page when clicking on the thumbnail of a tracked object. See [the Generative AI docs](/configuration/genai.md) for more information on how to automatically generate tracked object descriptions.
Differently weighted versions of the Jina model are available and can be selected by setting the `model_size` config option as `small` or `large`:
Differently weighted versions of the Jina models are available and can be selected by setting the `model_size` config option as `small` or `large`:
```yaml
semantic_search:
enabled: True
model: "jinav1"
model_size: small
```
- Configuring the `large` model employs the full Jina model and will automatically run on the GPU if applicable.
- Configuring the `small` model employs a quantized version of the Jina model that uses less RAM and runs on CPU with a negligible difference in embedding quality.
### Jina AI CLIP (version 2)
Frigate also supports the [V2 model from Jina](https://huggingface.co/jinaai/jina-clip-v2), which introduces multilingual support (89 languages). In contrast, the V1 model only supports English.
V2 offers only a 3% performance improvement over V1 in both text-image and text-text retrieval tasks, an upgrade that is unlikely to yield noticeable real-world benefits. Additionally, V2 has _significantly_ higher RAM and GPU requirements, leading to increased inference time and memory usage. If you plan to use V2, ensure your system has ample RAM and a discrete GPU. CPU inference (with the `small` model) using V2 is not recommended.
To use the V2 model, update the `model` parameter in your config:
```yaml
semantic_search:
enabled: True
model: "jinav2"
model_size: large
```
For most users, especially native English speakers, the V1 model remains the recommended choice.
:::note
Switching between V1 and V2 requires reindexing your embeddings. To do this, set `reindex: True` in your Semantic Search configuration and restart Frigate. The embeddings from V1 and V2 are incompatible, and failing to reindex will result in incorrect search results.
:::
### GPU Acceleration
The CLIP models are downloaded in ONNX format, and the `large` model can be accelerated using GPU hardware, when available. This depends on the Docker build that is used.
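If you are unsure whether your environment can actually use a GPU for the ONNX models, a quick hedged check is to list the execution providers ONNX Runtime was built with (this assumes a Python environment with `onnxruntime` installed; provider names vary by build):

```python
import onnxruntime as ort

# GPU-capable builds list an accelerator provider such as "CUDAExecutionProvider",
# "ROCMExecutionProvider", or "OpenVINOExecutionProvider" ahead of "CPUExecutionProvider".
print(ort.get_available_providers())
```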

@@ -1,3 +1,4 @@
from enum import Enum
from typing import Dict, List, Optional
from pydantic import Field
@@ -11,6 +12,11 @@ __all__ = [
]
class SemanticSearchModelEnum(str, Enum):
jinav1 = "jinav1"
jinav2 = "jinav2"
class BirdClassificationConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable bird classification.")
threshold: float = Field(
@@ -30,7 +36,11 @@ class ClassificationConfig(FrigateBaseModel):
class SemanticSearchConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable semantic search.")
reindex: Optional[bool] = Field(
default=False, title="Reindex all detections on startup."
default=False, title="Reindex all tracked objects on startup."
)
model: Optional[SemanticSearchModelEnum] = Field(
default=SemanticSearchModelEnum.jinav1,
title="The CLIP model to use for semantic search.",
)
model_size: str = Field(
default="small", title="The size of the embeddings model used."

@@ -10,6 +10,7 @@ from playhouse.shortcuts import model_to_dict
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import FrigateConfig
from frigate.config.classification import SemanticSearchModelEnum
from frigate.const import (
CONFIG_DIR,
UPDATE_EMBEDDINGS_REINDEX_PROGRESS,
@@ -23,6 +24,7 @@ from frigate.util.builtin import serialize
from frigate.util.path import get_event_thumbnail_bytes
from .onnx.jina_v1_embedding import JinaV1ImageEmbedding, JinaV1TextEmbedding
from .onnx.jina_v2_embedding import JinaV2Embedding
logger = logging.getLogger(__name__)
@@ -75,18 +77,7 @@ class Embeddings:
# Create tables if they don't exist
self.db.create_embeddings_tables()
models = [
"jinaai/jina-clip-v1-text_model_fp16.onnx",
"jinaai/jina-clip-v1-tokenizer",
"jinaai/jina-clip-v1-vision_model_fp16.onnx"
if config.semantic_search.model_size == "large"
else "jinaai/jina-clip-v1-vision_model_quantized.onnx",
"jinaai/jina-clip-v1-preprocessor_config.json",
"facenet-facenet.onnx",
"paddleocr-onnx-detection.onnx",
"paddleocr-onnx-classification.onnx",
"paddleocr-onnx-recognition.onnx",
]
models = self.get_model_definitions()
for model in models:
self.requestor.send_data(
@@ -97,17 +88,64 @@
},
)
self.text_embedding = JinaV1TextEmbedding(
model_size=config.semantic_search.model_size,
requestor=self.requestor,
device="CPU",
if self.config.semantic_search.model == SemanticSearchModelEnum.jinav2:
# Single JinaV2Embedding instance for both text and vision
self.embedding = JinaV2Embedding(
model_size=self.config.semantic_search.model_size,
requestor=self.requestor,
device="GPU"
if self.config.semantic_search.model_size == "large"
else "CPU",
)
self.text_embedding = lambda input_data: self.embedding(
input_data, embedding_type="text"
)
self.vision_embedding = lambda input_data: self.embedding(
input_data, embedding_type="vision"
)
else: # Default to jinav1
self.text_embedding = JinaV1TextEmbedding(
model_size=config.semantic_search.model_size,
requestor=self.requestor,
device="CPU",
)
self.vision_embedding = JinaV1ImageEmbedding(
model_size=config.semantic_search.model_size,
requestor=self.requestor,
device="GPU" if config.semantic_search.model_size == "large" else "CPU",
)
def get_model_definitions(self):
# Version-specific models
if self.config.semantic_search.model == SemanticSearchModelEnum.jinav2:
models = [
"jinaai/jina-clip-v2-tokenizer",
"jinaai/jina-clip-v2-model_fp16.onnx"
if self.config.semantic_search.model_size == "large"
else "jinaai/jina-clip-v2-model_quantized.onnx",
"jinaai/jina-clip-v2-preprocessor_config.json",
]
else: # Default to jinav1
models = [
"jinaai/jina-clip-v1-text_model_fp16.onnx",
"jinaai/jina-clip-v1-tokenizer",
"jinaai/jina-clip-v1-vision_model_fp16.onnx"
if self.config.semantic_search.model_size == "large"
else "jinaai/jina-clip-v1-vision_model_quantized.onnx",
"jinaai/jina-clip-v1-preprocessor_config.json",
]
# Add common models
models.extend(
[
"facenet-facenet.onnx",
"paddleocr-onnx-detection.onnx",
"paddleocr-onnx-classification.onnx",
"paddleocr-onnx-recognition.onnx",
]
)
self.vision_embedding = JinaV1ImageEmbedding(
model_size=config.semantic_search.model_size,
requestor=self.requestor,
device="GPU" if config.semantic_search.model_size == "large" else "CPU",
)
return models
def embed_thumbnail(
self, event_id: str, thumbnail: bytes, upsert: bool = True
@@ -244,7 +282,11 @@ class Embeddings:
# Get total count of events to process
total_events = Event.select().count()
batch_size = 32
batch_size = (
4
if self.config.semantic_search.model == SemanticSearchModelEnum.jinav2
else 32
)
current_page = 1
totals = {

@@ -72,6 +72,9 @@ class BaseEmbedding(ABC):
return image
def _postprocess_outputs(self, outputs: any) -> any:
return outputs
def __call__(
self, inputs: list[str] | list[Image.Image] | list[str]
) -> list[np.ndarray]:
@@ -91,5 +94,7 @@
else:
logger.warning(f"Expected input '{key}' not found in onnx_inputs")
embeddings = self.runner.run(onnx_inputs)[0]
outputs = self.runner.run(onnx_inputs)[0]
embeddings = self._postprocess_outputs(outputs)
return [embedding for embedding in embeddings]
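The `_postprocess_outputs` hook added here gives subclasses one place to reshape raw ONNX outputs before `__call__` splits them into per-item embeddings. A standalone sketch of the pattern (class and method names are illustrative, not Frigate's; the real override is `JinaV2Embedding._postprocess_outputs` in the next file):

```python
import numpy as np

class EmbeddingBase:
    """Illustrative stand-in for Frigate's BaseEmbedding, not the real class."""

    def _postprocess_outputs(self, outputs: np.ndarray) -> np.ndarray:
        # Default hook: return ONNX outputs unchanged.
        return outputs

    def to_embeddings(self, raw_outputs: np.ndarray) -> list[np.ndarray]:
        # Mirrors the tail of __call__ above: postprocess, then split per item.
        outputs = self._postprocess_outputs(raw_outputs)
        return [embedding for embedding in outputs]

class TruncatingEmbedding(EmbeddingBase):
    def _postprocess_outputs(self, outputs: np.ndarray) -> np.ndarray:
        # A subclass can reshape outputs, e.g. keep only the first 768 dimensions.
        return outputs[..., :768]

print(TruncatingEmbedding().to_embeddings(np.ones((2, 1024), dtype=np.float32))[0].shape)  # (768,)
```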

@@ -0,0 +1,231 @@
"""JinaV2 Embeddings."""
import io
import logging
import os
import numpy as np
from PIL import Image
from transformers import AutoTokenizer
from transformers.utils.logging import disable_progress_bar, set_verbosity_error
from frigate.comms.inter_process import InterProcessRequestor
from frigate.const import MODEL_CACHE_DIR, UPDATE_MODEL_STATE
from frigate.types import ModelStatusTypesEnum
from frigate.util.downloader import ModelDownloader
from .base_embedding import BaseEmbedding
from .runner import ONNXModelRunner
# disables the progress bar and download logging for downloading tokenizers and image processors
disable_progress_bar()
set_verbosity_error()
logger = logging.getLogger(__name__)
class JinaV2Embedding(BaseEmbedding):
def __init__(
self,
model_size: str,
requestor: InterProcessRequestor,
device: str = "AUTO",
embedding_type: str = None,
):
model_file = (
"model_fp16.onnx" if model_size == "large" else "model_quantized.onnx"
)
super().__init__(
model_name="jinaai/jina-clip-v2",
model_file=model_file,
download_urls={
model_file: f"https://huggingface.co/jinaai/jina-clip-v2/resolve/main/onnx/{model_file}",
"preprocessor_config.json": "https://huggingface.co/jinaai/jina-clip-v2/resolve/main/preprocessor_config.json",
},
)
self.tokenizer_file = "tokenizer"
self.embedding_type = embedding_type
self.requestor = requestor
self.model_size = model_size
self.device = device
self.download_path = os.path.join(MODEL_CACHE_DIR, self.model_name)
self.tokenizer = None
self.image_processor = None
self.runner = None
files_names = list(self.download_urls.keys()) + [self.tokenizer_file]
if not all(
os.path.exists(os.path.join(self.download_path, n)) for n in files_names
):
logger.debug(f"starting model download for {self.model_name}")
self.downloader = ModelDownloader(
model_name=self.model_name,
download_path=self.download_path,
file_names=files_names,
download_func=self._download_model,
)
self.downloader.ensure_model_files()
else:
self.downloader = None
ModelDownloader.mark_files_state(
self.requestor,
self.model_name,
files_names,
ModelStatusTypesEnum.downloaded,
)
self._load_model_and_utils()
logger.debug(f"models are already downloaded for {self.model_name}")
def _download_model(self, path: str):
try:
file_name = os.path.basename(path)
if file_name in self.download_urls:
ModelDownloader.download_from_url(self.download_urls[file_name], path)
elif file_name == self.tokenizer_file:
if not os.path.exists(os.path.join(path, self.model_name)):
logger.info(f"Downloading {self.model_name} tokenizer")
tokenizer = AutoTokenizer.from_pretrained(
self.model_name,
trust_remote_code=True,
cache_dir=os.path.join(
MODEL_CACHE_DIR, self.model_name, "tokenizer"
),
clean_up_tokenization_spaces=True,
)
tokenizer.save_pretrained(path)
self.requestor.send_data(
UPDATE_MODEL_STATE,
{
"model": f"{self.model_name}-{file_name}",
"state": ModelStatusTypesEnum.downloaded,
},
)
except Exception:
self.requestor.send_data(
UPDATE_MODEL_STATE,
{
"model": f"{self.model_name}-{file_name}",
"state": ModelStatusTypesEnum.error,
},
)
def _load_model_and_utils(self):
if self.runner is None:
if self.downloader:
self.downloader.wait_for_download()
tokenizer_path = os.path.join(
f"{MODEL_CACHE_DIR}/{self.model_name}/tokenizer"
)
self.tokenizer = AutoTokenizer.from_pretrained(
self.model_name,
cache_dir=tokenizer_path,
trust_remote_code=True,
clean_up_tokenization_spaces=True,
)
self.runner = ONNXModelRunner(
os.path.join(self.download_path, self.model_file),
self.device,
self.model_size,
)
def _preprocess_image(self, image_data: bytes | Image.Image) -> np.ndarray:
"""
Manually preprocess a single image from bytes or PIL.Image to (3, 512, 512).
"""
if isinstance(image_data, bytes):
image = Image.open(io.BytesIO(image_data))
else:
image = image_data
if image.mode != "RGB":
image = image.convert("RGB")
image = image.resize((512, 512), Image.Resampling.LANCZOS)
# Convert to numpy array, normalize to [0, 1], and transpose to (channels, height, width)
image_array = np.array(image, dtype=np.float32) / 255.0
image_array = np.transpose(image_array, (2, 0, 1)) # (H, W, C) -> (C, H, W)
return image_array
def _preprocess_inputs(self, raw_inputs):
"""
Preprocess inputs into a list of real input tensors (no dummies).
- For text: Returns list of input_ids.
- For vision: Returns list of pixel_values.
"""
if not isinstance(raw_inputs, list):
raw_inputs = [raw_inputs]
processed = []
if self.embedding_type == "text":
for text in raw_inputs:
input_ids = self.tokenizer([text], return_tensors="np")["input_ids"]
processed.append(input_ids)
elif self.embedding_type == "vision":
for img in raw_inputs:
pixel_values = self._preprocess_image(img)
processed.append(
pixel_values[np.newaxis, ...]
) # Add batch dim: (1, 3, 512, 512)
else:
raise ValueError(
f"Invalid embedding_type: {self.embedding_type}. Must be 'text' or 'vision'."
)
return processed
def _postprocess_outputs(self, outputs):
"""
Process ONNX model outputs, truncating each embedding in the array to truncate_dim.
- outputs: NumPy array of embeddings.
- Returns: List of truncated embeddings.
"""
# size of vector in database
truncate_dim = 768
# jina v2 defaults to 1024 and uses Matryoshka representation, so
# truncating only causes an extremely minor decrease in retrieval accuracy
if outputs.shape[-1] > truncate_dim:
outputs = outputs[..., :truncate_dim]
return outputs
def __call__(
self, inputs: list[str] | list[Image.Image] | list[str], embedding_type=None
) -> list[np.ndarray]:
self.embedding_type = embedding_type
if not self.embedding_type:
raise ValueError(
"embedding_type must be specified either in __init__ or __call__"
)
self._load_model_and_utils()
processed = self._preprocess_inputs(inputs)
batch_size = len(processed)
# Prepare ONNX inputs with matching batch sizes
onnx_inputs = {}
if self.embedding_type == "text":
onnx_inputs["input_ids"] = np.stack([x[0] for x in processed])
onnx_inputs["pixel_values"] = np.zeros(
(batch_size, 3, 512, 512), dtype=np.float32
)
elif self.embedding_type == "vision":
onnx_inputs["input_ids"] = np.zeros((batch_size, 16), dtype=np.int64)
onnx_inputs["pixel_values"] = np.stack([x[0] for x in processed])
else:
raise ValueError("Invalid embedding type")
# Run inference
outputs = self.runner.run(onnx_inputs)
if self.embedding_type == "text":
embeddings = outputs[2] # text embeddings
elif self.embedding_type == "vision":
embeddings = outputs[3] # image embeddings
else:
raise ValueError("Invalid embedding type")
embeddings = self._postprocess_outputs(embeddings)
return [embedding for embedding in embeddings]
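For context, the `Embeddings` class above drives this model through lambdas that fix `embedding_type` per call. A hedged usage sketch, assuming Frigate's IPC is running (so `InterProcessRequestor` can connect) and that a thumbnail file exists on disk:

```python
from frigate.comms.inter_process import InterProcessRequestor
from frigate.embeddings.onnx.jina_v2_embedding import JinaV2Embedding

# Sketch only: requires a running Frigate IPC endpoint and a local "thumbnail.jpg".
requestor = InterProcessRequestor()
embedding = JinaV2Embedding(model_size="large", requestor=requestor, device="GPU")

# One model instance serves both modalities; the caller selects the branch per call.
text_vectors = embedding(["red car in the driveway"], embedding_type="text")
with open("thumbnail.jpg", "rb") as f:
    image_vectors = embedding([f.read()], embedding_type="vision")  # bytes or PIL.Image
```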

@@ -66,14 +66,9 @@ class ONNXModelRunner:
def run(self, input: dict[str, Any]) -> Any:
if self.type == "ov":
infer_request = self.interpreter.create_infer_request()
input_tensor = list(input.values())
if len(input_tensor) == 1:
input_tensor = ov.Tensor(array=input_tensor[0])
else:
input_tensor = ov.Tensor(array=input_tensor)
outputs = infer_request.infer(input)
infer_request.infer(input_tensor)
return [infer_request.get_output_tensor().data]
return outputs
elif self.type == "ort":
return self.ort.run(None, input)

@@ -267,20 +267,41 @@ export default function Explore() {
// model states
const { payload: textModelState } = useModelState(
"jinaai/jina-clip-v1-text_model_fp16.onnx",
);
const { payload: textTokenizerState } = useModelState(
"jinaai/jina-clip-v1-tokenizer",
);
const modelFile =
config?.semantic_search.model_size === "large"
? "jinaai/jina-clip-v1-vision_model_fp16.onnx"
: "jinaai/jina-clip-v1-vision_model_quantized.onnx";
const modelVersion = config?.semantic_search.model || "jinav1";
const modelSize = config?.semantic_search.model_size || "small";
const { payload: visionModelState } = useModelState(modelFile);
// Text model state
const { payload: textModelState } = useModelState(
modelVersion === "jinav1"
? "jinaai/jina-clip-v1-text_model_fp16.onnx"
: modelSize === "large"
? "jinaai/jina-clip-v2-model_fp16.onnx"
: "jinaai/jina-clip-v2-model_quantized.onnx",
);
// Tokenizer state
const { payload: textTokenizerState } = useModelState(
modelVersion === "jinav1"
? "jinaai/jina-clip-v1-tokenizer"
: "jinaai/jina-clip-v2-tokenizer",
);
// Vision model state (same as text model for jinav2)
const visionModelFile =
modelVersion === "jinav1"
? modelSize === "large"
? "jinaai/jina-clip-v1-vision_model_fp16.onnx"
: "jinaai/jina-clip-v1-vision_model_quantized.onnx"
: modelSize === "large"
? "jinaai/jina-clip-v2-model_fp16.onnx"
: "jinaai/jina-clip-v2-model_quantized.onnx";
const { payload: visionModelState } = useModelState(visionModelFile);
// Preprocessor/feature extractor state
const { payload: visionFeatureExtractorState } = useModelState(
"jinaai/jina-clip-v1-preprocessor_config.json",
modelVersion === "jinav1"
? "jinaai/jina-clip-v1-preprocessor_config.json"
: "jinaai/jina-clip-v2-preprocessor_config.json",
);
const allModelsLoaded = useMemo(() => {

@@ -20,6 +20,7 @@ export interface BirdseyeConfig {
width: number;
}
export type SearchModel = "jinav1" | "jinav2";
export type SearchModelSize = "small" | "large";
export interface CameraConfig {
@@ -458,6 +459,7 @@ export interface FrigateConfig {
semantic_search: {
enabled: boolean;
reindex: boolean;
model: SearchModel;
model_size: SearchModelSize;
};