Audio transcription support (#18398)

* install new packages for transcription support

* add config options

* audio maintainer modifications to support transcription

* pass main config to audio process

* embeddings support

* api and transcription post processor

* embeddings maintainer support for post processor

* live audio transcription with sherpa and faster-whisper

* update dispatcher with live transcription topic

* frontend websocket

* frontend live transcription

* frontend changes for speech events

* i18n changes

* docs

* mqtt docs

* fix linter

* use float16 and small model on gpu for real-time

* fix return value and use requestor to embed description instead of passing embeddings

* run real-time transcription in its own thread

* tweaks

* publish live transcriptions on their own topic instead of tracked_object_update

* config validator and docs

* clarify docs
Josh Hawkins 2025-05-27 10:26:00 -05:00 committed by GitHub
parent 512b7e16e1
commit 2bd6fa53fe
29 changed files with 2322 additions and 51 deletions

View File

@ -71,3 +71,8 @@ prometheus-client == 0.21.*
# TFLite
tflite_runtime @ https://github.com/frigate-nvr/TFlite-builds/releases/download/v2.17.1/tflite_runtime-2.17.1-cp311-cp311-linux_x86_64.whl; platform_machine == 'x86_64'
tflite_runtime @ https://github.com/feranick/TFlite-builds/releases/download/v2.17.1/tflite_runtime-2.17.1-cp311-cp311-linux_aarch64.whl; platform_machine == 'aarch64'
# audio transcription
sherpa-onnx==1.12.*
faster-whisper==1.1.*
librosa==0.11.*
soundfile==0.13.*

View File

@ -72,3 +72,77 @@ audio:
- speech
- yell
```
### Audio Transcription
Frigate supports fully local audio transcription using either `sherpa-onnx` or OpenAI's open-source Whisper models via `faster-whisper`. To enable transcription, it is recommended to configure the feature's options at the global level and enable it only on individual cameras at the camera level.
```yaml
audio_transcription:
enabled: False
device: ...
model_size: ...
```
Enable audio transcription for select cameras at the camera level:
```yaml
cameras:
back_yard:
...
audio_transcription:
enabled: True
```
:::note
Audio detection must be enabled and configured as described above in order to use audio transcription features.
:::
The optional config parameters that can be set at the global level include:
- **`enabled`**: Enable or disable the audio transcription feature.
- Default: `False`
- It is recommended to configure the feature at the global level and enable it at the individual camera level.
- **`device`**: Device to use to run transcription and translation models.
- Default: `CPU`
- This can be `CPU` or `GPU`. The `sherpa-onnx` models are lightweight and run on the CPU only. The `whisper` models can run on a GPU, but GPU support is limited to CUDA hardware.
- **`model_size`**: The size of the model used for live transcription.
- Default: `small`
- This can be `small` or `large`. The `small` setting uses `sherpa-onnx` models that are fast, lightweight, and always run on the CPU but are not as accurate as the `whisper` model.
- The `large` setting uses a `whisper` model that is more accurate than the `sherpa-onnx` models, but slower, especially on the CPU.
- This config option applies to **live transcription only**. Recorded `speech` events always use a different `whisper` model (which can be accelerated on CUDA hardware with `device: GPU`).
- **`language`**: Defines the language used by `whisper` to translate `speech` audio events (and live audio only if using the `large` model).
- Default: `en`
- You must use a valid [language code](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
- Transcriptions for `speech` events are translated.
- Live audio is translated only if you are using the `large` model. The `small` `sherpa-onnx` model is English-only.
The only field that is valid at the camera level is `enabled`.
#### Live transcription
The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role. Use the Enable/Disable Live Audio Transcription button/switch to toggle transcription processing. When speech is heard, the UI overlays a translucent black box containing the transcribed text on the camera stream. The MQTT topic `frigate/<camera_name>/audio/transcription` is also updated in real time with the transcribed text.
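Outside of the UI, the same topics can be driven from a script. Below is a minimal sketch (not part of this change) using `paho-mqtt` 1.x; the broker address and the default `frigate` topic prefix are assumptions, and `back_yard` is just an example camera name:
```python
import paho.mqtt.client as mqtt

CAMERA = "back_yard"  # example camera name

def on_message(client, userdata, msg):
    # transcribed text arrives as a plain UTF-8 string payload
    print(f"{msg.topic}: {msg.payload.decode()}")

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker.local", 1883)  # assumed broker address

# live transcription text published by Frigate for this camera
client.subscribe(f"frigate/{CAMERA}/audio/transcription")

# toggle live transcription on (requires audio_transcription to be
# enabled in the config for this camera)
client.publish(f"frigate/{CAMERA}/audio_transcription/set", "ON")

client.loop_forever()
```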
Results can be error-prone due to a number of factors, including:
- Poor quality camera microphone
- Distance of the audio source to the camera microphone
- Low audio bitrate setting in the camera
- Background noise
- Using the `small` model - it's fast, but not accurate for poor quality audio
For speech sources close to the camera with minimal background noise, use the `small` model.
If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate. Using the `large` model with CPU will likely be too slow for real-time transcription.
#### Transcription and translation of `speech` audio events
Any `speech` events in Explore can be transcribed and/or translated through the Transcribe button in the Tracked Object Details pane.
In order to use transcription and translation for past events, you must enable audio detection and define `speech` as an audio type to listen for in your config. To have `speech` events translated into the language of your choice, set the `language` config parameter with the correct [language code](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
The transcribed/translated speech will appear in the description box in the Tracked Object Details pane. If Semantic Search is enabled, embeddings are generated for the transcription text and are fully searchable using the description search type.
Recorded `speech` events will always use a `whisper` model, regardless of the `model_size` config setting. Without a GPU, generating transcriptions for longer `speech` events may take a fair amount of time, so be patient.
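The Transcribe button calls the new `PUT /audio/transcribe` endpoint (see the API changes below), which can also be used directly. A minimal sketch with `requests`, assuming the Frigate API is reachable at the address shown and using a placeholder event id:
```python
import requests

BASE_URL = "http://frigate.local:5000/api"  # assumed Frigate API address
EVENT_ID = "1748358000.123456-abc123"       # placeholder speech event id

resp = requests.put(f"{BASE_URL}/audio/transcribe", json={"event_id": EVENT_ID})

if resp.status_code == 202:
    print("Transcription started; the event description will update when it finishes.")
elif resp.status_code == 409:
    print("Another speech event is already being transcribed; try again later.")
else:
    # 400: transcription not enabled for the camera, 404: unknown event, 500: failure
    print(resp.json().get("message"))
```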

View File

@ -620,6 +620,19 @@ genai:
object_prompts:
person: "My special person prompt."
# Optional: Configuration for audio transcription
# NOTE: only the enabled option can be overridden at the camera level
audio_transcription:
# Optional: Enable audio transcription (default: shown below)
enabled: False
# Optional: The device to run the models on (default: shown below)
device: CPU
# Optional: Set the model size used for transcription. (default: shown below)
model_size: small
# Optional: Set the language used for transcription translation. (default: shown below)
# List of language codes: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10
language: en
# Optional: Restream configuration
# Uses https://github.com/AlexxIT/go2rtc (v1.9.9)
# NOTE: The default go2rtc API port (1984) must be used,

View File

@ -125,7 +125,7 @@ Message published for updates to tracked object metadata, for example:
"name": "John",
"score": 0.95,
"camera": "front_door_cam",
"timestamp": 1607123958.748393,
"timestamp": 1607123958.748393
}
```
@ -139,7 +139,7 @@ Message published for updates to tracked object metadata, for example:
"plate": "123ABC",
"score": 0.95,
"camera": "driveway_cam",
"timestamp": 1607123958.748393,
"timestamp": 1607123958.748393
}
```
@ -255,6 +255,12 @@ Publishes the rms value for audio detected on this camera.
**NOTE:** Requires audio detection to be enabled
### `frigate/<camera_name>/audio/transcription`
Publishes transcribed text for audio detected on this camera.
**NOTE:** Requires audio detection and transcription to be enabled
### `frigate/<camera_name>/enabled/set`
Topic to turn Frigate's processing of a camera on and off. Expected values are `ON` and `OFF`.

View File

@ -14,7 +14,10 @@ from peewee import DoesNotExist
from playhouse.shortcuts import model_to_dict
from frigate.api.auth import require_role
from frigate.api.defs.request.classification_body import RenameFaceBody
from frigate.api.defs.request.classification_body import (
AudioTranscriptionBody,
RenameFaceBody,
)
from frigate.api.defs.tags import Tags
from frigate.config.camera import DetectConfig
from frigate.const import FACE_DIR
@ -366,3 +369,58 @@ def reindex_embeddings(request: Request):
},
status_code=500,
)
@router.put("/audio/transcribe")
def transcribe_audio(request: Request, body: AudioTranscriptionBody):
event_id = body.event_id
try:
event = Event.get(Event.id == event_id)
except DoesNotExist:
message = f"Event {event_id} not found"
logger.error(message)
return JSONResponse(
content=({"success": False, "message": message}), status_code=404
)
if not request.app.frigate_config.cameras[event.camera].audio_transcription.enabled:
message = f"Audio transcription is not enabled for {event.camera}."
logger.error(message)
return JSONResponse(
content=(
{
"success": False,
"message": message,
}
),
status_code=400,
)
context: EmbeddingsContext = request.app.embeddings
response = context.transcribe_audio(model_to_dict(event))
if response == "started":
return JSONResponse(
content={
"success": True,
"message": "Audio transcription has started.",
},
status_code=202, # 202 Accepted
)
elif response == "in_progress":
return JSONResponse(
content={
"success": False,
"message": "Audio transcription for a speech event is currently in progress. Try again later.",
},
status_code=409, # 409 Conflict
)
else:
return JSONResponse(
content={
"success": False,
"message": "Failed to transcribe audio.",
},
status_code=500,
)

View File

@ -3,3 +3,7 @@ from pydantic import BaseModel
class RenameFaceBody(BaseModel):
new_name: str
class AudioTranscriptionBody(BaseModel):
event_id: str

View File

@ -494,7 +494,9 @@ class FrigateApp:
]
if audio_cameras:
self.audio_process = AudioProcessor(audio_cameras, self.camera_metrics)
self.audio_process = AudioProcessor(
self.config, audio_cameras, self.camera_metrics
)
self.audio_process.start()
self.processes["audio_detector"] = self.audio_process.pid or 0

View File

@ -58,6 +58,7 @@ class Dispatcher:
self._camera_settings_handlers: dict[str, Callable] = {
"audio": self._on_audio_command,
"audio_transcription": self._on_audio_transcription_command,
"detect": self._on_detect_command,
"enabled": self._on_enabled_command,
"improve_contrast": self._on_motion_improve_contrast_command,
@ -181,6 +182,9 @@ class Dispatcher:
"snapshots": self.config.cameras[camera].snapshots.enabled,
"record": self.config.cameras[camera].record.enabled,
"audio": self.config.cameras[camera].audio.enabled,
"audio_transcription": self.config.cameras[
camera
].audio_transcription.live_enabled,
"notifications": self.config.cameras[camera].notifications.enabled,
"notifications_suspended": int(
self.web_push_client.suspended_cameras.get(camera, 0)
@ -465,6 +469,37 @@ class Dispatcher:
)
self.publish(f"{camera_name}/audio/state", payload, retain=True)
def _on_audio_transcription_command(self, camera_name: str, payload: str) -> None:
"""Callback for live audio transcription topic."""
audio_transcription_settings = self.config.cameras[
camera_name
].audio_transcription
if payload == "ON":
if not self.config.cameras[
camera_name
].audio_transcription.enabled_in_config:
logger.error(
"Audio transcription must be enabled in the config to be turned on via MQTT."
)
return
if not audio_transcription_settings.live_enabled:
logger.info(f"Turning on live audio transcription for {camera_name}")
audio_transcription_settings.live_enabled = True
elif payload == "OFF":
if audio_transcription_settings.live_enabled:
logger.info(f"Turning off live audio transcription for {camera_name}")
audio_transcription_settings.live_enabled = False
self.config_updater.publish_update(
CameraConfigUpdateTopic(
CameraConfigUpdateEnum.audio_transcription, camera_name
),
audio_transcription_settings,
)
self.publish(f"{camera_name}/audio_transcription/state", payload, retain=True)
def _on_recordings_command(self, camera_name: str, payload: str) -> None:
"""Callback for recordings topic."""
record_settings = self.config.cameras[camera_name].record

View File

@ -18,6 +18,7 @@ class EmbeddingsRequestEnum(Enum):
reprocess_face = "reprocess_face"
reprocess_plate = "reprocess_plate"
reindex = "reindex"
transcribe_audio = "transcribe_audio"
class EmbeddingsResponder:

View File

@ -19,6 +19,7 @@ from frigate.util.builtin import (
from ..base import FrigateBaseModel
from ..classification import (
AudioTranscriptionConfig,
CameraFaceRecognitionConfig,
CameraLicensePlateRecognitionConfig,
)
@ -56,6 +57,9 @@ class CameraConfig(FrigateBaseModel):
audio: AudioConfig = Field(
default_factory=AudioConfig, title="Audio events configuration."
)
audio_transcription: AudioTranscriptionConfig = Field(
default_factory=AudioTranscriptionConfig, title="Audio transcription config."
)
birdseye: BirdseyeCameraConfig = Field(
default_factory=BirdseyeCameraConfig, title="Birdseye camera configuration."
)

View File

@ -12,6 +12,7 @@ class CameraConfigUpdateEnum(str, Enum):
"""Supported camera config update types."""
audio = "audio"
audio_transcription = "audio_transcription"
birdseye = "birdseye"
detect = "detect"
enabled = "enabled"
@ -74,6 +75,8 @@ class CameraConfigUpdateSubscriber:
if update_type == CameraConfigUpdateEnum.audio:
config.audio = updated_config
elif update_type == CameraConfigUpdateEnum.audio_transcription:
config.audio_transcription = updated_config
elif update_type == CameraConfigUpdateEnum.birdseye:
config.birdseye = updated_config
elif update_type == CameraConfigUpdateEnum.detect:

View File

@ -19,11 +19,32 @@ class SemanticSearchModelEnum(str, Enum):
jinav2 = "jinav2"
class LPRDeviceEnum(str, Enum):
class EnrichmentsDeviceEnum(str, Enum):
GPU = "GPU"
CPU = "CPU"
class AudioTranscriptionConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable audio transcription.")
language: str = Field(
default="en",
title="Language abbreviation to use for audio event transcription/translation.",
)
device: Optional[EnrichmentsDeviceEnum] = Field(
default=EnrichmentsDeviceEnum.CPU,
title="The device used for license plate recognition.",
)
model_size: str = Field(
default="small", title="The size of the embeddings model used."
)
enabled_in_config: Optional[bool] = Field(
default=None, title="Keep track of original state of camera."
)
live_enabled: Optional[bool] = Field(
default=False, title="Enable live transcriptions."
)
class BirdClassificationConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable bird classification.")
threshold: float = Field(
@ -144,8 +165,8 @@ class CameraFaceRecognitionConfig(FrigateBaseModel):
class LicensePlateRecognitionConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable license plate recognition.")
device: Optional[LPRDeviceEnum] = Field(
default=LPRDeviceEnum.CPU,
device: Optional[EnrichmentsDeviceEnum] = Field(
default=EnrichmentsDeviceEnum.CPU,
title="The device used for license plate recognition.",
)
model_size: str = Field(

View File

@ -54,6 +54,7 @@ from .camera.snapshots import SnapshotsConfig
from .camera.timestamp import TimestampStyleConfig
from .camera_group import CameraGroupConfig
from .classification import (
AudioTranscriptionConfig,
ClassificationConfig,
FaceRecognitionConfig,
LicensePlateRecognitionConfig,
@ -419,6 +420,9 @@ class FrigateConfig(FrigateBaseModel):
)
# Classification Config
audio_transcription: AudioTranscriptionConfig = Field(
default_factory=AudioTranscriptionConfig, title="Audio transcription config."
)
classification: ClassificationConfig = Field(
default_factory=ClassificationConfig, title="Object classification config."
)
@ -472,6 +476,7 @@ class FrigateConfig(FrigateBaseModel):
global_config = self.model_dump(
include={
"audio": ...,
"audio_transcription": ...,
"birdseye": ...,
"face_recognition": ...,
"lpr": ...,
@ -528,6 +533,7 @@ class FrigateConfig(FrigateBaseModel):
allowed_fields_map = {
"face_recognition": ["enabled", "min_area"],
"lpr": ["enabled", "expire_time", "min_area", "enhancement"],
"audio_transcription": ["enabled", "live_enabled"],
}
for section in allowed_fields_map:
@ -609,6 +615,9 @@ class FrigateConfig(FrigateBaseModel):
# set config pre-value
camera_config.enabled_in_config = camera_config.enabled
camera_config.audio.enabled_in_config = camera_config.audio.enabled
camera_config.audio_transcription.enabled_in_config = (
camera_config.audio_transcription.enabled
)
camera_config.record.enabled_in_config = camera_config.record.enabled
camera_config.notifications.enabled_in_config = (
camera_config.notifications.enabled
@ -701,6 +710,21 @@ class FrigateConfig(FrigateBaseModel):
self.model.create_colormap(sorted(self.objects.all_objects))
self.model.check_and_load_plus_model(self.plus_api)
# Check audio transcription and audio detection requirements
if self.audio_transcription.enabled:
# If audio transcription is enabled globally, at least one camera must have audio detection enabled
if not any(camera.audio.enabled for camera in self.cameras.values()):
raise ValueError(
"Audio transcription is enabled globally, but no cameras have audio detection enabled. At least one camera must have audio detection enabled."
)
else:
# If audio transcription is disabled globally, check each camera with audio_transcription enabled
for camera in self.cameras.values():
if camera.audio_transcription.enabled and not camera.audio.enabled:
raise ValueError(
f"Camera {camera.name} has audio transcription enabled, but audio detection is not enabled for this camera. Audio detection must be enabled for cameras with audio transcription when it is disabled globally."
)
if self.plus_api and not self.snapshots.clean_copy:
logger.warning(
"Frigate+ is configured but clean snapshots are not enabled, submissions to Frigate+ will not be possible./"

View File

@ -0,0 +1,212 @@
"""Handle post-processing for audio transcription."""
import logging
import os
import threading
import time
from typing import Optional
from faster_whisper import WhisperModel
from peewee import DoesNotExist
from frigate.comms.embeddings_updater import EmbeddingsRequestEnum
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import FrigateConfig
from frigate.const import (
CACHE_DIR,
MODEL_CACHE_DIR,
UPDATE_EVENT_DESCRIPTION,
)
from frigate.data_processing.types import PostProcessDataEnum
from frigate.types import TrackedObjectUpdateTypesEnum
from frigate.util.audio import get_audio_from_recording
from ..types import DataProcessorMetrics
from .api import PostProcessorApi
logger = logging.getLogger(__name__)
class AudioTranscriptionPostProcessor(PostProcessorApi):
def __init__(
self,
config: FrigateConfig,
requestor: InterProcessRequestor,
metrics: DataProcessorMetrics,
):
super().__init__(config, metrics, None)
self.config = config
self.requestor = requestor
self.recognizer = None
self.transcription_lock = threading.Lock()
self.transcription_thread = None
self.transcription_running = False
# faster-whisper handles model downloading automatically
self.model_path = os.path.join(MODEL_CACHE_DIR, "whisper")
os.makedirs(self.model_path, exist_ok=True)
self.__build_recognizer()
def __build_recognizer(self) -> None:
try:
self.recognizer = WhisperModel(
model_size_or_path="small",
device="cuda"
if self.config.audio_transcription.device == "GPU"
else "cpu",
download_root=self.model_path,
local_files_only=False, # Allow downloading if not cached
compute_type="int8",
)
logger.debug("Audio transcription (recordings) initialized")
except Exception as e:
logger.error(f"Failed to initialize recordings audio transcription: {e}")
self.recognizer = None
def process_data(
self, data: dict[str, any], data_type: PostProcessDataEnum
) -> None:
"""Transcribe audio from a recording.
Args:
data (dict): Contains data about the input (event_id, camera, etc.).
data_type (enum): Describes the data being processed (recording or tracked_object).
Returns:
None
"""
event_id = data["event_id"]
camera_name = data["camera"]
if data_type == PostProcessDataEnum.recording:
start_ts = data["frame_time"]
recordings_available_through = data["recordings_available"]
end_ts = min(recordings_available_through, start_ts + 60) # Default 60s
elif data_type == PostProcessDataEnum.tracked_object:
obj_data = data["event"]["data"]
obj_data["id"] = data["event"]["id"]
obj_data["camera"] = data["event"]["camera"]
start_ts = data["event"]["start_time"]
end_ts = data["event"].get(
"end_time", start_ts + 60
) # Use end_time if available
else:
logger.error("No data type passed to audio transcription post-processing")
return
try:
audio_data = get_audio_from_recording(
self.config.cameras[camera_name].ffmpeg,
camera_name,
start_ts,
end_ts,
sample_rate=16000,
)
if not audio_data:
logger.debug(f"No audio data extracted for {event_id}")
return
transcription = self.__transcribe_audio(audio_data)
if not transcription:
logger.debug("No transcription generated from audio")
return
logger.debug(f"Transcribed audio for {event_id}: '{transcription}'")
self.requestor.send_data(
UPDATE_EVENT_DESCRIPTION,
{
"type": TrackedObjectUpdateTypesEnum.description,
"id": event_id,
"description": transcription,
"camera": camera_name,
},
)
# Embed the description
self.requestor.send_data(
EmbeddingsRequestEnum.embed_description.value,
{"id": event_id, "description": transcription},
)
except DoesNotExist:
logger.debug("No recording found for audio transcription post-processing")
return
except Exception as e:
logger.error(f"Error in audio transcription post-processing: {e}")
def __transcribe_audio(self, audio_data: bytes) -> Optional[str]:
"""Transcribe WAV audio data using faster-whisper."""
if not self.recognizer:
logger.debug("Recognizer not initialized")
return None
try:
# Save audio data to a temporary wav (faster-whisper expects a file)
temp_wav = os.path.join(CACHE_DIR, f"temp_audio_{int(time.time())}.wav")
with open(temp_wav, "wb") as f:
f.write(audio_data)
segments, info = self.recognizer.transcribe(
temp_wav,
language=self.config.audio_transcription.language,
beam_size=5,
)
os.remove(temp_wav)
# Combine all segment texts
text = " ".join(segment.text.strip() for segment in segments)
if not text:
return None
logger.debug(
"Detected language '%s' with probability %f"
% (info.language, info.language_probability)
)
return text
except Exception as e:
logger.error(f"Error transcribing audio: {e}")
return None
def _transcription_wrapper(self, event: dict[str, any]) -> None:
"""Wrapper to run transcription and reset running flag when done."""
try:
self.process_data(
{
"event_id": event["id"],
"camera": event["camera"],
"event": event,
},
PostProcessDataEnum.tracked_object,
)
finally:
with self.transcription_lock:
self.transcription_running = False
self.transcription_thread = None
def handle_request(self, topic: str, request_data: dict[str, any]) -> str | None:
if topic == "transcribe_audio":
event = request_data["event"]
with self.transcription_lock:
if self.transcription_running:
logger.warning(
"Audio transcription for a speech event is already running."
)
return "in_progress"
# Mark as running and start the thread
self.transcription_running = True
self.transcription_thread = threading.Thread(
target=self._transcription_wrapper, args=(event,), daemon=True
)
self.transcription_thread.start()
return "started"
return None

View File

@ -0,0 +1,276 @@
"""Handle processing audio for speech transcription using sherpa-onnx with FFmpeg pipe."""
import logging
import os
import queue
import threading
from typing import Optional
import numpy as np
import sherpa_onnx
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import CameraConfig, FrigateConfig
from frigate.const import MODEL_CACHE_DIR
from frigate.util.downloader import ModelDownloader
from ..types import DataProcessorMetrics
from .api import RealTimeProcessorApi
from .whisper_online import FasterWhisperASR, OnlineASRProcessor
logger = logging.getLogger(__name__)
class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
def __init__(
self,
config: FrigateConfig,
camera_config: CameraConfig,
requestor: InterProcessRequestor,
metrics: DataProcessorMetrics,
stop_event: threading.Event,
):
super().__init__(config, metrics)
self.config = config
self.camera_config = camera_config
self.requestor = requestor
self.recognizer = None
self.stream = None
self.online = None  # OnlineASRProcessor is only created for the large model
self.transcription_segments = []
self.audio_queue = queue.Queue()
self.stop_event = stop_event
if self.config.audio_transcription.model_size == "large":
self.asr = FasterWhisperASR(
modelsize="tiny",
device="cuda"
if self.config.audio_transcription.device == "GPU"
else "cpu",
lan=config.audio_transcription.language,
model_dir=os.path.join(MODEL_CACHE_DIR, "whisper"),
)
self.asr.use_vad() # Enable Silero VAD for low-RMS audio
else:
# small model as default
download_path = os.path.join(MODEL_CACHE_DIR, "sherpa-onnx")
HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co")
self.model_files = {
"encoder.onnx": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.onnx",
"decoder.onnx": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.onnx",
"joiner.onnx": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.onnx",
"tokens.txt": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt",
}
if not all(
os.path.exists(os.path.join(download_path, n))
for n in self.model_files.keys()
):
self.downloader = ModelDownloader(
model_name="sherpa-onnx",
download_path=download_path,
file_names=self.model_files.keys(),
download_func=self.__download_models,
complete_func=self.__build_recognizer,
)
self.downloader.ensure_model_files()
self.__build_recognizer()
def __download_models(self, path: str) -> None:
try:
file_name = os.path.basename(path)
ModelDownloader.download_from_url(self.model_files[file_name], path)
except Exception as e:
logger.error(f"Failed to download {path}: {e}")
def __build_recognizer(self) -> None:
try:
if self.config.audio_transcription.model_size == "large":
self.online = OnlineASRProcessor(
asr=self.asr,
)
else:
self.recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
tokens=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/tokens.txt"),
encoder=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/encoder.onnx"),
decoder=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/decoder.onnx"),
joiner=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/joiner.onnx"),
num_threads=2,
sample_rate=16000,
feature_dim=80,
enable_endpoint_detection=True,
rule1_min_trailing_silence=2.4,
rule2_min_trailing_silence=1.2,
rule3_min_utterance_length=300,
decoding_method="greedy_search",
provider="cpu",
)
self.stream = self.recognizer.create_stream()
logger.debug("Audio transcription (live) initialized")
except Exception as e:
logger.error(
f"Failed to initialize live streaming audio transcription: {e}"
)
self.recognizer = None
def __process_audio_stream(
self, audio_data: np.ndarray
) -> Optional[tuple[str, bool]]:
if (not self.recognizer or not self.stream) and not self.online:
logger.debug(
"Audio transcription (streaming) recognizer or stream not initialized"
)
return None
try:
if audio_data.dtype != np.float32:
audio_data = audio_data.astype(np.float32)
if audio_data.max() > 1.0 or audio_data.min() < -1.0:
audio_data = audio_data / 32768.0 # Normalize from int16
rms = float(np.sqrt(np.mean(np.absolute(np.square(audio_data)))))
logger.debug(f"Audio chunk size: {audio_data.size}, RMS: {rms:.4f}")
if self.config.audio_transcription.model_size == "large":
# large model
self.online.insert_audio_chunk(audio_data)
output = self.online.process_iter()
text = output[2].strip()
is_endpoint = text.endswith((".", "!", "?"))
if text:
self.transcription_segments.append(text)
concatenated_text = " ".join(self.transcription_segments)
logger.debug(f"Concatenated transcription: '{concatenated_text}'")
text = concatenated_text
else:
# small model
self.stream.accept_waveform(16000, audio_data)
while self.recognizer.is_ready(self.stream):
self.recognizer.decode_stream(self.stream)
text = self.recognizer.get_result(self.stream).strip()
is_endpoint = self.recognizer.is_endpoint(self.stream)
logger.debug(f"Transcription result: '{text}'")
if not text:
logger.debug("No transcription, returning")
return None
logger.debug(f"Endpoint detected: {is_endpoint}")
if is_endpoint and self.config.audio_transcription.model_size == "small":
# reset sherpa if we've reached an endpoint
self.recognizer.reset(self.stream)
return text, is_endpoint
except Exception as e:
logger.error(f"Error processing audio stream: {e}")
return None
def process_frame(self, obj_data: dict[str, any], frame: np.ndarray) -> None:
pass
def process_audio(self, obj_data: dict[str, any], audio: np.ndarray) -> bool | None:
if audio is None or audio.size == 0:
logger.debug("No audio data provided for transcription")
return None
# enqueue audio data for processing in the thread
self.audio_queue.put((obj_data, audio))
return None
def run(self) -> None:
"""Run method for the transcription thread to process queued audio data."""
logger.debug(
f"Starting audio transcription thread for {self.camera_config.name}"
)
while not self.stop_event.is_set():
try:
# Get audio data from queue with a timeout to check stop_event
obj_data, audio = self.audio_queue.get(timeout=0.1)
result = self.__process_audio_stream(audio)
if not result:
continue
text, is_endpoint = result
logger.debug(f"Transcribed audio: '{text}', Endpoint: {is_endpoint}")
self.requestor.send_data(
f"{self.camera_config.name}/audio/transcription", text
)
self.audio_queue.task_done()
if is_endpoint:
self.reset(obj_data["camera"])
except queue.Empty:
continue
except Exception as e:
logger.error(f"Error processing audio in thread: {e}")
self.audio_queue.task_done()
logger.debug(
f"Stopping audio transcription thread for {self.camera_config.name}"
)
def reset(self, camera: str) -> None:
if self.config.audio_transcription.model_size == "large":
# get final output from whisper
output = self.online.finish()
self.transcription_segments = []
self.requestor.send_data(
f"{self.camera_config.name}/audio/transcription",
(output[2].strip() + " "),
)
# reset whisper
self.online.init()
else:
# reset sherpa
self.recognizer.reset(self.stream)
# Clear the audio queue
while not self.audio_queue.empty():
try:
self.audio_queue.get_nowait()
self.audio_queue.task_done()
except queue.Empty:
break
logger.debug("Stream reset")
def stop(self) -> None:
"""Stop the transcription thread and clean up."""
self.stop_event.set()
# Clear the queue to prevent processing stale data
while not self.audio_queue.empty():
try:
self.audio_queue.get_nowait()
self.audio_queue.task_done()
except queue.Empty:
break
logger.debug(
f"Transcription thread stop signaled for {self.camera_config.name}"
)
def handle_request(
self, topic: str, request_data: dict[str, any]
) -> dict[str, any] | None:
if topic == "clear_audio_recognizer":
self.recognizer = None
self.stream = None
self.__build_recognizer()
return {"message": "Audio recognizer cleared and rebuilt", "success": True}
return None
def expire_object(self, object_id: str) -> None:
pass

File diff suppressed because it is too large

View File

@ -291,3 +291,8 @@ class EmbeddingsContext:
def reindex_embeddings(self) -> dict[str, Any]:
return self.requestor.send_data(EmbeddingsRequestEnum.reindex.value, {})
def transcribe_audio(self, event: dict[str, any]) -> dict[str, any]:
return self.requestor.send_data(
EmbeddingsRequestEnum.transcribe_audio.value, {"event": event}
)

View File

@ -37,6 +37,9 @@ from frigate.data_processing.common.license_plate.model import (
LicensePlateModelRunner,
)
from frigate.data_processing.post.api import PostProcessorApi
from frigate.data_processing.post.audio_transcription import (
AudioTranscriptionPostProcessor,
)
from frigate.data_processing.post.license_plate import (
LicensePlatePostProcessor,
)
@ -176,6 +179,14 @@ class EmbeddingMaintainer(threading.Thread):
)
)
if any(
c.enabled_in_config and c.audio_transcription.enabled
for c in self.config.cameras.values()
):
self.post_processors.append(
AudioTranscriptionPostProcessor(self.config, self.requestor, metrics)
)
self.stop_event = stop_event
self.tracked_events: dict[str, list[Any]] = {}
self.early_request_sent: dict[str, bool] = {}
@ -372,6 +383,8 @@ class EmbeddingMaintainer(threading.Thread):
},
PostProcessDataEnum.recording,
)
elif isinstance(processor, AudioTranscriptionPostProcessor):
continue
else:
processor.process_data(event_id, PostProcessDataEnum.event_id)

View File

@ -18,7 +18,7 @@ from frigate.comms.event_metadata_updater import (
EventMetadataTypeEnum,
)
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import CameraConfig, CameraInput, FfmpegConfig
from frigate.config import CameraConfig, CameraInput, FfmpegConfig, FrigateConfig
from frigate.config.camera.updater import (
CameraConfigUpdateEnum,
CameraConfigUpdateSubscriber,
@ -30,6 +30,9 @@ from frigate.const import (
AUDIO_MIN_CONFIDENCE,
AUDIO_SAMPLE_RATE,
)
from frigate.data_processing.real_time.audio_transcription import (
AudioTranscriptionRealTimeProcessor,
)
from frigate.ffmpeg_presets import parse_preset_input
from frigate.log import LogPipe
from frigate.object_detection.base import load_labels
@ -75,6 +78,7 @@ class AudioProcessor(util.Process):
def __init__(
self,
config: FrigateConfig,
cameras: list[CameraConfig],
camera_metrics: dict[str, CameraMetrics],
):
@ -82,6 +86,7 @@ class AudioProcessor(util.Process):
self.camera_metrics = camera_metrics
self.cameras = cameras
self.config = config
def run(self) -> None:
audio_threads: list[AudioEventMaintainer] = []
@ -94,6 +99,7 @@ class AudioProcessor(util.Process):
for camera in self.cameras:
audio_thread = AudioEventMaintainer(
camera,
self.config,
self.camera_metrics,
self.stop_event,
)
@ -122,46 +128,71 @@ class AudioEventMaintainer(threading.Thread):
def __init__(
self,
camera: CameraConfig,
config: FrigateConfig,
camera_metrics: dict[str, CameraMetrics],
stop_event: threading.Event,
) -> None:
super().__init__(name=f"{camera.name}_audio_event_processor")
self.config = camera
self.config = config
self.camera_config = camera
self.camera_metrics = camera_metrics
self.detections: dict[dict[str, Any]] = {}
self.stop_event = stop_event
self.detector = AudioTfl(stop_event, self.config.audio.num_threads)
self.detector = AudioTfl(stop_event, self.camera_config.audio.num_threads)
self.shape = (int(round(AUDIO_DURATION * AUDIO_SAMPLE_RATE)),)
self.chunk_size = int(round(AUDIO_DURATION * AUDIO_SAMPLE_RATE * 2))
self.logger = logging.getLogger(f"audio.{self.config.name}")
self.ffmpeg_cmd = get_ffmpeg_command(self.config.ffmpeg)
self.logpipe = LogPipe(f"ffmpeg.{self.config.name}.audio")
self.logger = logging.getLogger(f"audio.{self.camera_config.name}")
self.ffmpeg_cmd = get_ffmpeg_command(self.camera_config.ffmpeg)
self.logpipe = LogPipe(f"ffmpeg.{self.camera_config.name}.audio")
self.audio_listener = None
self.transcription_processor = None
self.transcription_thread = None
# create communication for audio detections
self.requestor = InterProcessRequestor()
self.config_subscriber = CameraConfigUpdateSubscriber(
{self.config.name: self.config},
[CameraConfigUpdateEnum.audio, CameraConfigUpdateEnum.enabled],
{self.camera_config.name: self.camera_config},
[
CameraConfigUpdateEnum.audio,
CameraConfigUpdateEnum.enabled,
CameraConfigUpdateEnum.audio_transcription,
],
)
self.detection_publisher = DetectionPublisher(DetectionTypeEnum.audio)
self.event_metadata_publisher = EventMetadataPublisher()
if self.camera_config.audio_transcription.enabled_in_config:
# init the transcription processor for this camera
self.transcription_processor = AudioTranscriptionRealTimeProcessor(
config=self.config,
camera_config=self.camera_config,
requestor=self.requestor,
metrics=self.camera_metrics[self.camera_config.name],
stop_event=self.stop_event,
)
self.transcription_thread = threading.Thread(
target=self.transcription_processor.run,
name=f"{self.camera_config.name}_transcription_processor",
daemon=True,
)
self.transcription_thread.start()
self.was_enabled = camera.enabled
def detect_audio(self, audio) -> None:
if not self.config.audio.enabled or self.stop_event.is_set():
if not self.camera_config.audio.enabled or self.stop_event.is_set():
return
audio_as_float = audio.astype(np.float32)
rms, dBFS = self.calculate_audio_levels(audio_as_float)
self.camera_metrics[self.config.name].audio_rms.value = rms
self.camera_metrics[self.config.name].audio_dBFS.value = dBFS
self.camera_metrics[self.camera_config.name].audio_rms.value = rms
self.camera_metrics[self.camera_config.name].audio_dBFS.value = dBFS
# only run audio detection when volume is above min_volume
if rms >= self.config.audio.min_volume:
if rms >= self.camera_config.audio.min_volume:
# create waveform relative to max range and look for detections
waveform = (audio / AUDIO_MAX_BIT_RANGE).astype(np.float32)
model_detections = self.detector.detect(waveform)
@ -169,28 +200,42 @@ class AudioEventMaintainer(threading.Thread):
for label, score, _ in model_detections:
self.logger.debug(
f"{self.config.name} heard {label} with a score of {score}"
f"{self.camera_config.name} heard {label} with a score of {score}"
)
if label not in self.config.audio.listen:
if label not in self.camera_config.audio.listen:
continue
if score > dict((self.config.audio.filters or {}).get(label, {})).get(
"threshold", 0.8
):
if score > dict(
(self.camera_config.audio.filters or {}).get(label, {})
).get("threshold", 0.8):
self.handle_detection(label, score)
audio_detections.append(label)
# send audio detection data
self.detection_publisher.publish(
(
self.config.name,
self.camera_config.name,
datetime.datetime.now().timestamp(),
dBFS,
audio_detections,
)
)
# run audio transcription
if self.transcription_processor is not None and (
self.camera_config.audio_transcription.live_enabled
):
self.transcribing = True
# process audio until we've reached the endpoint
self.transcription_processor.process_audio(
{
"id": f"{self.camera_config.name}_audio",
"camera": self.camera_config.name,
},
audio,
)
self.expire_detections()
def calculate_audio_levels(self, audio_as_float: np.float32) -> Tuple[float, float]:
@ -204,8 +249,8 @@ class AudioEventMaintainer(threading.Thread):
else:
dBFS = 0
self.requestor.send_data(f"{self.config.name}/audio/dBFS", float(dBFS))
self.requestor.send_data(f"{self.config.name}/audio/rms", float(rms))
self.requestor.send_data(f"{self.camera_config.name}/audio/dBFS", float(dBFS))
self.requestor.send_data(f"{self.camera_config.name}/audio/rms", float(rms))
return float(rms), float(dBFS)
@ -220,13 +265,13 @@ class AudioEventMaintainer(threading.Thread):
random.choices(string.ascii_lowercase + string.digits, k=6)
)
event_id = f"{now}-{rand_id}"
self.requestor.send_data(f"{self.config.name}/audio/{label}", "ON")
self.requestor.send_data(f"{self.camera_config.name}/audio/{label}", "ON")
self.event_metadata_publisher.publish(
EventMetadataTypeEnum.manual_event_create,
(
now,
self.config.name,
self.camera_config.name,
label,
event_id,
True,
@ -252,10 +297,10 @@ class AudioEventMaintainer(threading.Thread):
if (
now - detection.get("last_detection", now)
> self.config.audio.max_not_heard
> self.camera_config.audio.max_not_heard
):
self.requestor.send_data(
f"{self.config.name}/audio/{detection['label']}", "OFF"
f"{self.camera_config.name}/audio/{detection['label']}", "OFF"
)
self.event_metadata_publisher.publish(
@ -264,12 +309,21 @@ class AudioEventMaintainer(threading.Thread):
)
self.detections[detection["label"]] = None
# clear real-time transcription
if self.transcription_processor is not None:
self.transcription_processor.reset(self.camera_config.name)
self.requestor.send_data(
f"{self.camera_config.name}/audio/transcription", ""
)
def expire_all_detections(self) -> None:
"""Immediately end all current detections"""
now = datetime.datetime.now().timestamp()
for label, detection in list(self.detections.items()):
if detection:
self.requestor.send_data(f"{self.config.name}/audio/{label}", "OFF")
self.requestor.send_data(
f"{self.camera_config.name}/audio/{label}", "OFF"
)
self.event_metadata_publisher.publish(
EventMetadataTypeEnum.manual_event_end,
(detection["id"], now),
@ -290,7 +344,7 @@ class AudioEventMaintainer(threading.Thread):
if self.stop_event.is_set():
return
time.sleep(self.config.ffmpeg.retry_interval)
time.sleep(self.camera_config.ffmpeg.retry_interval)
self.logpipe.dump()
self.start_or_restart_ffmpeg()
@ -312,20 +366,20 @@ class AudioEventMaintainer(threading.Thread):
log_and_restart()
def run(self) -> None:
if self.config.enabled:
if self.camera_config.enabled:
self.start_or_restart_ffmpeg()
while not self.stop_event.is_set():
enabled = self.config.enabled
enabled = self.camera_config.enabled
if enabled != self.was_enabled:
if enabled:
self.logger.debug(
f"Enabling audio detections for {self.config.name}"
f"Enabling audio detections for {self.camera_config.name}"
)
self.start_or_restart_ffmpeg()
else:
self.logger.debug(
f"Disabling audio detections for {self.config.name}, ending events"
f"Disabling audio detections for {self.camera_config.name}, ending events"
)
self.expire_all_detections()
stop_ffmpeg(self.audio_listener, self.logger)
@ -344,6 +398,12 @@ class AudioEventMaintainer(threading.Thread):
if self.audio_listener:
stop_ffmpeg(self.audio_listener, self.logger)
if self.transcription_thread:
self.transcription_thread.join(timeout=2)
if self.transcription_thread.is_alive():
self.logger.warning(
f"Audio transcription thread {self.transcription_thread.name} is still alive"
)
self.logpipe.close()
self.requestor.stop()
self.config_subscriber.stop()

frigate/util/audio.py Normal file (116 lines)
View File

@ -0,0 +1,116 @@
"""Utilities for creating and manipulating audio."""
import logging
import os
import subprocess as sp
from typing import Optional
from pathvalidate import sanitize_filename
from frigate.const import CACHE_DIR
from frigate.models import Recordings
logger = logging.getLogger(__name__)
def get_audio_from_recording(
ffmpeg,
camera_name: str,
start_ts: float,
end_ts: float,
sample_rate: int = 16000,
) -> Optional[bytes]:
"""Extract audio from recording files between start_ts and end_ts in WAV format suitable for sherpa-onnx.
Args:
ffmpeg: FFmpeg configuration object
camera_name: Name of the camera
start_ts: Start timestamp in seconds
end_ts: End timestamp in seconds
sample_rate: Sample rate for output audio (default 16kHz for sherpa-onnx)
Returns:
Bytes of WAV audio data or None if extraction failed
"""
# Fetch all relevant recording segments
recordings = (
Recordings.select(
Recordings.path,
Recordings.start_time,
Recordings.end_time,
)
.where(
(Recordings.start_time.between(start_ts, end_ts))
| (Recordings.end_time.between(start_ts, end_ts))
| ((start_ts > Recordings.start_time) & (end_ts < Recordings.end_time))
)
.where(Recordings.camera == camera_name)
.order_by(Recordings.start_time.asc())
)
if not recordings:
logger.debug(
f"No recordings found for {camera_name} between {start_ts} and {end_ts}"
)
return None
# Generate concat playlist file
file_name = sanitize_filename(
f"audio_playlist_{camera_name}_{start_ts}-{end_ts}.txt"
)
file_path = os.path.join(CACHE_DIR, file_name)
try:
with open(file_path, "w") as file:
for clip in recordings:
file.write(f"file '{clip.path}'\n")
if clip.start_time < start_ts:
file.write(f"inpoint {int(start_ts - clip.start_time)}\n")
if clip.end_time > end_ts:
file.write(f"outpoint {int(end_ts - clip.start_time)}\n")
ffmpeg_cmd = [
ffmpeg.ffmpeg_path,
"-hide_banner",
"-loglevel",
"warning",
"-protocol_whitelist",
"pipe,file",
"-f",
"concat",
"-safe",
"0",
"-i",
file_path,
"-vn", # No video
"-acodec",
"pcm_s16le", # 16-bit PCM encoding
"-ar",
str(sample_rate),
"-ac",
"1", # Mono audio
"-f",
"wav",
"-",
]
process = sp.run(
ffmpeg_cmd,
capture_output=True,
)
if process.returncode == 0:
logger.debug(
f"Successfully extracted audio for {camera_name} from {start_ts} to {end_ts}"
)
return process.stdout
else:
logger.error(f"Failed to extract audio: {process.stderr.decode()}")
return None
except Exception as e:
logger.error(f"Error extracting audio from recordings: {e}")
return None
finally:
try:
os.unlink(file_path)
except OSError:
pass

View File

@ -103,12 +103,14 @@
"success": {
"regenerate": "A new description has been requested from {{provider}}. Depending on the speed of your provider, the new description may take some time to regenerate.",
"updatedSublabel": "Successfully updated sub label.",
"updatedLPR": "Successfully updated license plate."
"updatedLPR": "Successfully updated license plate.",
"audioTranscription": "Successfully requested audio transcription."
},
"error": {
"regenerate": "Failed to call {{provider}} for a new description: {{errorMessage}}",
"updatedSublabelFailed": "Failed to update sub label: {{errorMessage}}",
"updatedLPRFailed": "Failed to update license plate: {{errorMessage}}"
"updatedLPRFailed": "Failed to update license plate: {{errorMessage}}",
"audioTranscription": "Failed to request audio transcription: {{errorMessage}}"
}
}
},
@ -173,6 +175,10 @@
"label": "Find similar",
"aria": "Find similar tracked objects"
},
"audioTranscription": {
"label": "Transcribe",
"aria": "Request audio transcription"
},
"submitToPlus": {
"label": "Submit to Frigate+",
"aria": "Submit to Frigate Plus"

View File

@ -69,6 +69,10 @@
"enable": "Enable Audio Detect",
"disable": "Disable Audio Detect"
},
"transcription": {
"enable": "Enable Live Audio Transcription",
"disable": "Disable Live Audio Transcription"
},
"autotracking": {
"enable": "Enable Autotracking",
"disable": "Disable Autotracking"
@ -135,6 +139,7 @@
"recording": "Recording",
"snapshots": "Snapshots",
"audioDetection": "Audio Detection",
"transcription": "Audio Transcription",
"autotracking": "Autotracking"
},
"history": {

View File

@ -8,6 +8,7 @@ import {
FrigateReview,
ModelState,
ToggleableSetting,
TrackedObjectUpdateReturnType,
} from "@/types/ws";
import { FrigateStats } from "@/types/stats";
import { createContainer } from "react-tracked";
@ -60,6 +61,7 @@ function useValue(): useValueReturn {
enabled,
snapshots,
audio,
audio_transcription,
notifications,
notifications_suspended,
autotracking,
@ -71,6 +73,9 @@ function useValue(): useValueReturn {
cameraStates[`${name}/detect/state`] = detect ? "ON" : "OFF";
cameraStates[`${name}/snapshots/state`] = snapshots ? "ON" : "OFF";
cameraStates[`${name}/audio/state`] = audio ? "ON" : "OFF";
cameraStates[`${name}/audio_transcription/state`] = audio_transcription
? "ON"
: "OFF";
cameraStates[`${name}/notifications/state`] = notifications
? "ON"
: "OFF";
@ -220,6 +225,20 @@ export function useAudioState(camera: string): {
return { payload: payload as ToggleableSetting, send };
}
export function useAudioTranscriptionState(camera: string): {
payload: ToggleableSetting;
send: (payload: ToggleableSetting, retain?: boolean) => void;
} {
const {
value: { payload },
send,
} = useWs(
`${camera}/audio_transcription/state`,
`${camera}/audio_transcription/set`,
);
return { payload: payload as ToggleableSetting, send };
}
export function useAutotrackingState(camera: string): {
payload: ToggleableSetting;
send: (payload: ToggleableSetting, retain?: boolean) => void;
@ -421,6 +440,15 @@ export function useAudioActivity(camera: string): { payload: number } {
return { payload: payload as number };
}
export function useAudioLiveTranscription(camera: string): {
payload: string;
} {
const {
value: { payload },
} = useWs(`${camera}/audio/transcription`, "");
return { payload: payload as string };
}
export function useMotionThreshold(camera: string): {
payload: string;
send: (payload: number, retain?: boolean) => void;
@ -463,11 +491,16 @@ export function useImproveContrast(camera: string): {
return { payload: payload as ToggleableSetting, send };
}
export function useTrackedObjectUpdate(): { payload: string } {
export function useTrackedObjectUpdate(): {
payload: TrackedObjectUpdateReturnType;
} {
const {
value: { payload },
} = useWs("tracked_object_update", "");
return useDeepMemo(JSON.parse(payload as string));
const parsed = payload
? JSON.parse(payload as string)
: { type: "", id: "", camera: "" };
return { payload: useDeepMemo(parsed) };
}
export function useNotifications(camera: string): {

View File

@ -77,6 +77,7 @@ import { Trans, useTranslation } from "react-i18next";
import { TbFaceId } from "react-icons/tb";
import { useIsAdmin } from "@/hooks/use-is-admin";
import FaceSelectionDialog from "../FaceSelectionDialog";
import { CgTranscript } from "react-icons/cg";
const SEARCH_TABS = [
"details",
@ -709,6 +710,34 @@ function ObjectDetailsTab({
[search, t],
);
// speech transcription
const onTranscribe = useCallback(() => {
axios
.put(`/audio/transcribe`, { event_id: search.id })
.then((resp) => {
if (resp.status == 202) {
toast.success(t("details.item.toast.success.audioTranscription"), {
position: "top-center",
});
}
})
.catch((error) => {
const errorMessage =
error.response?.data?.message ||
error.response?.data?.detail ||
"Unknown error";
toast.error(
t("details.item.toast.error.audioTranscription", {
errorMessage,
}),
{
position: "top-center",
},
);
});
}, [search, t]);
return (
<div className="flex flex-col gap-5">
<div className="flex w-full flex-row">
@ -894,6 +923,16 @@ function ObjectDetailsTab({
</Button>
</FaceSelectionDialog>
)}
{config?.cameras[search?.camera].audio_transcription.enabled &&
search?.label == "speech" &&
search?.end_time && (
<Button className="w-full" onClick={onTranscribe}>
<div className="flex gap-1">
<CgTranscript />
{t("itemMenu.audioTranscription.label")}
</div>
</Button>
)}
</div>
</div>
</div>

View File

@ -246,15 +246,13 @@ export default function Explore() {
// mutation and revalidation
const trackedObjectUpdate = useTrackedObjectUpdate();
const { payload: wsUpdate } = useTrackedObjectUpdate();
useEffect(() => {
if (trackedObjectUpdate) {
if (wsUpdate && wsUpdate.type == "description") {
mutate();
}
// mutate / revalidate when event description updates come in
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [trackedObjectUpdate]);
}, [wsUpdate, mutate]);
// embeddings reindex progress

View File

@ -41,6 +41,11 @@ export interface CameraConfig {
min_volume: number;
num_threads: number;
};
audio_transcription: {
enabled: boolean;
enabled_in_config: boolean;
live_enabled: boolean;
};
best_image_timeout: number;
birdseye: {
enabled: boolean;
@ -296,6 +301,10 @@ export interface FrigateConfig {
num_threads: number;
};
audio_transcription: {
enabled: boolean;
};
birdseye: BirdseyeConfig;
cameras: {

View File

@ -58,6 +58,7 @@ export interface FrigateCameraState {
snapshots: boolean;
record: boolean;
audio: boolean;
audio_transcription: boolean;
notifications: boolean;
notifications_suspended: number;
autotracking: boolean;
@ -84,3 +85,21 @@ export type EmbeddingsReindexProgressType = {
};
export type ToggleableSetting = "ON" | "OFF";
export type TrackedObjectUpdateType =
| "description"
| "lpr"
| "transcription"
| "face";
export type TrackedObjectUpdateReturnType = {
type: TrackedObjectUpdateType;
id: string;
camera: string;
description?: string;
name?: string;
plate?: string;
score?: number;
timestamp?: number;
text?: string;
} | null;

View File

@ -74,13 +74,13 @@ export default function ExploreView({
}, {});
}, [events]);
const trackedObjectUpdate = useTrackedObjectUpdate();
const { payload: wsUpdate } = useTrackedObjectUpdate();
useEffect(() => {
mutate();
// mutate / revalidate when event description updates come in
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [trackedObjectUpdate]);
if (wsUpdate && wsUpdate.type == "description") {
mutate();
}
}, [wsUpdate, mutate]);
// update search detail when results change

View File

@ -1,5 +1,7 @@
import {
useAudioLiveTranscription,
useAudioState,
useAudioTranscriptionState,
useAutotrackingState,
useDetectState,
useEnabledState,
@ -90,6 +92,8 @@ import {
LuX,
} from "react-icons/lu";
import {
MdClosedCaption,
MdClosedCaptionDisabled,
MdNoPhotography,
MdOutlineRestartAlt,
MdPersonOff,
@ -196,6 +200,29 @@ export default function LiveCameraView({
const { payload: enabledState } = useEnabledState(camera.name);
const cameraEnabled = enabledState === "ON";
// for audio transcriptions
const { payload: audioTranscriptionState, send: sendTranscription } =
useAudioTranscriptionState(camera.name);
const { payload: transcription } = useAudioLiveTranscription(camera.name);
const transcriptionRef = useRef<HTMLDivElement>(null);
useEffect(() => {
if (transcription) {
if (transcriptionRef.current) {
transcriptionRef.current.scrollTop =
transcriptionRef.current.scrollHeight;
}
}
}, [transcription]);
useEffect(() => {
return () => {
// disable transcriptions when unmounting
if (audioTranscriptionState == "ON") sendTranscription("OFF");
};
}, [audioTranscriptionState, sendTranscription]);
// click overlay for ptzs
const [clickOverlay, setClickOverlay] = useState(false);
@ -566,6 +593,9 @@ export default function LiveCameraView({
autotrackingEnabled={
camera.onvif.autotracking.enabled_in_config
}
transcriptionEnabled={
camera.audio_transcription.enabled_in_config
}
fullscreen={fullscreen}
streamName={streamName ?? ""}
setStreamName={setStreamName}
@ -625,6 +655,16 @@ export default function LiveCameraView({
/>
</div>
</TransformComponent>
{camera?.audio?.enabled_in_config &&
audioTranscriptionState == "ON" &&
transcription != null && (
<div
ref={transcriptionRef}
className="text-md scrollbar-container absolute bottom-4 left-1/2 max-h-[15vh] w-[75%] -translate-x-1/2 overflow-y-auto rounded-lg bg-black/70 p-2 text-white md:w-[50%]"
>
{transcription}
</div>
)}
</div>
</div>
{camera.onvif.host != "" && (
@ -983,6 +1023,7 @@ type FrigateCameraFeaturesProps = {
recordingEnabled: boolean;
audioDetectEnabled: boolean;
autotrackingEnabled: boolean;
transcriptionEnabled: boolean;
fullscreen: boolean;
streamName: string;
setStreamName?: (value: string | undefined) => void;
@ -1002,6 +1043,7 @@ function FrigateCameraFeatures({
recordingEnabled,
audioDetectEnabled,
autotrackingEnabled,
transcriptionEnabled,
fullscreen,
streamName,
setStreamName,
@ -1033,6 +1075,8 @@ function FrigateCameraFeatures({
const { payload: audioState, send: sendAudio } = useAudioState(camera.name);
const { payload: autotrackingState, send: sendAutotracking } =
useAutotrackingState(camera.name);
const { payload: transcriptionState, send: sendTranscription } =
useAudioTranscriptionState(camera.name);
// roles
@ -1196,6 +1240,27 @@ function FrigateCameraFeatures({
disabled={!cameraEnabled}
/>
)}
{audioDetectEnabled && transcriptionEnabled && (
<CameraFeatureToggle
className="p-2 md:p-0"
variant={fullscreen ? "overlay" : "primary"}
Icon={
transcriptionState == "ON"
? MdClosedCaption
: MdClosedCaptionDisabled
}
isActive={transcriptionState == "ON"}
title={
transcriptionState == "ON"
? t("transcription.disable")
: t("transcription.enable")
}
onClick={() =>
sendTranscription(transcriptionState == "ON" ? "OFF" : "ON")
}
disabled={!cameraEnabled || audioState == "OFF"}
/>
)}
{autotrackingEnabled && (
<CameraFeatureToggle
className="p-2 md:p-0"
@ -1558,6 +1623,16 @@ function FrigateCameraFeatures({
}
/>
)}
{audioDetectEnabled && transcriptionEnabled && (
<FilterSwitch
label={t("cameraSettings.transcription")}
disabled={audioState == "OFF"}
isChecked={transcriptionState == "ON"}
onCheckedChange={() =>
sendTranscription(transcriptionState == "ON" ? "OFF" : "ON")
}
/>
)}
{autotrackingEnabled && (
<FilterSwitch
label={t("cameraSettings.autotracking")}