Audio transcription support (#18398)

* install new packages for transcription support

* add config options

* audio maintainer modifications to support transcription

* pass main config to audio process

* embeddings support

* api and transcription post processor

* embeddings maintainer support for post processor

* live audio transcription with sherpa and faster-whisper

* update dispatcher with live transcription topic

* frontend websocket

* frontend live transcription

* frontend changes for speech events

* i18n changes

* docs

* mqtt docs

* fix linter

* use float16 and small model on gpu for real-time

* fix return value and use requestor to embed description instead of passing embeddings

* run real-time transcription in its own thread

* tweaks

* publish live transcriptions on their own topic instead of tracked_object_update

* config validator and docs

* clarify docs
Josh Hawkins 2025-05-27 10:26:00 -05:00 committed by GitHub
parent 512b7e16e1
commit 2bd6fa53fe
29 changed files with 2322 additions and 51 deletions

View File

@ -71,3 +71,8 @@ prometheus-client == 0.21.*
# TFLite
tflite_runtime @ https://github.com/frigate-nvr/TFlite-builds/releases/download/v2.17.1/tflite_runtime-2.17.1-cp311-cp311-linux_x86_64.whl; platform_machine == 'x86_64'
tflite_runtime @ https://github.com/feranick/TFlite-builds/releases/download/v2.17.1/tflite_runtime-2.17.1-cp311-cp311-linux_aarch64.whl; platform_machine == 'aarch64'
# audio transcription
sherpa-onnx==1.12.*
faster-whisper==1.1.*
librosa==0.11.*
soundfile==0.13.*

View File

@ -72,3 +72,77 @@ audio:
- speech
- yell
```
### Audio Transcription
Frigate supports fully local audio transcription using either `sherpa-onnx` or OpenAI's open-source Whisper models via `faster-whisper`. To enable transcription, it is recommended to configure the feature's options at the global level and enable it only on individual cameras at the camera level.
```yaml
audio_transcription:
enabled: False
device: ...
model_size: ...
```
Enable audio transcription for select cameras at the camera level:
```yaml
cameras:
back_yard:
...
audio_transcription:
enabled: True
```
:::note
Audio detection must be enabled and configured as described above in order to use audio transcription features.
:::
The optional config parameters that can be set at the global level include:
- **`enabled`**: Enable or disable the audio transcription feature.
- Default: `False`
- It is recommended to configure the feature at the global level and enable it at the individual camera level.
- **`device`**: Device to use to run transcription and translation models.
- Default: `CPU`
- This can be `CPU` or `GPU`. The `sherpa-onnx` models are lightweight and run on the CPU only. The `whisper` models can run on a GPU, but GPU support is limited to CUDA hardware.
- **`model_size`**: The size of the model used for live transcription.
- Default: `small`
- This can be `small` or `large`. The `small` setting uses `sherpa-onnx` models that are fast, lightweight, and always run on the CPU but are not as accurate as the `whisper` model.
- The `large` setting uses a `whisper` model that is more accurate than the `sherpa-onnx` models, but slower, especially on the CPU.
- This config option applies to **live transcription only**. Recorded `speech` events always use a different `whisper` model (which can be accelerated on CUDA hardware with `device: GPU`).
- **`language`**: Defines the language used by `whisper` to translate `speech` audio events (and live audio only if using the `large` model).
- Default: `en`
- You must use a valid [language code](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
- Transcriptions for `speech` events are translated.
- Live audio is translated only if you are using the `large` model. The `small` `sherpa-onnx` model is English-only.
The only field that is valid at the camera level is `enabled`.
#### Live transcription
The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role. Use the Enable/Disable Live Audio Transcription button/switch to toggle transcription processing. When speech is heard, the UI overlays a translucent black box containing the transcribed text on the camera stream. The MQTT topic `frigate/<camera_name>/audio/transcription` is also updated in real time with the transcribed text.
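Outside of the UI, the same topics can be driven from a script. Below is a minimal sketch (not part of this change) using `paho-mqtt` 1.x; the broker address and the default `frigate` topic prefix are assumptions, and `back_yard` is just an example camera name:
```python
import paho.mqtt.client as mqtt

CAMERA = "back_yard"  # example camera name

def on_message(client, userdata, msg):
    # transcribed text arrives as a plain UTF-8 string payload
    print(f"{msg.topic}: {msg.payload.decode()}")

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker.local", 1883)  # assumed broker address

# live transcription text published by Frigate for this camera
client.subscribe(f"frigate/{CAMERA}/audio/transcription")

# toggle live transcription on (requires audio_transcription to be
# enabled in the config for this camera)
client.publish(f"frigate/{CAMERA}/audio_transcription/set", "ON")

client.loop_forever()
```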
Results can be error-prone due to a number of factors, including:
- Poor quality camera microphone
- Distance of the audio source to the camera microphone
- Low audio bitrate setting in the camera
- Background noise
- Using the `small` model - it's fast, but not accurate for poor quality audio
For speech sources close to the camera with minimal background noise, use the `small` model.
If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate. Using the `large` model with CPU will likely be too slow for real-time transcription.
#### Transcription and translation of `speech` audio events
Any `speech` events in Explore can be transcribed and/or translated through the Transcribe button in the Tracked Object Details pane.
In order to use transcription and translation for past events, you must enable audio detection and define `speech` as an audio type to listen for in your config. To have `speech` events translated into the language of your choice, set the `language` config parameter with the correct [language code](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
The transcribed/translated speech will appear in the description box in the Tracked Object Details pane. If Semantic Search is enabled, embeddings are generated for the transcription text and are fully searchable using the description search type.
Recorded `speech` events will always use a `whisper` model, regardless of the `model_size` config setting. Without a GPU, generating transcriptions for longer `speech` events may take a fair amount of time, so be patient.
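The Transcribe button calls the new `PUT /audio/transcribe` endpoint (see the API changes below), which can also be used directly. A minimal sketch with `requests`, assuming the Frigate API is reachable at the address shown and using a placeholder event id:
```python
import requests

BASE_URL = "http://frigate.local:5000/api"  # assumed Frigate API address
EVENT_ID = "1748358000.123456-abc123"       # placeholder speech event id

resp = requests.put(f"{BASE_URL}/audio/transcribe", json={"event_id": EVENT_ID})

if resp.status_code == 202:
    print("Transcription started; the event description will update when it finishes.")
elif resp.status_code == 409:
    print("Another speech event is already being transcribed; try again later.")
else:
    # 400: transcription not enabled for the camera, 404: unknown event, 500: failure
    print(resp.json().get("message"))
```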

View File

@ -620,6 +620,19 @@ genai:
object_prompts:
person: "My special person prompt."
# Optional: Configuration for audio transcription
# NOTE: only the enabled option can be overridden at the camera level
audio_transcription:
# Optional: Enable audio transcription (default: shown below)
enabled: False
# Optional: The device to run the models on (default: shown below)
device: CPU
# Optional: Set the model size used for transcription. (default: shown below)
model_size: small
# Optional: Set the language used for transcription translation. (default: shown below)
# List of language codes: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10
language: en
# Optional: Restream configuration
# Uses https://github.com/AlexxIT/go2rtc (v1.9.9)
# NOTE: The default go2rtc API port (1984) must be used,

View File

@ -125,7 +125,7 @@ Message published for updates to tracked object metadata, for example:
"name": "John",
"score": 0.95,
"camera": "front_door_cam",
"timestamp": 1607123958.748393,
"timestamp": 1607123958.748393
}
```
@ -139,7 +139,7 @@ Message published for updates to tracked object metadata, for example:
"plate": "123ABC",
"score": 0.95,
"camera": "driveway_cam",
"timestamp": 1607123958.748393,
"timestamp": 1607123958.748393
}
```
@ -255,6 +255,12 @@ Publishes the rms value for audio detected on this camera.
**NOTE:** Requires audio detection to be enabled
### `frigate/<camera_name>/audio/transcription`
Publishes transcribed text for audio detected on this camera.
**NOTE:** Requires audio detection and transcription to be enabled
### `frigate/<camera_name>/enabled/set`
Topic to turn Frigate's processing of a camera on and off. Expected values are `ON` and `OFF`.

View File

@ -14,7 +14,10 @@ from peewee import DoesNotExist
from playhouse.shortcuts import model_to_dict
from frigate.api.auth import require_role
from frigate.api.defs.request.classification_body import RenameFaceBody
from frigate.api.defs.request.classification_body import (
AudioTranscriptionBody,
RenameFaceBody,
)
from frigate.api.defs.tags import Tags
from frigate.config.camera import DetectConfig
from frigate.const import FACE_DIR
@ -366,3 +369,58 @@ def reindex_embeddings(request: Request):
},
status_code=500,
)
@router.put("/audio/transcribe")
def transcribe_audio(request: Request, body: AudioTranscriptionBody):
event_id = body.event_id
try:
event = Event.get(Event.id == event_id)
except DoesNotExist:
message = f"Event {event_id} not found"
logger.error(message)
return JSONResponse(
content=({"success": False, "message": message}), status_code=404
)
if not request.app.frigate_config.cameras[event.camera].audio_transcription.enabled:
message = f"Audio transcription is not enabled for {event.camera}."
logger.error(message)
return JSONResponse(
content=(
{
"success": False,
"message": message,
}
),
status_code=400,
)
context: EmbeddingsContext = request.app.embeddings
response = context.transcribe_audio(model_to_dict(event))
if response == "started":
return JSONResponse(
content={
"success": True,
"message": "Audio transcription has started.",
},
status_code=202, # 202 Accepted
)
elif response == "in_progress":
return JSONResponse(
content={
"success": False,
"message": "Audio transcription for a speech event is currently in progress. Try again later.",
},
status_code=409, # 409 Conflict
)
else:
return JSONResponse(
content={
"success": False,
"message": "Failed to transcribe audio.",
},
status_code=500,
)

View File

@ -3,3 +3,7 @@ from pydantic import BaseModel
class RenameFaceBody(BaseModel):
new_name: str
class AudioTranscriptionBody(BaseModel):
event_id: str

View File

@ -494,7 +494,9 @@ class FrigateApp:
]
if audio_cameras:
self.audio_process = AudioProcessor(audio_cameras, self.camera_metrics)
self.audio_process = AudioProcessor(
self.config, audio_cameras, self.camera_metrics
)
self.audio_process.start()
self.processes["audio_detector"] = self.audio_process.pid or 0

View File

@ -58,6 +58,7 @@ class Dispatcher:
self._camera_settings_handlers: dict[str, Callable] = {
"audio": self._on_audio_command,
"audio_transcription": self._on_audio_transcription_command,
"detect": self._on_detect_command,
"enabled": self._on_enabled_command,
"improve_contrast": self._on_motion_improve_contrast_command,
@ -181,6 +182,9 @@ class Dispatcher:
"snapshots": self.config.cameras[camera].snapshots.enabled,
"record": self.config.cameras[camera].record.enabled,
"audio": self.config.cameras[camera].audio.enabled,
"audio_transcription": self.config.cameras[
camera
].audio_transcription.live_enabled,
"notifications": self.config.cameras[camera].notifications.enabled,
"notifications_suspended": int(
self.web_push_client.suspended_cameras.get(camera, 0)
@ -465,6 +469,37 @@ class Dispatcher:
)
self.publish(f"{camera_name}/audio/state", payload, retain=True)
def _on_audio_transcription_command(self, camera_name: str, payload: str) -> None:
"""Callback for live audio transcription topic."""
audio_transcription_settings = self.config.cameras[
camera_name
].audio_transcription
if payload == "ON":
if not self.config.cameras[
camera_name
].audio_transcription.enabled_in_config:
logger.error(
"Audio transcription must be enabled in the config to be turned on via MQTT."
)
return
if not audio_transcription_settings.live_enabled:
logger.info(f"Turning on live audio transcription for {camera_name}")
audio_transcription_settings.live_enabled = True
elif payload == "OFF":
if audio_transcription_settings.live_enabled:
logger.info(f"Turning off live audio transcription for {camera_name}")
audio_transcription_settings.live_enabled = False
self.config_updater.publish_update(
CameraConfigUpdateTopic(
CameraConfigUpdateEnum.audio_transcription, camera_name
),
audio_transcription_settings,
)
self.publish(f"{camera_name}/audio_transcription/state", payload, retain=True)
def _on_recordings_command(self, camera_name: str, payload: str) -> None:
"""Callback for recordings topic."""
record_settings = self.config.cameras[camera_name].record

View File

@ -18,6 +18,7 @@ class EmbeddingsRequestEnum(Enum):
reprocess_face = "reprocess_face"
reprocess_plate = "reprocess_plate"
reindex = "reindex"
transcribe_audio = "transcribe_audio"
class EmbeddingsResponder:

View File

@ -19,6 +19,7 @@ from frigate.util.builtin import (
from ..base import FrigateBaseModel
from ..classification import (
AudioTranscriptionConfig,
CameraFaceRecognitionConfig,
CameraLicensePlateRecognitionConfig,
)
@ -56,6 +57,9 @@ class CameraConfig(FrigateBaseModel):
audio: AudioConfig = Field(
default_factory=AudioConfig, title="Audio events configuration."
)
audio_transcription: AudioTranscriptionConfig = Field(
default_factory=AudioTranscriptionConfig, title="Audio transcription config."
)
birdseye: BirdseyeCameraConfig = Field(
default_factory=BirdseyeCameraConfig, title="Birdseye camera configuration."
)

View File

@ -12,6 +12,7 @@ class CameraConfigUpdateEnum(str, Enum):
"""Supported camera config update types."""
audio = "audio"
audio_transcription = "audio_transcription"
birdseye = "birdseye"
detect = "detect"
enabled = "enabled"
@ -74,6 +75,8 @@ class CameraConfigUpdateSubscriber:
if update_type == CameraConfigUpdateEnum.audio:
config.audio = updated_config
elif update_type == CameraConfigUpdateEnum.audio_transcription:
config.audio_transcription = updated_config
elif update_type == CameraConfigUpdateEnum.birdseye:
config.birdseye = updated_config
elif update_type == CameraConfigUpdateEnum.detect:

View File

@ -19,11 +19,32 @@ class SemanticSearchModelEnum(str, Enum):
jinav2 = "jinav2"
class LPRDeviceEnum(str, Enum):
class EnrichmentsDeviceEnum(str, Enum):
GPU = "GPU"
CPU = "CPU"
class AudioTranscriptionConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable audio transcription.")
language: str = Field(
default="en",
title="Language abbreviation to use for audio event transcription/translation.",
)
device: Optional[EnrichmentsDeviceEnum] = Field(
default=EnrichmentsDeviceEnum.CPU,
title="The device used for license plate recognition.",
)
model_size: str = Field(
default="small", title="The size of the embeddings model used."
)
enabled_in_config: Optional[bool] = Field(
default=None, title="Keep track of original state of camera."
)
live_enabled: Optional[bool] = Field(
default=False, title="Enable live transcriptions."
)
class BirdClassificationConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable bird classification.")
threshold: float = Field(
@ -144,8 +165,8 @@ class CameraFaceRecognitionConfig(FrigateBaseModel):
class LicensePlateRecognitionConfig(FrigateBaseModel):
enabled: bool = Field(default=False, title="Enable license plate recognition.")
device: Optional[LPRDeviceEnum] = Field(
default=LPRDeviceEnum.CPU,
device: Optional[EnrichmentsDeviceEnum] = Field(
default=EnrichmentsDeviceEnum.CPU,
title="The device used for license plate recognition.",
)
model_size: str = Field(

View File

@ -54,6 +54,7 @@ from .camera.snapshots import SnapshotsConfig
from .camera.timestamp import TimestampStyleConfig
from .camera_group import CameraGroupConfig
from .classification import (
AudioTranscriptionConfig,
ClassificationConfig,
FaceRecognitionConfig,
LicensePlateRecognitionConfig,
@ -419,6 +420,9 @@ class FrigateConfig(FrigateBaseModel):
)
# Classification Config
audio_transcription: AudioTranscriptionConfig = Field(
default_factory=AudioTranscriptionConfig, title="Audio transcription config."
)
classification: ClassificationConfig = Field(
default_factory=ClassificationConfig, title="Object classification config."
)
@ -472,6 +476,7 @@ class FrigateConfig(FrigateBaseModel):
global_config = self.model_dump(
include={
"audio": ...,
"audio_transcription": ...,
"birdseye": ...,
"face_recognition": ...,
"lpr": ...,
@ -528,6 +533,7 @@ class FrigateConfig(FrigateBaseModel):
allowed_fields_map = {
"face_recognition": ["enabled", "min_area"],
"lpr": ["enabled", "expire_time", "min_area", "enhancement"],
"audio_transcription": ["enabled", "live_enabled"],
}
for section in allowed_fields_map:
@ -609,6 +615,9 @@ class FrigateConfig(FrigateBaseModel):
# set config pre-value
camera_config.enabled_in_config = camera_config.enabled
camera_config.audio.enabled_in_config = camera_config.audio.enabled
camera_config.audio_transcription.enabled_in_config = (
camera_config.audio_transcription.enabled
)
camera_config.record.enabled_in_config = camera_config.record.enabled
camera_config.notifications.enabled_in_config = (
camera_config.notifications.enabled
@ -701,6 +710,21 @@ class FrigateConfig(FrigateBaseModel):
self.model.create_colormap(sorted(self.objects.all_objects))
self.model.check_and_load_plus_model(self.plus_api)
# Check audio transcription and audio detection requirements
if self.audio_transcription.enabled:
# If audio transcription is enabled globally, at least one camera must have audio detection enabled
if not any(camera.audio.enabled for camera in self.cameras.values()):
raise ValueError(
"Audio transcription is enabled globally, but no cameras have audio detection enabled. At least one camera must have audio detection enabled."
)
else:
# If audio transcription is disabled globally, check each camera with audio_transcription enabled
for camera in self.cameras.values():
if camera.audio_transcription.enabled and not camera.audio.enabled:
raise ValueError(
f"Camera {camera.name} has audio transcription enabled, but audio detection is not enabled for this camera. Audio detection must be enabled for cameras with audio transcription when it is disabled globally."
)
if self.plus_api and not self.snapshots.clean_copy:
logger.warning(
"Frigate+ is configured but clean snapshots are not enabled, submissions to Frigate+ will not be possible./"

View File

@ -0,0 +1,212 @@
"""Handle post-processing for audio transcription."""
import logging
import os
import threading
import time
from typing import Optional
from faster_whisper import WhisperModel
from peewee import DoesNotExist
from frigate.comms.embeddings_updater import EmbeddingsRequestEnum
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import FrigateConfig
from frigate.const import (
CACHE_DIR,
MODEL_CACHE_DIR,
UPDATE_EVENT_DESCRIPTION,
)
from frigate.data_processing.types import PostProcessDataEnum
from frigate.types import TrackedObjectUpdateTypesEnum
from frigate.util.audio import get_audio_from_recording
from ..types import DataProcessorMetrics
from .api import PostProcessorApi
logger = logging.getLogger(__name__)
class AudioTranscriptionPostProcessor(PostProcessorApi):
def __init__(
self,
config: FrigateConfig,
requestor: InterProcessRequestor,
metrics: DataProcessorMetrics,
):
super().__init__(config, metrics, None)
self.config = config
self.requestor = requestor
self.recognizer = None
self.transcription_lock = threading.Lock()
self.transcription_thread = None
self.transcription_running = False
# faster-whisper handles model downloading automatically
self.model_path = os.path.join(MODEL_CACHE_DIR, "whisper")
os.makedirs(self.model_path, exist_ok=True)
self.__build_recognizer()
def __build_recognizer(self) -> None:
try:
self.recognizer = WhisperModel(
model_size_or_path="small",
device="cuda"
if self.config.audio_transcription.device == "GPU"
else "cpu",
download_root=self.model_path,
local_files_only=False, # Allow downloading if not cached
compute_type="int8",
)
logger.debug("Audio transcription (recordings) initialized")
except Exception as e:
logger.error(f"Failed to initialize recordings audio transcription: {e}")
self.recognizer = None
def process_data(
self, data: dict[str, any], data_type: PostProcessDataEnum
) -> None:
"""Transcribe audio from a recording.
Args:
data (dict): Contains data about the input (event_id, camera, etc.).
data_type (enum): Describes the data being processed (recording or tracked_object).
Returns:
None
"""
event_id = data["event_id"]
camera_name = data["camera"]
if data_type == PostProcessDataEnum.recording:
start_ts = data["frame_time"]
recordings_available_through = data["recordings_available"]
end_ts = min(recordings_available_through, start_ts + 60) # Default 60s
elif data_type == PostProcessDataEnum.tracked_object:
obj_data = data["event"]["data"]
obj_data["id"] = data["event"]["id"]
obj_data["camera"] = data["event"]["camera"]
start_ts = data["event"]["start_time"]
end_ts = data["event"].get(
"end_time", start_ts + 60
) # Use end_time if available
else:
logger.error("No data type passed to audio transcription post-processing")
return
try:
audio_data = get_audio_from_recording(
self.config.cameras[camera_name].ffmpeg,
camera_name,
start_ts,
end_ts,
sample_rate=16000,
)
if not audio_data:
logger.debug(f"No audio data extracted for {event_id}")
return
transcription = self.__transcribe_audio(audio_data)
if not transcription:
logger.debug("No transcription generated from audio")
return
logger.debug(f"Transcribed audio for {event_id}: '{transcription}'")
self.requestor.send_data(
UPDATE_EVENT_DESCRIPTION,
{
"type": TrackedObjectUpdateTypesEnum.description,
"id": event_id,
"description": transcription,
"camera": camera_name,
},
)
# Embed the description
self.requestor.send_data(
EmbeddingsRequestEnum.embed_description.value,
{"id": event_id, "description": transcription},
)
except DoesNotExist:
logger.debug("No recording found for audio transcription post-processing")
return
except Exception as e:
logger.error(f"Error in audio transcription post-processing: {e}")
def __transcribe_audio(self, audio_data: bytes) -> Optional[str]:
"""Transcribe WAV audio data using faster-whisper."""
if not self.recognizer:
logger.debug("Recognizer not initialized")
return None
try:
# Save audio data to a temporary wav (faster-whisper expects a file)
temp_wav = os.path.join(CACHE_DIR, f"temp_audio_{int(time.time())}.wav")
with open(temp_wav, "wb") as f:
f.write(audio_data)
segments, info = self.recognizer.transcribe(
temp_wav,
language=self.config.audio_transcription.language,
beam_size=5,
)
os.remove(temp_wav)
# Combine all segment texts
text = " ".join(segment.text.strip() for segment in segments)
if not text:
return None
logger.debug(
"Detected language '%s' with probability %f"
% (info.language, info.language_probability)
)
return text
except Exception as e:
logger.error(f"Error transcribing audio: {e}")
return None
def _transcription_wrapper(self, event: dict[str, any]) -> None:
"""Wrapper to run transcription and reset running flag when done."""
try:
self.process_data(
{
"event_id": event["id"],
"camera": event["camera"],
"event": event,
},
PostProcessDataEnum.tracked_object,
)
finally:
with self.transcription_lock:
self.transcription_running = False
self.transcription_thread = None
def handle_request(self, topic: str, request_data: dict[str, any]) -> str | None:
if topic == "transcribe_audio":
event = request_data["event"]
with self.transcription_lock:
if self.transcription_running:
logger.warning(
"Audio transcription for a speech event is already running."
)
return "in_progress"
# Mark as running and start the thread
self.transcription_running = True
self.transcription_thread = threading.Thread(
target=self._transcription_wrapper, args=(event,), daemon=True
)
self.transcription_thread.start()
return "started"
return None

View File

@ -0,0 +1,276 @@
"""Handle processing audio for speech transcription using sherpa-onnx with FFmpeg pipe."""
import logging
import os
import queue
import threading
from typing import Optional
import numpy as np
import sherpa_onnx
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import CameraConfig, FrigateConfig
from frigate.const import MODEL_CACHE_DIR
from frigate.util.downloader import ModelDownloader
from ..types import DataProcessorMetrics
from .api import RealTimeProcessorApi
from .whisper_online import FasterWhisperASR, OnlineASRProcessor
logger = logging.getLogger(__name__)
class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
def __init__(
self,
config: FrigateConfig,
camera_config: CameraConfig,
requestor: InterProcessRequestor,
metrics: DataProcessorMetrics,
stop_event: threading.Event,
):
super().__init__(config, metrics)
self.config = config
self.camera_config = camera_config
self.requestor = requestor
self.recognizer = None
self.stream = None
self.online = None  # OnlineASRProcessor is only created for the large model
self.transcription_segments = []
self.audio_queue = queue.Queue()
self.stop_event = stop_event
if self.config.audio_transcription.model_size == "large":
self.asr = FasterWhisperASR(
modelsize="tiny",
device="cuda"
if self.config.audio_transcription.device == "GPU"
else "cpu",
lan=config.audio_transcription.language,
model_dir=os.path.join(MODEL_CACHE_DIR, "whisper"),
)
self.asr.use_vad() # Enable Silero VAD for low-RMS audio
else:
# small model as default
download_path = os.path.join(MODEL_CACHE_DIR, "sherpa-onnx")
HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co")
self.model_files = {
"encoder.onnx": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.onnx",
"decoder.onnx": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.onnx",
"joiner.onnx": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.onnx",
"tokens.txt": f"{HF_ENDPOINT}/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt",
}
if not all(
os.path.exists(os.path.join(download_path, n))
for n in self.model_files.keys()
):
self.downloader = ModelDownloader(
model_name="sherpa-onnx",
download_path=download_path,
file_names=self.model_files.keys(),
download_func=self.__download_models,
complete_func=self.__build_recognizer,
)
self.downloader.ensure_model_files()
self.__build_recognizer()
def __download_models(self, path: str) -> None:
try:
file_name = os.path.basename(path)
ModelDownloader.download_from_url(self.model_files[file_name], path)
except Exception as e:
logger.error(f"Failed to download {path}: {e}")
def __build_recognizer(self) -> None:
try:
if self.config.audio_transcription.model_size == "large":
self.online = OnlineASRProcessor(
asr=self.asr,
)
else:
self.recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
tokens=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/tokens.txt"),
encoder=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/encoder.onnx"),
decoder=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/decoder.onnx"),
joiner=os.path.join(MODEL_CACHE_DIR, "sherpa-onnx/joiner.onnx"),
num_threads=2,
sample_rate=16000,
feature_dim=80,
enable_endpoint_detection=True,
rule1_min_trailing_silence=2.4,
rule2_min_trailing_silence=1.2,
rule3_min_utterance_length=300,
decoding_method="greedy_search",
provider="cpu",
)
self.stream = self.recognizer.create_stream()
logger.debug("Audio transcription (live) initialized")
except Exception as e:
logger.error(
f"Failed to initialize live streaming audio transcription: {e}"
)
self.recognizer = None
def __process_audio_stream(
self, audio_data: np.ndarray
) -> Optional[tuple[str, bool]]:
if (not self.recognizer or not self.stream) and not self.online:
logger.debug(
"Audio transcription (streaming) recognizer or stream not initialized"
)
return None
try:
if audio_data.dtype != np.float32:
audio_data = audio_data.astype(np.float32)
if audio_data.max() > 1.0 or audio_data.min() < -1.0:
audio_data = audio_data / 32768.0 # Normalize from int16
rms = float(np.sqrt(np.mean(np.absolute(np.square(audio_data)))))
logger.debug(f"Audio chunk size: {audio_data.size}, RMS: {rms:.4f}")
if self.config.audio_transcription.model_size == "large":
# large model
self.online.insert_audio_chunk(audio_data)
output = self.online.process_iter()
text = output[2].strip()
is_endpoint = text.endswith((".", "!", "?"))
if text:
self.transcription_segments.append(text)
concatenated_text = " ".join(self.transcription_segments)
logger.debug(f"Concatenated transcription: '{concatenated_text}'")
text = concatenated_text
else:
# small model
self.stream.accept_waveform(16000, audio_data)
while self.recognizer.is_ready(self.stream):
self.recognizer.decode_stream(self.stream)
text = self.recognizer.get_result(self.stream).strip()
is_endpoint = self.recognizer.is_endpoint(self.stream)
logger.debug(f"Transcription result: '{text}'")
if not text:
logger.debug("No transcription, returning")
return None
logger.debug(f"Endpoint detected: {is_endpoint}")
if is_endpoint and self.config.audio_transcription.model_size == "small":
# reset sherpa if we've reached an endpoint
self.recognizer.reset(self.stream)
return text, is_endpoint
except Exception as e:
logger.error(f"Error processing audio stream: {e}")
return None
def process_frame(self, obj_data: dict[str, any], frame: np.ndarray) -> None:
pass
def process_audio(self, obj_data: dict[str, any], audio: np.ndarray) -> bool | None:
if audio is None or audio.size == 0:
logger.debug("No audio data provided for transcription")
return None
# enqueue audio data for processing in the thread
self.audio_queue.put((obj_data, audio))
return None
def run(self) -> None:
"""Run method for the transcription thread to process queued audio data."""
logger.debug(
f"Starting audio transcription thread for {self.camera_config.name}"
)
while not self.stop_event.is_set():
try:
# Get audio data from queue with a timeout to check stop_event
obj_data, audio = self.audio_queue.get(timeout=0.1)
result = self.__process_audio_stream(audio)
if not result:
continue
text, is_endpoint = result
logger.debug(f"Transcribed audio: '{text}', Endpoint: {is_endpoint}")
self.requestor.send_data(
f"{self.camera_config.name}/audio/transcription", text
)
self.audio_queue.task_done()
if is_endpoint:
self.reset(obj_data["camera"])
except queue.Empty:
continue
except Exception as e:
logger.error(f"Error processing audio in thread: {e}")
self.audio_queue.task_done()
logger.debug(
f"Stopping audio transcription thread for {self.camera_config.name}"
)
def reset(self, camera: str) -> None:
if self.config.audio_transcription.model_size == "large":
# get final output from whisper
output = self.online.finish()
self.transcription_segments = []
self.requestor.send_data(
f"{self.camera_config.name}/audio/transcription",
(output[2].strip() + " "),
)
# reset whisper
self.online.init()
else:
# reset sherpa
self.recognizer.reset(self.stream)
# Clear the audio queue
while not self.audio_queue.empty():
try:
self.audio_queue.get_nowait()
self.audio_queue.task_done()
except queue.Empty:
break
logger.debug("Stream reset")
def stop(self) -> None:
"""Stop the transcription thread and clean up."""
self.stop_event.set()
# Clear the queue to prevent processing stale data
while not self.audio_queue.empty():
try:
self.audio_queue.get_nowait()
self.audio_queue.task_done()
except queue.Empty:
break
logger.debug(
f"Transcription thread stop signaled for {self.camera_config.name}"
)
def handle_request(
self, topic: str, request_data: dict[str, any]
) -> dict[str, any] | None:
if topic == "clear_audio_recognizer":
self.recognizer = None
self.stream = None
self.__build_recognizer()
return {"message": "Audio recognizer cleared and rebuilt", "success": True}
return None
def expire_object(self, object_id: str) -> None:
pass

File diff suppressed because it is too large

View File

@ -291,3 +291,8 @@ class EmbeddingsContext:
def reindex_embeddings(self) -> dict[str, Any]:
return self.requestor.send_data(EmbeddingsRequestEnum.reindex.value, {})
def transcribe_audio(self, event: dict[str, any]) -> dict[str, any]:
return self.requestor.send_data(
EmbeddingsRequestEnum.transcribe_audio.value, {"event": event}
)

View File

@ -37,6 +37,9 @@ from frigate.data_processing.common.license_plate.model import (
LicensePlateModelRunner,
)
from frigate.data_processing.post.api import PostProcessorApi
from frigate.data_processing.post.audio_transcription import (
AudioTranscriptionPostProcessor,
)
from frigate.data_processing.post.license_plate import (
LicensePlatePostProcessor,
)
@ -176,6 +179,14 @@ class EmbeddingMaintainer(threading.Thread):
)
)
if any(
c.enabled_in_config and c.audio_transcription.enabled
for c in self.config.cameras.values()
):
self.post_processors.append(
AudioTranscriptionPostProcessor(self.config, self.requestor, metrics)
)
self.stop_event = stop_event
self.tracked_events: dict[str, list[Any]] = {}
self.early_request_sent: dict[str, bool] = {}
@ -372,6 +383,8 @@ class EmbeddingMaintainer(threading.Thread):
},
PostProcessDataEnum.recording,
)
elif isinstance(processor, AudioTranscriptionPostProcessor):
continue
else:
processor.process_data(event_id, PostProcessDataEnum.event_id)

View File

@ -18,7 +18,7 @@ from frigate.comms.event_metadata_updater import (
EventMetadataTypeEnum,
)
from frigate.comms.inter_process import InterProcessRequestor
from frigate.config import CameraConfig, CameraInput, FfmpegConfig
from frigate.config import CameraConfig, CameraInput, FfmpegConfig, FrigateConfig
from frigate.config.camera.updater import (
CameraConfigUpdateEnum,
CameraConfigUpdateSubscriber,
@ -30,6 +30,9 @@ from frigate.const import (
AUDIO_MIN_CONFIDENCE,
AUDIO_SAMPLE_RATE,
)
from frigate.data_processing.real_time.audio_transcription import (
AudioTranscriptionRealTimeProcessor,
)
from frigate.ffmpeg_presets import parse_preset_input
from frigate.log import LogPipe
from frigate.object_detection.base import load_labels
@ -75,6 +78,7 @@ class AudioProcessor(util.Process):
def __init__(
self,
config: FrigateConfig,
cameras: list[CameraConfig],
camera_metrics: dict[str, CameraMetrics],
):
@ -82,6 +86,7 @@ class AudioProcessor(util.Process):
self.camera_metrics = camera_metrics
self.cameras = cameras
self.config = config
def run(self) -> None:
audio_threads: list[AudioEventMaintainer] = []
@ -94,6 +99,7 @@ class AudioProcessor(util.Process):
for camera in self.cameras:
audio_thread = AudioEventMaintainer(
camera,
self.config,
self.camera_metrics,
self.stop_event,
)
@ -122,46 +128,71 @@ class AudioEventMaintainer(threading.Thread):
def __init__(
self,
camera: CameraConfig,
config: FrigateConfig,
camera_metrics: dict[str, CameraMetrics],
stop_event: threading.Event,
) -> None:
super().__init__(name=f"{camera.name}_audio_event_processor")
self.config = camera
self.config = config
self.camera_config = camera
self.camera_metrics = camera_metrics
self.detections: dict[dict[str, Any]] = {}
self.stop_event = stop_event
self.detector = AudioTfl(stop_event, self.config.audio.num_threads)
self.detector = AudioTfl(stop_event, self.camera_config.audio.num_threads)
self.shape = (int(round(AUDIO_DURATION * AUDIO_SAMPLE_RATE)),)
self.chunk_size = int(round(AUDIO_DURATION * AUDIO_SAMPLE_RATE * 2))
self.logger = logging.getLogger(f"audio.{self.config.name}")
self.ffmpeg_cmd = get_ffmpeg_command(self.config.ffmpeg)
self.logpipe = LogPipe(f"ffmpeg.{self.config.name}.audio")
self.logger = logging.getLogger(f"audio.{self.camera_config.name}")
self.ffmpeg_cmd = get_ffmpeg_command(self.camera_config.ffmpeg)
self.logpipe = LogPipe(f"ffmpeg.{self.camera_config.name}.audio")
self.audio_listener = None
self.transcription_processor = None
self.transcription_thread = None
# create communication for audio detections
self.requestor = InterProcessRequestor()
self.config_subscriber = CameraConfigUpdateSubscriber(
{self.config.name: self.config},
[CameraConfigUpdateEnum.audio, CameraConfigUpdateEnum.enabled],
{self.camera_config.name: self.camera_config},
[
CameraConfigUpdateEnum.audio,
CameraConfigUpdateEnum.enabled,
CameraConfigUpdateEnum.audio_transcription,
],
)
self.detection_publisher = DetectionPublisher(DetectionTypeEnum.audio)
self.event_metadata_publisher = EventMetadataPublisher()
if self.camera_config.audio_transcription.enabled_in_config:
# init the transcription processor for this camera
self.transcription_processor = AudioTranscriptionRealTimeProcessor(
config=self.config,
camera_config=self.camera_config,
requestor=self.requestor,
metrics=self.camera_metrics[self.camera_config.name],
stop_event=self.stop_event,
)
self.transcription_thread = threading.Thread(
target=self.transcription_processor.run,
name=f"{self.camera_config.name}_transcription_processor",
daemon=True,
)
self.transcription_thread.start()
self.was_enabled = camera.enabled
def detect_audio(self, audio) -> None:
if not self.config.audio.enabled or self.stop_event.is_set():
if not self.camera_config.audio.enabled or self.stop_event.is_set():
return
audio_as_float = audio.astype(np.float32)
rms, dBFS = self.calculate_audio_levels(audio_as_float)
self.camera_metrics[self.config.name].audio_rms.value = rms
self.camera_metrics[self.config.name].audio_dBFS.value = dBFS
self.camera_metrics[self.camera_config.name].audio_rms.value = rms
self.camera_metrics[self.camera_config.name].audio_dBFS.value = dBFS
# only run audio detection when volume is above min_volume
if rms >= self.config.audio.min_volume:
if rms >= self.camera_config.audio.min_volume:
# create waveform relative to max range and look for detections
waveform = (audio / AUDIO_MAX_BIT_RANGE).astype(np.float32)
model_detections = self.detector.detect(waveform)
@ -169,28 +200,42 @@ class AudioEventMaintainer(threading.Thread):
for label, score, _ in model_detections:
self.logger.debug(
f"{self.config.name} heard {label} with a score of {score}"
f"{self.camera_config.name} heard {label} with a score of {score}"
)
if label not in self.config.audio.listen:
if label not in self.camera_config.audio.listen:
continue
if score > dict((self.config.audio.filters or {}).get(label, {})).get(
"threshold", 0.8
):
if score > dict(
(self.camera_config.audio.filters or {}).get(label, {})
).get("threshold", 0.8):
self.handle_detection(label, score)
audio_detections.append(label)
# send audio detection data
self.detection_publisher.publish(
(
self.config.name,
self.camera_config.name,
datetime.datetime.now().timestamp(),
dBFS,
audio_detections,
)
)
# run audio transcription
if self.transcription_processor is not None and (
self.camera_config.audio_transcription.live_enabled
):
self.transcribing = True
# process audio until we've reached the endpoint
self.transcription_processor.process_audio(
{
"id": f"{self.camera_config.name}_audio",
"camera": self.camera_config.name,
},
audio,
)
self.expire_detections()
def calculate_audio_levels(self, audio_as_float: np.float32) -> Tuple[float, float]:
@ -204,8 +249,8 @@ class AudioEventMaintainer(threading.Thread):
else:
dBFS = 0
self.requestor.send_data(f"{self.config.name}/audio/dBFS", float(dBFS))
self.requestor.send_data(f"{self.config.name}/audio/rms", float(rms))
self.requestor.send_data(f"{self.camera_config.name}/audio/dBFS", float(dBFS))
self.requestor.send_data(f"{self.camera_config.name}/audio/rms", float(rms))
return float(rms), float(dBFS)
@ -220,13 +265,13 @@ class AudioEventMaintainer(threading.Thread):
random.choices(string.ascii_lowercase + string.digits, k=6)
)
event_id = f"{now}-{rand_id}"
self.requestor.send_data(f"{self.config.name}/audio/{label}", "ON")
self.requestor.send_data(f"{self.camera_config.name}/audio/{label}", "ON")
self.event_metadata_publisher.publish(
EventMetadataTypeEnum.manual_event_create,
(
now,
self.config.name,
self.camera_config.name,
label,
event_id,
True,
@ -252,10 +297,10 @@ class AudioEventMaintainer(threading.Thread):
if (
now - detection.get("last_detection", now)
> self.config.audio.max_not_heard
> self.camera_config.audio.max_not_heard
):
self.requestor.send_data(
f"{self.config.name}/audio/{detection['label']}", "OFF"
f"{self.camera_config.name}/audio/{detection['label']}", "OFF"
)
self.event_metadata_publisher.publish(
@ -264,12 +309,21 @@ class AudioEventMaintainer(threading.Thread):
)
self.detections[detection["label"]] = None
# clear real-time transcription
if self.transcription_processor is not None:
self.transcription_processor.reset(self.camera_config.name)
self.requestor.send_data(
f"{self.camera_config.name}/audio/transcription", ""
)
def expire_all_detections(self) -> None:
"""Immediately end all current detections"""
now = datetime.datetime.now().timestamp()
for label, detection in list(self.detections.items()):
if detection:
self.requestor.send_data(f"{self.config.name}/audio/{label}", "OFF")
self.requestor.send_data(
f"{self.camera_config.name}/audio/{label}", "OFF"
)
self.event_metadata_publisher.publish(
EventMetadataTypeEnum.manual_event_end,
(detection["id"], now),
@ -290,7 +344,7 @@ class AudioEventMaintainer(threading.Thread):
if self.stop_event.is_set():
return
time.sleep(self.config.ffmpeg.retry_interval)
time.sleep(self.camera_config.ffmpeg.retry_interval)
self.logpipe.dump()
self.start_or_restart_ffmpeg()
@ -312,20 +366,20 @@ class AudioEventMaintainer(threading.Thread):
log_and_restart()
def run(self) -> None:
if self.config.enabled:
if self.camera_config.enabled:
self.start_or_restart_ffmpeg()
while not self.stop_event.is_set():
enabled = self.config.enabled
enabled = self.camera_config.enabled
if enabled != self.was_enabled:
if enabled:
self.logger.debug(
f"Enabling audio detections for {self.config.name}"
f"Enabling audio detections for {self.camera_config.name}"
)
self.start_or_restart_ffmpeg()
else:
self.logger.debug(
f"Disabling audio detections for {self.config.name}, ending events"
f"Disabling audio detections for {self.camera_config.name}, ending events"
)
self.expire_all_detections()
stop_ffmpeg(self.audio_listener, self.logger)
@ -344,6 +398,12 @@ class AudioEventMaintainer(threading.Thread):
if self.audio_listener:
stop_ffmpeg(self.audio_listener, self.logger)
if self.transcription_thread:
self.transcription_thread.join(timeout=2)
if self.transcription_thread.is_alive():
self.logger.warning(
f"Audio transcription thread {self.transcription_thread.name} is still alive"
)
self.logpipe.close()
self.requestor.stop()
self.config_subscriber.stop()

frigate/util/audio.py Normal file (116 lines)
View File

@ -0,0 +1,116 @@
"""Utilities for creating and manipulating audio."""
import logging
import os
import subprocess as sp
from typing import Optional
from pathvalidate import sanitize_filename
from frigate.const import CACHE_DIR
from frigate.models import Recordings
logger = logging.getLogger(__name__)
def get_audio_from_recording(
ffmpeg,
camera_name: str,
start_ts: float,
end_ts: float,
sample_rate: int = 16000,
) -> Optional[bytes]:
"""Extract audio from recording files between start_ts and end_ts in WAV format suitable for sherpa-onnx.
Args:
ffmpeg: FFmpeg configuration object
camera_name: Name of the camera
start_ts: Start timestamp in seconds
end_ts: End timestamp in seconds
sample_rate: Sample rate for output audio (default 16kHz for sherpa-onnx)
Returns:
Bytes of WAV audio data or None if extraction failed
"""
# Fetch all relevant recording segments
recordings = (
Recordings.select(
Recordings.path,
Recordings.start_time,
Recordings.end_time,
)
.where(
(Recordings.start_time.between(start_ts, end_ts))
| (Recordings.end_time.between(start_ts, end_ts))
| ((start_ts > Recordings.start_time) & (end_ts < Recordings.end_time))
)
.where(Recordings.camera == camera_name)
.order_by(Recordings.start_time.asc())
)
if not recordings:
logger.debug(
f"No recordings found for {camera_name} between {start_ts} and {end_ts}"
)
return None
# Generate concat playlist file
file_name = sanitize_filename(
f"audio_playlist_{camera_name}_{start_ts}-{end_ts}.txt"
)
file_path = os.path.join(CACHE_DIR, file_name)
try:
with open(file_path, "w") as file:
for clip in recordings:
file.write(f"file '{clip.path}'\n")
if clip.start_time < start_ts:
file.write(f"inpoint {int(start_ts - clip.start_time)}\n")
if clip.end_time > end_ts:
file.write(f"outpoint {int(end_ts - clip.start_time)}\n")
ffmpeg_cmd = [
ffmpeg.ffmpeg_path,
"-hide_banner",
"-loglevel",
"warning",
"-protocol_whitelist",
"pipe,file",
"-f",
"concat",
"-safe",
"0",
"-i",
file_path,
"-vn", # No video
"-acodec",
"pcm_s16le", # 16-bit PCM encoding
"-ar",
str(sample_rate),
"-ac",
"1", # Mono audio
"-f",
"wav",
"-",
]
process = sp.run(
ffmpeg_cmd,
capture_output=True,
)
if process.returncode == 0:
logger.debug(
f"Successfully extracted audio for {camera_name} from {start_ts} to {end_ts}"
)
return process.stdout
else:
logger.error(f"Failed to extract audio: {process.stderr.decode()}")
return None
except Exception as e:
logger.error(f"Error extracting audio from recordings: {e}")
return None
finally:
try:
os.unlink(file_path)
except OSError:
pass

View File

@ -103,12 +103,14 @@
"success": {
"regenerate": "A new description has been requested from {{provider}}. Depending on the speed of your provider, the new description may take some time to regenerate.",
"updatedSublabel": "Successfully updated sub label.",
"updatedLPR": "Successfully updated license plate."
"updatedLPR": "Successfully updated license plate.",
"audioTranscription": "Successfully requested audio transcription."
},
"error": {
"regenerate": "Failed to call {{provider}} for a new description: {{errorMessage}}",
"updatedSublabelFailed": "Failed to update sub label: {{errorMessage}}",
"updatedLPRFailed": "Failed to update license plate: {{errorMessage}}"
"updatedLPRFailed": "Failed to update license plate: {{errorMessage}}",
"audioTranscription": "Failed to request audio transcription: {{errorMessage}}"
}
}
},
@ -173,6 +175,10 @@
"label": "Find similar",
"aria": "Find similar tracked objects"
},
"audioTranscription": {
"label": "Transcribe",
"aria": "Request audio transcription"
},
"submitToPlus": {
"label": "Submit to Frigate+",
"aria": "Submit to Frigate Plus"

View File

@ -69,6 +69,10 @@
"enable": "Enable Audio Detect",
"disable": "Disable Audio Detect"
},
"transcription": {
"enable": "Enable Live Audio Transcription",
"disable": "Disable Live Audio Transcription"
},
"autotracking": {
"enable": "Enable Autotracking",
"disable": "Disable Autotracking"
@ -135,6 +139,7 @@
"recording": "Recording",
"snapshots": "Snapshots",
"audioDetection": "Audio Detection",
"transcription": "Audio Transcription",
"autotracking": "Autotracking"
},
"history": {

View File

@ -8,6 +8,7 @@ import {
FrigateReview,
ModelState,
ToggleableSetting,
TrackedObjectUpdateReturnType,
} from "@/types/ws";
import { FrigateStats } from "@/types/stats";
import { createContainer } from "react-tracked";
@ -60,6 +61,7 @@ function useValue(): useValueReturn {
enabled,
snapshots,
audio,
audio_transcription,
notifications,
notifications_suspended,
autotracking,
@ -71,6 +73,9 @@ function useValue(): useValueReturn {
cameraStates[`${name}/detect/state`] = detect ? "ON" : "OFF";
cameraStates[`${name}/snapshots/state`] = snapshots ? "ON" : "OFF";
cameraStates[`${name}/audio/state`] = audio ? "ON" : "OFF";
cameraStates[`${name}/audio_transcription/state`] = audio_transcription
? "ON"
: "OFF";
cameraStates[`${name}/notifications/state`] = notifications
? "ON"
: "OFF";
@ -220,6 +225,20 @@ export function useAudioState(camera: string): {
return { payload: payload as ToggleableSetting, send };
}
export function useAudioTranscriptionState(camera: string): {
payload: ToggleableSetting;
send: (payload: ToggleableSetting, retain?: boolean) => void;
} {
const {
value: { payload },
send,
} = useWs(
`${camera}/audio_transcription/state`,
`${camera}/audio_transcription/set`,
);
return { payload: payload as ToggleableSetting, send };
}
export function useAutotrackingState(camera: string): {
payload: ToggleableSetting;
send: (payload: ToggleableSetting, retain?: boolean) => void;
@ -421,6 +440,15 @@ export function useAudioActivity(camera: string): { payload: number } {
return { payload: payload as number };
}
export function useAudioLiveTranscription(camera: string): {
payload: string;
} {
const {
value: { payload },
} = useWs(`${camera}/audio/transcription`, "");
return { payload: payload as string };
}
export function useMotionThreshold(camera: string): {
payload: string;
send: (payload: number, retain?: boolean) => void;
@ -463,11 +491,16 @@ export function useImproveContrast(camera: string): {
return { payload: payload as ToggleableSetting, send };
}
export function useTrackedObjectUpdate(): { payload: string } {
export function useTrackedObjectUpdate(): {
payload: TrackedObjectUpdateReturnType;
} {
const {
value: { payload },
} = useWs("tracked_object_update", "");
return useDeepMemo(JSON.parse(payload as string));
const parsed = payload
? JSON.parse(payload as string)
: { type: "", id: "", camera: "" };
return { payload: useDeepMemo(parsed) };
}
export function useNotifications(camera: string): {

View File

@ -77,6 +77,7 @@ import { Trans, useTranslation } from "react-i18next";
import { TbFaceId } from "react-icons/tb";
import { useIsAdmin } from "@/hooks/use-is-admin";
import FaceSelectionDialog from "../FaceSelectionDialog";
import { CgTranscript } from "react-icons/cg";
const SEARCH_TABS = [
"details",
@ -709,6 +710,34 @@ function ObjectDetailsTab({
[search, t],
);
// speech transcription
const onTranscribe = useCallback(() => {
axios
.put(`/audio/transcribe`, { event_id: search.id })
.then((resp) => {
if (resp.status == 202) {
toast.success(t("details.item.toast.success.audioTranscription"), {
position: "top-center",
});
}
})
.catch((error) => {
const errorMessage =
error.response?.data?.message ||
error.response?.data?.detail ||
"Unknown error";
toast.error(
t("details.item.toast.error.audioTranscription", {
errorMessage,
}),
{
position: "top-center",
},
);
});
}, [search, t]);
return (
<div className="flex flex-col gap-5">
<div className="flex w-full flex-row">
@ -894,6 +923,16 @@ function ObjectDetailsTab({
</Button>
</FaceSelectionDialog>
)}
{config?.cameras[search?.camera].audio_transcription.enabled &&
search?.label == "speech" &&
search?.end_time && (
<Button className="w-full" onClick={onTranscribe}>
<div className="flex gap-1">
<CgTranscript />
{t("itemMenu.audioTranscription.label")}
</div>
</Button>
)}
</div>
</div>
</div>

View File

@ -246,15 +246,13 @@ export default function Explore() {
// mutation and revalidation
const trackedObjectUpdate = useTrackedObjectUpdate();
const { payload: wsUpdate } = useTrackedObjectUpdate();
useEffect(() => {
if (trackedObjectUpdate) {
if (wsUpdate && wsUpdate.type == "description") {
mutate();
}
// mutate / revalidate when event description updates come in
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [trackedObjectUpdate]);
}, [wsUpdate, mutate]);
// embeddings reindex progress

View File

@ -41,6 +41,11 @@ export interface CameraConfig {
min_volume: number;
num_threads: number;
};
audio_transcription: {
enabled: boolean;
enabled_in_config: boolean;
live_enabled: boolean;
};
best_image_timeout: number;
birdseye: {
enabled: boolean;
@ -296,6 +301,10 @@ export interface FrigateConfig {
num_threads: number;
};
audio_transcription: {
enabled: boolean;
};
birdseye: BirdseyeConfig;
cameras: {

View File

@ -58,6 +58,7 @@ export interface FrigateCameraState {
snapshots: boolean;
record: boolean;
audio: boolean;
audio_transcription: boolean;
notifications: boolean;
notifications_suspended: number;
autotracking: boolean;
@ -84,3 +85,21 @@ export type EmbeddingsReindexProgressType = {
};
export type ToggleableSetting = "ON" | "OFF";
export type TrackedObjectUpdateType =
| "description"
| "lpr"
| "transcription"
| "face";
export type TrackedObjectUpdateReturnType = {
type: TrackedObjectUpdateType;
id: string;
camera: string;
description?: string;
name?: string;
plate?: string;
score?: number;
timestamp?: number;
text?: string;
} | null;

View File

@ -74,13 +74,13 @@ export default function ExploreView({
}, {});
}, [events]);
const trackedObjectUpdate = useTrackedObjectUpdate();
const { payload: wsUpdate } = useTrackedObjectUpdate();
useEffect(() => {
mutate();
// mutate / revalidate when event description updates come in
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [trackedObjectUpdate]);
if (wsUpdate && wsUpdate.type == "description") {
mutate();
}
}, [wsUpdate, mutate]);
// update search detail when results change

View File

@ -1,5 +1,7 @@
import {
useAudioLiveTranscription,
useAudioState,
useAudioTranscriptionState,
useAutotrackingState,
useDetectState,
useEnabledState,
@ -90,6 +92,8 @@ import {
LuX,
} from "react-icons/lu";
import {
MdClosedCaption,
MdClosedCaptionDisabled,
MdNoPhotography,
MdOutlineRestartAlt,
MdPersonOff,
@ -196,6 +200,29 @@ export default function LiveCameraView({
const { payload: enabledState } = useEnabledState(camera.name);
const cameraEnabled = enabledState === "ON";
// for audio transcriptions
const { payload: audioTranscriptionState, send: sendTranscription } =
useAudioTranscriptionState(camera.name);
const { payload: transcription } = useAudioLiveTranscription(camera.name);
const transcriptionRef = useRef<HTMLDivElement>(null);
useEffect(() => {
if (transcription) {
if (transcriptionRef.current) {
transcriptionRef.current.scrollTop =
transcriptionRef.current.scrollHeight;
}
}
}, [transcription]);
useEffect(() => {
return () => {
// disable transcriptions when unmounting
if (audioTranscriptionState == "ON") sendTranscription("OFF");
};
}, [audioTranscriptionState, sendTranscription]);
// click overlay for ptzs
const [clickOverlay, setClickOverlay] = useState(false);
@ -566,6 +593,9 @@ export default function LiveCameraView({
autotrackingEnabled={
camera.onvif.autotracking.enabled_in_config
}
transcriptionEnabled={
camera.audio_transcription.enabled_in_config
}
fullscreen={fullscreen}
streamName={streamName ?? ""}
setStreamName={setStreamName}
@ -625,6 +655,16 @@ export default function LiveCameraView({
/>
</div>
</TransformComponent>
{camera?.audio?.enabled_in_config &&
audioTranscriptionState == "ON" &&
transcription != null && (
<div
ref={transcriptionRef}
className="text-md scrollbar-container absolute bottom-4 left-1/2 max-h-[15vh] w-[75%] -translate-x-1/2 overflow-y-auto rounded-lg bg-black/70 p-2 text-white md:w-[50%]"
>
{transcription}
</div>
)}
</div>
</div>
{camera.onvif.host != "" && (
@ -983,6 +1023,7 @@ type FrigateCameraFeaturesProps = {
recordingEnabled: boolean;
audioDetectEnabled: boolean;
autotrackingEnabled: boolean;
transcriptionEnabled: boolean;
fullscreen: boolean;
streamName: string;
setStreamName?: (value: string | undefined) => void;
@ -1002,6 +1043,7 @@ function FrigateCameraFeatures({
recordingEnabled,
audioDetectEnabled,
autotrackingEnabled,
transcriptionEnabled,
fullscreen,
streamName,
setStreamName,
@ -1033,6 +1075,8 @@ function FrigateCameraFeatures({
const { payload: audioState, send: sendAudio } = useAudioState(camera.name);
const { payload: autotrackingState, send: sendAutotracking } =
useAutotrackingState(camera.name);
const { payload: transcriptionState, send: sendTranscription } =
useAudioTranscriptionState(camera.name);
// roles
@ -1196,6 +1240,27 @@ function FrigateCameraFeatures({
disabled={!cameraEnabled}
/>
)}
{audioDetectEnabled && transcriptionEnabled && (
<CameraFeatureToggle
className="p-2 md:p-0"
variant={fullscreen ? "overlay" : "primary"}
Icon={
transcriptionState == "ON"
? MdClosedCaption
: MdClosedCaptionDisabled
}
isActive={transcriptionState == "ON"}
title={
transcriptionState == "ON"
? t("transcription.disable")
: t("transcription.enable")
}
onClick={() =>
sendTranscription(transcriptionState == "ON" ? "OFF" : "ON")
}
disabled={!cameraEnabled || audioState == "OFF"}
/>
)}
{autotrackingEnabled && (
<CameraFeatureToggle
className="p-2 md:p-0"
@ -1558,6 +1623,16 @@ function FrigateCameraFeatures({
}
/>
)}
{audioDetectEnabled && transcriptionEnabled && (
<FilterSwitch
label={t("cameraSettings.transcription")}
disabled={audioState == "OFF"}
isChecked={transcriptionState == "ON"}
onCheckedChange={() =>
sendTranscription(transcriptionState == "ON" ? "OFF" : "ON")
}
/>
)}
{autotrackingEnabled && (
<FilterSwitch
label={t("cameraSettings.autotracking")}