Stirling-PDF/scripts/translations/translation_merger.py
Ludy 472ee54098 fix(translations): improve translation merger CLI and sync missing UI strings across locales (#5309)
# Description of Changes

This pull request updates the Arabic translation file
(`frontend/public/locales/ar-AR/translation.toml`) with a large number
of new and improved strings, adding support for new features and
enhancing clarity and coverage across the application. Additionally, it
makes several improvements to the TOML language check script
(`.github/scripts/check_language_toml.py`) and updates the corresponding
GitHub Actions workflow to better track and validate translation
changes.

**Translation updates and enhancements:**

* Added translations for new features and UI elements, including
annotation tools, PDF/A-3b conversion, line art compression, background
removal, split modes, onboarding tours, and more.
* Improved and expanded coverage for settings, security, onboarding, and
help menus, including detailed descriptions and tooltips for new and
existing features.

**TOML language check script improvements:**

* Increased the maximum allowed TOML file size from 500 KB to 570 KB to
accommodate larger translation files.
* Improved file validation logic to more accurately skip or process
files based on directory structure and file type, and added informative
print statements for skipped files.
* Enhanced reporting in the difference check: instead of raising an
exception for an unsafe or oversized file, the script now logs a warning
and continues processing, which makes CI reports more robust and easier
to read.
* Adjusted the placement of file check report lines for clarity in the
generated report.
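
The warn-and-continue behaviour described in the bullets above can be sketched roughly as follows. This is only an illustration, not the code in `.github/scripts/check_language_toml.py`: the function name `collect_warnings`, the constant `MAX_FILE_SIZE`, the warning texts, and the reading of "570 KB" as 570 * 1024 bytes are all assumptions.

```python
from pathlib import Path

# Ceiling taken from the PR text; treating "570 KB" as 570 * 1024 bytes
# is an assumption, and every name in this sketch is hypothetical.
MAX_FILE_SIZE = 570 * 1024


def collect_warnings(files: list[Path]) -> tuple[list[Path], list[str]]:
    """Return (files to process, warnings) without raising on bad input."""
    accepted: list[Path] = []
    warnings: list[str] = []
    for path in files:
        # Skip anything that is not a TOML file instead of aborting.
        if path.suffix != ".toml":
            warnings.append(f"Skipping {path}: not a TOML file")
            continue
        if not path.is_file():
            warnings.append(f"Skipping {path}: file not found")
            continue
        # Oversized files produce a warning rather than an exception.
        size = path.stat().st_size
        if size > MAX_FILE_SIZE:
            warnings.append(
                f"Skipping {path}: {size} bytes exceeds {MAX_FILE_SIZE}"
            )
            continue
        accepted.append(path)
    return accepted, warnings
```

Collecting warnings in a list lets the caller print them all in one report section while the remaining files are still checked.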

**Workflow and CI improvements:**

* Updated the GitHub Actions workflow
(`.github/workflows/check_toml.yml`) to trigger on changes to the
translation script and workflow files, in addition to translation TOMLs,
ensuring all relevant changes are validated.

These changes collectively improve the translation quality and coverage
for Arabic users, enhance the reliability and clarity of the translation
validation process, and ensure smoother CI/CD workflows for localization
updates.

<img width="654" height="133" alt="image"
src="https://github.com/user-attachments/assets/9f3e505d-927f-4dc0-9098-cee70bbe85ca"
/>


---

## Checklist

### General

- [ ] I have read the [Contribution
Guidelines](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/CONTRIBUTING.md)
- [ ] I have read the [Stirling-PDF Developer
Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/DeveloperGuide.md)
(if applicable)
- [ ] I have read the [How to add new languages to
Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/HowToAddNewLanguage.md)
(if applicable)
- [ ] I have performed a self-review of my own code
- [ ] My changes generate no new warnings

### Documentation

- [ ] I have updated relevant docs on [Stirling-PDF's doc
repo](https://github.com/Stirling-Tools/Stirling-Tools.github.io/blob/main/docs/)
(if functionality has heavily changed)
- [ ] I have read the section [Add New Translation
Tags](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/HowToAddNewLanguage.md#add-new-translation-tags)
(for new translation tags only)

### Translations (if applicable)

- [ ] I ran
[`scripts/counter_translation.py`](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/docs/counter_translation.md)

### UI Changes (if applicable)

- [ ] Screenshots or videos demonstrating the UI changes are attached
(e.g., as comments or direct attachments in the PR)

### Testing (if applicable)

- [ ] I have tested my changes locally. Refer to the [Testing
Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/DeveloperGuide.md#6-testing)
for more details.
2026-01-14 00:31:05 +00:00

542 lines
20 KiB
Python

#!/usr/bin/env python3
"""
Translation Merger for Stirling PDF Frontend
Merges missing translations from en-GB into target language files.
Useful for AI-assisted translation workflows.
TOML format only.
"""

import os
import argparse
import json
import shutil
import sys
from datetime import datetime
from pathlib import Path
from typing import Any

import tomllib
import tomli_w


class TranslationMerger:
    def __init__(
        self,
        locales_dir: str = os.path.join(os.getcwd(), "frontend", "public", "locales"),
        ignore_file: str = os.path.join(
            os.getcwd(), "scripts", "ignore_translation.toml"
        ),
    ):
        self.locales_dir = Path(locales_dir)
        self.golden_truth_file = self.locales_dir / "en-GB" / "translation.toml"
        self.golden_truth = self._load_translation_file(self.golden_truth_file)
        self.ignore_file = Path(ignore_file)
        self.ignore_patterns = self._load_ignore_patterns()

    def _load_translation_file(self, file_path: Path) -> dict[str, Any]:
        """Load TOML translation file."""
        try:
            with open(file_path, "rb") as f:
                return tomllib.load(f)
        except FileNotFoundError:
            print(f"Error: File not found: {file_path}")
            sys.exit(1)
        except Exception as e:
            print(f"Error: Invalid file {file_path}: {e}")
            sys.exit(1)

    def _save_translation_file(
        self, data: dict[str, Any], file_path: Path, backup: bool = False
    ) -> None:
        """Save TOML translation file with backup option."""
        if backup and file_path.exists():
            backup_path = file_path.with_suffix(
                f".backup.{datetime.now().strftime('%Y%m%d_%H%M%S')}.toml"
            )
            shutil.copy2(file_path, backup_path)
            print(f"Backup created: {backup_path}")

        with open(file_path, "wb") as f:
            tomli_w.dump(data, f)

    def _load_ignore_patterns(self) -> dict[str, set[str]]:
        """Load ignore patterns from TOML file."""
        if not self.ignore_file.exists():
            return {}

        try:
            with open(self.ignore_file, "rb") as f:
                ignore_data = tomllib.load(f)

            # Convert to sets for faster lookup
            return {
                lang: set(data.get("ignore", [])) for lang, data in ignore_data.items()
            }
        except Exception as e:
            print(f"Warning: Could not load ignore file {self.ignore_file}: {e}")
            return {}

    def _get_nested_value(self, data: dict[str, Any], key_path: str) -> Any:
        """Get value from nested dict using dot notation."""
        keys = key_path.split(".")
        current = data
        for key in keys:
            if isinstance(current, dict) and key in current:
                current = current[key]
            else:
                return None
        return current

    def _set_nested_value(
        self, data: dict[str, Any], key_path: str, value: Any
    ) -> None:
        """Set value in nested dict using dot notation."""
        keys = key_path.split(".")
        current = data
        for key in keys[:-1]:
            if key not in current:
                current[key] = {}
            elif not isinstance(current[key], dict):
                # If the current value is not a dict, we can't nest into it
                # This handles cases where a key exists as a string but we need to make it a dict
                print(
                    f"Warning: Converting non-dict value at '{key}' to dict to allow nesting"
                )
                current[key] = {}
            current = current[key]
        current[keys[-1]] = value

    def _flatten_dict(
        self, d: dict[str, Any], parent_key: str = "", separator: str = "."
    ) -> dict[str, Any]:
        """Flatten nested dictionary into dot-notation keys."""
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}{separator}{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(self._flatten_dict(v, new_key, separator).items())
            else:
                items.append((new_key, v))
        return dict(items)

    def _delete_nested_key(self, data: dict[str, Any], key_path: str) -> bool:
        """Delete a nested key using dot notation and clean up empty branches."""

        def _delete(current: dict[str, Any], keys: list[str]) -> bool:
            key = keys[0]
            if key not in current:
                return False
            if len(keys) == 1:
                del current[key]
                return True
            if not isinstance(current[key], dict):
                return False
            removed = _delete(current[key], keys[1:])
            if removed and current[key] == {}:
                del current[key]
            return removed

        return _delete(data, key_path.split("."))

    def get_missing_keys(self, target_file: Path) -> list[str]:
        """Get list of missing keys in target file."""
        lang_code = target_file.parent.name.replace("-", "_")
        ignore_set = self.ignore_patterns.get(lang_code, set())

        if not target_file.exists():
            golden_keys = set(self._flatten_dict(self.golden_truth).keys())
            return sorted(golden_keys - ignore_set)

        target_data = self._load_translation_file(target_file)
        golden_flat = self._flatten_dict(self.golden_truth)
        target_flat = self._flatten_dict(target_data)

        missing = set(golden_flat.keys()) - set(target_flat.keys())
        return sorted(missing - ignore_set)

    def get_unused_keys(self, target_file: Path) -> list[str]:
        """Get list of keys that are not present in the golden truth file."""
        if not target_file.exists():
            return []

        target_data = self._load_translation_file(target_file)
        target_flat = self._flatten_dict(target_data)
        golden_flat = self._flatten_dict(self.golden_truth)
        return sorted(set(target_flat.keys()) - set(golden_flat.keys()))

    def add_missing_translations(
        self,
        target_file: Path,
        keys_to_add: list[str] | None = None,
        save: bool = True,
        backup: bool = False,
    ) -> dict[str, Any]:
        """Add missing translations from en-GB to target file and optionally save."""
        if not target_file.parent.exists():
            target_file.parent.mkdir(parents=True, exist_ok=True)
            target_data = {}
        elif target_file.exists():
            target_data = self._load_translation_file(target_file)
        else:
            target_data = {}

        golden_flat = self._flatten_dict(self.golden_truth)
        missing_keys = keys_to_add or self.get_missing_keys(target_file)

        added_count = 0
        for key in missing_keys:
            if key in golden_flat:
                value = golden_flat[key]
                # Add the English value directly without [UNTRANSLATED] marker
                self._set_nested_value(target_data, key, value)
                added_count += 1

        if added_count > 0 and save:
            self._save_translation_file(target_data, target_file, backup)

        return {
            "added_count": added_count,
            "missing_keys": missing_keys,
            "data": target_data,
        }

    def extract_untranslated_entries(
        self, target_file: Path, output_file: Path | None = None
    ) -> dict[str, Any]:
        """Extract entries marked as untranslated or identical to en-GB for AI translation."""
        if not target_file.exists():
            print(f"Error: Target file does not exist: {target_file}")
            return {}

        target_data = self._load_translation_file(target_file)
        golden_flat = self._flatten_dict(self.golden_truth)
        target_flat = self._flatten_dict(target_data)

        untranslated_entries = {}
        for key, value in target_flat.items():
            if key in golden_flat:
                golden_value = golden_flat[key]
                # Check if marked as untranslated
                if isinstance(value, str) and value.startswith("[UNTRANSLATED]"):
                    untranslated_entries[key] = {
                        "original": golden_value,
                        "current": value,
                        "reason": "marked_untranslated",
                    }
                # Check if identical to golden (and should be translated)
                elif value == golden_value and not self._is_expected_identical(
                    key, value
                ):
                    untranslated_entries[key] = {
                        "original": golden_value,
                        "current": value,
                        "reason": "identical_to_english",
                    }

        if output_file:
            with open(output_file, "w", encoding="utf-8") as f:
                json.dump(untranslated_entries, f, indent=2, ensure_ascii=False)

        return untranslated_entries

    def _is_expected_identical(self, key: str, value: str) -> bool:
        """Check if a key-value pair is expected to be identical across languages."""
        identical_patterns = ["language.direction"]

        if str(value).strip() in ["ltr", "rtl", "True", "False", "true", "false"]:
            return True

        for pattern in identical_patterns:
            if pattern in key.lower():
                return True

        return False

    def apply_translations(
        self,
        target_file: Path,
        translations: dict[str, str],
        backup: bool = False,
    ) -> dict[str, Any]:
        """Apply provided translations to target file."""
        if not target_file.exists():
            print(f"Error: Target file does not exist: {target_file}")
            return {"success": False, "error": "File not found"}

        target_data = self._load_translation_file(target_file)
        applied_count = 0
        errors = []

        for key, translation in translations.items():
            try:
                # Remove [UNTRANSLATED] marker if present
                if isinstance(translation, str) and translation.startswith(
                    "[UNTRANSLATED]"
                ):
                    translation = translation.replace("[UNTRANSLATED]", "").strip()
                self._set_nested_value(target_data, key, translation)
                applied_count += 1
            except Exception as e:
                errors.append(f"Error setting {key}: {e}")

        if applied_count > 0:
            self._save_translation_file(target_data, target_file, backup)

        return {
            "success": applied_count > 0,
            "applied_count": applied_count,
            "errors": errors,
            "data": target_data,
        }

    def remove_unused_translations(
        self,
        target_file: Path,
        keys_to_remove: list[str] | None = None,
        save: bool = True,
        backup: bool = False,
    ) -> dict[str, Any]:
        """Remove translations that are not present in the golden truth file."""
        if not target_file.exists():
            print(f"Error: Target file does not exist: {target_file}")
            # Include removed_count so callers that read result["removed_count"]
            # (as main() does) do not hit a KeyError for a missing file.
            return {"success": False, "error": "File not found", "removed_count": 0}

        target_data = self._load_translation_file(target_file)
        keys_to_remove = keys_to_remove or self.get_unused_keys(target_file)

        removed_count = 0
        for key in keys_to_remove:
            if self._delete_nested_key(target_data, key):
                removed_count += 1

        if removed_count > 0 and save:
            self._save_translation_file(target_data, target_file, backup)

        return {
            "success": removed_count > 0,
            "removed_count": removed_count,
            "data": target_data,
        }

    def create_translation_template(self, target_file: Path, output_file: Path) -> None:
        """Create a template file for AI translation with context."""
        untranslated = self.extract_untranslated_entries(target_file)

        template = {
            "metadata": {
                "source_language": "en-GB",
                "target_language": target_file.parent.name,
                "total_entries": len(untranslated),
                "created_at": datetime.now().isoformat(),
                "instructions": 'Translate the "original" values to the target language. Keep the same keys.',
            },
            "translations": {},
        }

        for key, entry in untranslated.items():
            template["translations"][key] = {
                "original": entry["original"],
                "translated": "",  # AI should fill this
                "context": self._get_context_for_key(key),
                "reason": entry["reason"],
            }

        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(template, f, indent=2, ensure_ascii=False)

        print(f"Translation template created: {output_file}")
        print(f"Contains {len(untranslated)} entries to translate")

    def _get_context_for_key(self, key: str) -> str:
        """Get context information for a translation key."""
        parts = key.split(".")
        if len(parts) >= 2:
            return f"Section: {parts[0]}, Property: {parts[-1]}"
        return f"Property: {parts[-1]}"


def main():
    parser = argparse.ArgumentParser(
        description="Merge and manage translation files",
        epilog="Works with TOML translation files.",
    )
    parser.add_argument(
        "--locales-dir",
        default=os.path.join(os.getcwd(), "frontend", "public", "locales"),
        help="Path to locales directory",
    )
    parser.add_argument(
        "--ignore-file",
        default=os.path.join(os.getcwd(), "scripts", "ignore_translation.toml"),
        help="Path to ignore patterns TOML file",
    )
    parser.add_argument(
        "language",
        nargs="?",
        help="Target language code (e.g., fr-FR). If omitted, add-missing and remove-unused run for all locales except en-GB.",
    )

    subparsers = parser.add_subparsers(dest="command", help="Available commands")

    # Add missing command
    add_parser = subparsers.add_parser(
        "add-missing", help="Add missing translations from en-GB"
    )
    add_parser.add_argument(
        "--backup", action="store_true", help="Create backup before modifying files"
    )

    # Extract untranslated command
    extract_parser = subparsers.add_parser(
        "extract-untranslated", help="Extract untranslated entries"
    )
    extract_parser.add_argument("--output", help="Output file path")

    # Create template command
    template_parser = subparsers.add_parser(
        "create-template", help="Create AI translation template"
    )
    template_parser.add_argument(
        "--output", required=True, help="Output template file path"
    )

    # Apply translations command
    apply_parser = subparsers.add_parser(
        "apply-translations", help="Apply translations from JSON file"
    )
    apply_parser.add_argument(
        "--translations-file", required=True, help="JSON file with translations"
    )
    apply_parser.add_argument(
        "--backup", action="store_true", help="Create backup before modifying files"
    )

    # Remove unused translations command
    remove_parser = subparsers.add_parser(
        "remove-unused", help="Remove unused translations not present in en-GB"
    )
    remove_parser.add_argument(
        "--backup", action="store_true", help="Create backup before modifying files"
    )

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        return

    merger = TranslationMerger(args.locales_dir, args.ignore_file)

    if args.command == "add-missing":
        if args.language:
            # Find translation file
            lang_dir = Path(args.locales_dir) / args.language
            target_file = lang_dir / "translation.toml"
            print(f"Processing {args.language}...")
            result = merger.add_missing_translations(target_file, backup=args.backup)
            print(f"Added {result['added_count']} missing translations")
        else:
            total_added = 0
            for lang_dir in sorted(Path(args.locales_dir).iterdir()):
                if not lang_dir.is_dir() or lang_dir.name == "en-GB":
                    continue
                target_file = lang_dir / "translation.toml"
                print(f"Processing {lang_dir.name}...")
                result = merger.add_missing_translations(
                    target_file, backup=args.backup
                )
                added = result["added_count"]
                total_added += added
                print(f"Added {added} missing translations")
            print(f"\nTotal added across all languages: {total_added}")

    elif args.command == "remove-unused":
        if args.language:
            lang_dir = Path(args.locales_dir) / args.language
            target_file = lang_dir / "translation.toml"
            print(f"Processing {args.language}...")
            result = merger.remove_unused_translations(target_file, backup=args.backup)
            print(f"Removed {result['removed_count']} unused translations")
        else:
            total_removed = 0
            for lang_dir in sorted(Path(args.locales_dir).iterdir()):
                if not lang_dir.is_dir() or lang_dir.name == "en-GB":
                    continue
                target_file = lang_dir / "translation.toml"
                print(f"Processing {lang_dir.name}...")
                result = merger.remove_unused_translations(
                    target_file, backup=args.backup
                )
                removed = result["removed_count"]
                total_removed += removed
                print(f"Removed {removed} unused translations")
            print(f"\nTotal removed across all languages: {total_removed}")

    elif args.command == "extract-untranslated":
        if not args.language:
            print("Error: language is required for extract-untranslated")
            sys.exit(1)
        lang_dir = Path(args.locales_dir) / args.language
        target_file = lang_dir / "translation.toml"
        output_file = (
            Path(args.output)
            if args.output
            else target_file.with_suffix(".untranslated.json")
        )
        untranslated = merger.extract_untranslated_entries(target_file, output_file)
        print(f"Extracted {len(untranslated)} untranslated entries to {output_file}")

    elif args.command == "create-template":
        if not args.language:
            print("Error: language is required for create-template")
            sys.exit(1)
        lang_dir = Path(args.locales_dir) / args.language
        target_file = lang_dir / "translation.toml"
        merger.create_translation_template(target_file, Path(args.output))

    elif args.command == "apply-translations":
        if not args.language:
            print("Error: language is required for apply-translations")
            sys.exit(1)
        lang_dir = Path(args.locales_dir) / args.language
        target_file = lang_dir / "translation.toml"

        with open(args.translations_file, "r", encoding="utf-8") as f:
            translations_data = json.load(f)

        # Extract translations from template format or simple dict
        if "translations" in translations_data:
            translations = {
                k: v["translated"]
                for k, v in translations_data["translations"].items()
                if v.get("translated")
            }
        else:
            translations = translations_data

        result = merger.apply_translations(
            target_file, translations, backup=args.backup
        )

        if result["success"]:
            print(f"Applied {result['applied_count']} translations")
            if result["errors"]:
                print(f"Errors encountered: {len(result['errors'])}")
                for error in result["errors"][:5]:
                    print(f" - {error}")
        else:
            print("No translations applied.")


if __name__ == "__main__":
    main()
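
The key comparison behind `get_missing_keys` and `get_unused_keys` can be tried in isolation. The sketch below is not part of the script: `flatten` is a free-function stand-in for `_flatten_dict`, and the two sample dictionaries are made up to play the roles of en-GB and a target locale.

```python
from typing import Any


def flatten(d: dict[str, Any], parent: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dot-notation keys (stand-in for _flatten_dict)."""
    flat: dict[str, Any] = {}
    for key, value in d.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Made-up sample data standing in for en-GB and a target locale.
golden = {"home": {"title": "Home", "desc": "Welcome"}, "save": "Save"}
target = {"home": {"title": "Accueil"}, "legacy": {"old": "Ancien"}}

# Keys in the golden file but not the target are "missing";
# keys in the target but not the golden file are "unused".
missing = sorted(set(flatten(golden)) - set(flatten(target)))
unused = sorted(set(flatten(target)) - set(flatten(golden)))

print(missing)  # ['home.desc', 'save']
print(unused)   # ['legacy.old']
```

Working on flattened keys reduces both checks to plain set differences, so nested tables of any depth are compared the same way.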