mirror of
https://github.com/Frooodle/Stirling-PDF.git
synced 2026-02-17 13:52:14 +01:00
translations (#4906)
# Description of Changes

<!--
Please provide a summary of the changes, including:

- What was changed
- Why the change was made
- Any challenges encountered

Closes #(issue_number)
-->

---

## Checklist

### General

- [ ] I have read the [Contribution Guidelines](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/CONTRIBUTING.md)
- [ ] I have read the [Stirling-PDF Developer Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/DeveloperGuide.md) (if applicable)
- [ ] I have read the [How to add new languages to Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/HowToAddNewLanguage.md) (if applicable)
- [ ] I have performed a self-review of my own code
- [ ] My changes generate no new warnings

### Documentation

- [ ] I have updated relevant docs on [Stirling-PDF's doc repo](https://github.com/Stirling-Tools/Stirling-Tools.github.io/blob/main/docs/) (if functionality has heavily changed)
- [ ] I have read the section [Add New Translation Tags](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/HowToAddNewLanguage.md#add-new-translation-tags) (for new translation tags only)

### Translations (if applicable)

- [ ] I ran [`scripts/counter_translation.py`](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/docs/counter_translation.md)

### UI Changes (if applicable)

- [ ] Screenshots or videos demonstrating the UI changes are attached (e.g., as comments or direct attachments in the PR)

### Testing (if applicable)

- [ ] I have tested my changes locally. Refer to the [Testing Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/DeveloperGuide.md#6-testing) for more details.
@@ -2,6 +2,43 @@

This directory contains Python scripts for managing frontend translations in Stirling PDF. These tools help analyze, merge, validate, and manage translations against the en-GB golden truth file.

## Quick Start - Automated Translation (RECOMMENDED)

The **fastest and easiest way** to translate a language is using the automated pipeline:

```bash
# Set your OpenAI API key
export OPENAI_API_KEY=your_openai_api_key_here

# Translate a language automatically (extract → translate → merge → beautify → verify)
python3 scripts/translations/auto_translate.py es-ES

# With custom batch size (default: 500 entries per batch)
python3 scripts/translations/auto_translate.py es-ES --batch-size 600

# Keep temporary files for inspection
python3 scripts/translations/auto_translate.py es-ES --no-cleanup
```

**What it does:**
1. Extracts untranslated entries from the language file
2. Splits into batches (default 500 entries each)
3. Translates each batch using GPT-5 with specialized prompts
4. Validates placeholders are preserved
5. Merges translated batches
6. Applies translations to language file
7. Beautifies structure to match en-GB
8. Cleans up temporary files
9. Reports final completion percentage
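
As a rough illustration of step 1, the sketch below flattens nested translation JSON into dot-separated keys and treats an entry as untranslated when it is missing, identical to the en-GB text, or prefixed with `[UNTRANSLATED]`. It follows the logic in `auto_translate.py` from this change, but the helper names `flatten` and `find_untranslated` and the sample data are illustrative assumptions, not part of the scripts.

```python
def flatten(d, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in d.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = str(value)
    return flat


def find_untranslated(golden, target):
    """Return golden entries that are missing, unchanged, or marked in target."""
    golden_flat = flatten(golden)
    target_flat = flatten(target)
    return {
        key: text
        for key, text in golden_flat.items()
        if key not in target_flat
        or target_flat[key] == text
        or target_flat[key].startswith("[UNTRANSLATED]")
    }


# Hypothetical sample data, for illustration only
golden = {"home": {"title": "Merge", "desc": "Combine files"}, "ok": "OK"}
target = {"home": {"title": "Fusionar", "desc": "[UNTRANSLATED] Combine files"}}

print(find_untranslated(golden, target))
```

One consequence of this rule is that strings which are legitimately identical in both languages (for example `OK`) are reported as untranslated on every run.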

**Time:** ~8-10 minutes per language with 1200+ untranslated entries

**Cost:** ~$2-4 per language using GPT-5 (or use `gpt-5-mini` for lower cost)

See [`auto_translate.py`](#5-auto_translatepy---automated-translation-pipeline) for full details.

---

## Scripts Overview

### 0. Validation Scripts (Run First!)

@@ -191,7 +228,97 @@ python scripts/translations/compact_translator.py it-IT --output to_translate.js
- Batch size control for manageable chunks
- 50-80% fewer characters than other extraction methods

### 5. `json_beautifier.py`
### 5. `auto_translate.py` - Automated Translation Pipeline

**NEW: Fully automated translation workflow using GPT-5.**

Combines all translation steps into a single command that handles everything from extraction to verification.

**Usage:**
```bash
# Basic usage (requires OPENAI_API_KEY environment variable)
export OPENAI_API_KEY=your_api_key
python3 scripts/translations/auto_translate.py es-ES

# With inline API key
python3 scripts/translations/auto_translate.py es-ES --api-key YOUR_KEY

# Custom batch size (default: 500 entries)
python3 scripts/translations/auto_translate.py es-ES --batch-size 600

# Custom timeout per batch (default: 600 seconds / 10 minutes)
python3 scripts/translations/auto_translate.py es-ES --timeout 900

# Keep temporary files for debugging
python3 scripts/translations/auto_translate.py es-ES --no-cleanup

# Skip final verification
python3 scripts/translations/auto_translate.py es-ES --skip-verification
```

**Features:**
- Fully automated end-to-end translation pipeline
- Uses GPT-5 with specialized prompts for Stirling PDF
- Preserves all placeholders (`{n}`, `{{variable}}`, etc.)
- Maintains consistent terminology
- Validates translations automatically
- Creates backups before modifying files
- Reports detailed progress and final completion %

**Pipeline Steps:**
1. **Extract**: Finds all untranslated entries
2. **Split**: Divides into manageable batches (default: 500 entries)
3. **Translate**: Uses GPT-5 to translate each batch with specialized prompts
4. **Validate**: Ensures placeholders are preserved
5. **Merge**: Combines all translated batches
6. **Apply**: Updates the language file
7. **Beautify**: Restructures to match en-GB format
8. **Cleanup**: Removes temporary files
9. **Verify**: Reports final completion percentage
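
The **Split** step can be pictured with the minimal sketch below. It assumes the untranslated entries are already held in a flat dict; the function name `split_into_batches` and the generated sample keys are hypothetical, and the script itself additionally writes each batch to a `<lang>_batch_<i>_of_<n>.json` file.

```python
def split_into_batches(entries, batch_size=500):
    """Split a flat dict of entries into consecutive dicts of at most batch_size items."""
    items = list(entries.items())
    return [
        dict(items[start:start + batch_size])
        for start in range(0, len(items), batch_size)
    ]


# 1200 made-up entries split with the default batch size
entries = {f"key{i}": f"text {i}" for i in range(1200)}
batches = split_into_batches(entries)

print([len(batch) for batch in batches])  # [500, 500, 200]
```

With the default of 500, a language with 1200 untranslated entries produces three batches, so three API calls.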

**Translation Quality:**
- Preserves ALL placeholders exactly as-is
- Keeps HTML tags intact (`<strong>`, `<br>`, etc.)
- Doesn't translate technical terms (PDF, API, OAuth2, etc.)
- Maintains consistent terminology throughout
- Uses appropriate formal/informal tone per language
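
To make the placeholder rule concrete, here is a small sketch of one way to compare placeholders between a source string and its translation. It is an illustration under stated assumptions rather than the scripts' exact check: the names `PLACEHOLDER`, `placeholders`, and `placeholders_preserved` are hypothetical, and this version also counts repeated placeholders.

```python
import re

# Try the double-brace form first so {{variable}} is captured whole,
# then fall back to the single-brace form such as {n}.
PLACEHOLDER = re.compile(r"\{\{[^{}]+\}\}|\{[^{}]+\}")


def placeholders(text):
    """Return the sorted list of placeholders found in text."""
    return sorted(PLACEHOLDER.findall(text))


def placeholders_preserved(original, translated):
    """True when both strings contain exactly the same placeholders."""
    return placeholders(original) == placeholders(translated)


print(placeholders_preserved("Page {n} of {total}", "Página {n} de {total}"))  # True
print(placeholders_preserved("Hello {{name}}", "Hola {{nombre}}"))  # False
```

A translation that renames or drops a placeholder fails the comparison and can be flagged for review before it is merged.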

**Supported Languages:**
All language codes from `frontend/public/locales/` (e.g., es-ES, de-DE, fr-FR, zh-CN, ar-AR, etc.)
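
To see which codes are available in a checkout, a sketch along these lines could list them. It assumes each language lives in its own directory containing a `translation.json` (the layout `auto_translate.py` reads) and that it is run from the repository root; the helper name `available_languages` is hypothetical.

```python
from pathlib import Path


def available_languages(locales_dir="frontend/public/locales", golden="en-GB"):
    """List language codes that have a translation.json, excluding the golden language."""
    root = Path(locales_dir)
    if not root.is_dir():
        return []
    return sorted(
        path.name
        for path in root.iterdir()
        if path.is_dir()
        and path.name != golden
        and (path / "translation.json").is_file()
    )


# Prints an empty list when not run from the repository root
print(available_languages())
```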

### 6. `batch_translator.py` - GPT-5 Translation Engine

Low-level translation script used by `auto_translate.py`. Can be used standalone for manual batch translation.

**Usage:**
```bash
# Translate single batch file
python3 scripts/translations/batch_translator.py my_batch.json --language es-ES --api-key YOUR_KEY

# Translate multiple batches
python3 scripts/translations/batch_translator.py batch_*.json --language de-DE --api-key YOUR_KEY

# Use different GPT model
python3 scripts/translations/batch_translator.py batch.json --language fr-FR --model gpt-5-mini

# Skip validation
python3 scripts/translations/batch_translator.py batch.json --language it-IT --skip-validation
```

**Features:**
- Translates JSON batch files using OpenAI GPT-5
- Specialized system prompts for Stirling PDF translations
- Automatic placeholder validation
- Supports pattern matching for multiple files
- Configurable model selection (gpt-5, gpt-5-mini, gpt-5-nano)
- Rate limiting with configurable delays

**Models:**
- `gpt-5` (default): Best quality, $1.25/1M input, $10/1M output
- `gpt-5-mini`: Balanced quality/cost
- `gpt-5-nano`: Fastest, most economical
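
For a back-of-the-envelope budget, the `gpt-5` prices quoted above can be turned into a small estimate. This is only a sketch: the function name `estimate_cost_usd` is hypothetical and the token counts are made-up inputs, not measurements from a real run.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million=1.25,
                      output_price_per_million=10.0):
    """Estimate API cost in USD from token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Hypothetical run: 100k input tokens and 120k output tokens
print(f"${estimate_cost_usd(100_000, 120_000):.2f}")  # $1.32
```

Because output tokens cost eight times as much as input tokens at these prices, the length of the model's response dominates the total.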

### 7. `json_beautifier.py`
Restructures and beautifies translation JSON files to match en-GB structure exactly.

**Usage:**

324
scripts/translations/auto_translate.py
Normal file
@@ -0,0 +1,324 @@
#!/usr/bin/env python3
"""
Automated Translation Pipeline
Extracts, translates, merges, and beautifies translations for a language.
"""

import json
import sys
import argparse
import os
import subprocess
from pathlib import Path
import time


def run_command(cmd, description=""):
    """Run a shell command and return success status."""
    if description:
        print(f"\n{'='*60}")
        print(f"Step: {description}")
        print(f"{'='*60}")

    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

    if result.stdout:
        print(result.stdout)
    if result.stderr:
        print(result.stderr, file=sys.stderr)

    return result.returncode == 0


def extract_untranslated(language_code, batch_size=500):
    """Extract untranslated entries and split into batches."""
    print(f"\n🔍 Extracting untranslated entries for {language_code}...")

    # Load files
    golden_path = Path(f'frontend/public/locales/en-GB/translation.json')
    lang_path = Path(f'frontend/public/locales/{language_code}/translation.json')

    if not golden_path.exists():
        print(f"Error: Golden truth file not found: {golden_path}")
        return None

    if not lang_path.exists():
        print(f"Error: Language file not found: {lang_path}")
        return None

    def load_json(path):
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)

    def flatten_dict(d, parent_key='', separator='.'):
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}{separator}{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(flatten_dict(v, new_key, separator).items())
            else:
                items.append((new_key, str(v)))
        return dict(items)

    golden = load_json(golden_path)
    lang_data = load_json(lang_path)

    golden_flat = flatten_dict(golden)
    lang_flat = flatten_dict(lang_data)

    # Find untranslated
    untranslated = {}
    for key, value in golden_flat.items():
        if (key not in lang_flat or
            lang_flat.get(key) == value or
            (isinstance(lang_flat.get(key), str) and lang_flat.get(key).startswith("[UNTRANSLATED]"))):
            untranslated[key] = value

    total = len(untranslated)
    print(f"Found {total} untranslated entries")

    if total == 0:
        print("✓ Language is already complete!")
        return []

    # Split into batches
    entries = list(untranslated.items())
    num_batches = (total + batch_size - 1) // batch_size

    batch_files = []
    lang_code_safe = language_code.replace('-', '_')

    for i in range(num_batches):
        start = i * batch_size
        end = min((i + 1) * batch_size, total)
        batch = dict(entries[start:end])

        filename = f'{lang_code_safe}_batch_{i+1}_of_{num_batches}.json'
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(batch, f, ensure_ascii=False, separators=(',', ':'))

        batch_files.append(filename)
        print(f" Created {filename} with {len(batch)} entries")

    return batch_files


def translate_batches(batch_files, language_code, api_key, timeout=600):
    """Translate all batch files using GPT-5."""
    if not batch_files:
        return []

    print(f"\n🤖 Translating {len(batch_files)} batches using GPT-5...")
    print(f"Timeout: {timeout}s ({timeout//60} minutes) per batch")

    translated_files = []

    for i, batch_file in enumerate(batch_files, 1):
        print(f"\n[{i}/{len(batch_files)}] Translating {batch_file}...")

        # Always pass API key since it's required. Arguments go in a list
        # (no shell) so the key and file name are never parsed by a shell.
        cmd = [
            'python3', 'scripts/translations/batch_translator.py', batch_file,
            '--language', language_code,
            '--api-key', api_key,
        ]

        # Run with timeout; a timed-out batch is reported as a failure
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            print(f"✗ Timed out after {timeout}s while translating {batch_file}")
            return None

        if result.stdout:
            print(result.stdout)
        if result.stderr:
            print(result.stderr, file=sys.stderr)

        if result.returncode != 0:
            print(f"✗ Failed to translate {batch_file}")
            return None

        translated_file = batch_file.replace('.json', '_translated.json')
        translated_files.append(translated_file)

        # Small delay between batches
        if i < len(batch_files):
            time.sleep(1)

    print(f"\n✓ All {len(batch_files)} batches translated successfully")
    return translated_files


def merge_translations(translated_files, language_code):
    """Merge all translated batch files."""
    if not translated_files:
        return None

    print(f"\n🔗 Merging {len(translated_files)} translated batches...")

    merged = {}
    for filename in translated_files:
        if not Path(filename).exists():
            print(f"Error: Translated file not found: {filename}")
            return None

        with open(filename, 'r', encoding='utf-8') as f:
            merged.update(json.load(f))

    lang_code_safe = language_code.replace('-', '_')
    merged_file = f'{lang_code_safe}_merged.json'

    with open(merged_file, 'w', encoding='utf-8') as f:
        json.dump(merged, f, ensure_ascii=False, separators=(',', ':'))

    print(f"✓ Merged {len(merged)} translations into {merged_file}")
    return merged_file


def apply_translations(merged_file, language_code):
    """Apply merged translations to the language file."""
    print(f"\n📝 Applying translations to {language_code}...")

    cmd = f'python3 scripts/translations/translation_merger.py {language_code} apply-translations --translations-file {merged_file}'

    if not run_command(cmd):
        print(f"✗ Failed to apply translations")
        return False

    print(f"✓ Translations applied successfully")
    return True


def beautify_translations(language_code):
    """Beautify translation file to match en-GB structure."""
    print(f"\n✨ Beautifying {language_code} translation file...")

    cmd = f'python3 scripts/translations/json_beautifier.py --language {language_code}'

    if not run_command(cmd):
        print(f"✗ Failed to beautify translations")
        return False

    print(f"✓ Translation file beautified")
    return True


def cleanup_temp_files(language_code):
    """Remove temporary batch files."""
    print(f"\n🧹 Cleaning up temporary files...")

    lang_code_safe = language_code.replace('-', '_')
    patterns = [
        f'{lang_code_safe}_batch_*.json',
        f'{lang_code_safe}_merged.json'
    ]

    import glob
    removed = 0
    for pattern in patterns:
        for file in glob.glob(pattern):
            Path(file).unlink()
            removed += 1

    print(f"✓ Removed {removed} temporary files")


def verify_completion(language_code):
    """Check final completion percentage."""
    print(f"\n📊 Verifying completion...")

    cmd = f'python3 scripts/translations/translation_analyzer.py --language {language_code} --summary'
    run_command(cmd)


def main():
    parser = argparse.ArgumentParser(
        description='Automated translation pipeline for Stirling PDF',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Translate Spanish with API key in environment
  export OPENAI_API_KEY=your_key_here
  python3 scripts/translations/auto_translate.py es-ES

  # Translate German with inline API key
  python3 scripts/translations/auto_translate.py de-DE --api-key YOUR_KEY

  # Translate Italian with custom batch size
  python3 scripts/translations/auto_translate.py it-IT --batch-size 600

  # Skip cleanup (keep temporary files for inspection)
  python3 scripts/translations/auto_translate.py fr-FR --no-cleanup
"""
    )

    parser.add_argument('language', help='Language code (e.g., es-ES, de-DE, zh-CN)')
    parser.add_argument('--api-key', help='OpenAI API key (or set OPENAI_API_KEY env var)')
    parser.add_argument('--batch-size', type=int, default=500, help='Entries per batch (default: 500)')
    parser.add_argument('--no-cleanup', action='store_true', help='Keep temporary batch files')
    parser.add_argument('--skip-verification', action='store_true', help='Skip final completion check')
    parser.add_argument('--timeout', type=int, default=600, help='Timeout per batch in seconds (default: 600 = 10 minutes)')

    args = parser.parse_args()

    # Verify API key
    api_key = args.api_key or os.environ.get('OPENAI_API_KEY')
    if not api_key:
        print("Error: OpenAI API key required. Provide via --api-key or OPENAI_API_KEY environment variable")
        sys.exit(1)

    print("="*60)
    print(f"Automated Translation Pipeline")
    print(f"Language: {args.language}")
    print(f"Batch Size: {args.batch_size} entries")
    print("="*60)

    start_time = time.time()

    try:
        # Step 1: Extract and split
        batch_files = extract_untranslated(args.language, args.batch_size)
        if batch_files is None:
            sys.exit(1)

        if len(batch_files) == 0:
            print("\n✓ Nothing to translate!")
            sys.exit(0)

        # Step 2: Translate all batches
        translated_files = translate_batches(batch_files, args.language, api_key, args.timeout)
        if translated_files is None:
            sys.exit(1)

        # Step 3: Merge translations
        merged_file = merge_translations(translated_files, args.language)
        if merged_file is None:
            sys.exit(1)

        # Step 4: Apply translations
        if not apply_translations(merged_file, args.language):
            sys.exit(1)

        # Step 5: Beautify
        if not beautify_translations(args.language):
            sys.exit(1)

        # Step 6: Cleanup
        if not args.no_cleanup:
            cleanup_temp_files(args.language)

        # Step 7: Verify
        if not args.skip_verification:
            verify_completion(args.language)

        elapsed = time.time() - start_time
        print("\n" + "="*60)
        print(f"✅ Translation pipeline completed successfully!")
        print(f"Time elapsed: {elapsed:.1f} seconds")
        print("="*60)

    except KeyboardInterrupt:
        print("\n\n⚠ Translation interrupted by user")
        sys.exit(1)
    except Exception as e:
        print(f"\n\n✗ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()
321
scripts/translations/batch_translator.py
Normal file
@@ -0,0 +1,321 @@
#!/usr/bin/env python3
"""
Batch Translation Script using OpenAI API
Automatically translates JSON batch files to target language while preserving:
- Placeholders: {n}, {total}, {filename}, {{variable}}
- HTML tags: <strong>, </strong>, etc.
- Technical terms: PDF, API, OAuth2, SAML2, JWT, etc.
"""

import json
import sys
import argparse
from pathlib import Path
import time

try:
    from openai import OpenAI
except ImportError:
    print("Error: openai package not installed. Install with: pip install openai")
    sys.exit(1)


class BatchTranslator:
    def __init__(self, api_key: str, model: str = "gpt-5"):
        """Initialize translator with OpenAI API key."""
        self.client = OpenAI(api_key=api_key)
        self.model = model

    def get_translation_prompt(self, language_name: str, language_code: str) -> str:
        """Generate the system prompt for translation."""
        # Literal braces are doubled because this is an f-string:
        # {{n}} renders as {n} and {{{{variable}}}} renders as {{variable}}.
        return f"""You are a professional translator for Stirling PDF, an open-source PDF manipulation tool.

Translate the following JSON from English to {language_name} ({language_code}) for the Stirling PDF user interface.

CRITICAL RULES - MUST FOLLOW EXACTLY:

1. PRESERVE ALL PLACEHOLDERS EXACTLY AS-IS:
   - Single braces: {{n}}, {{total}}, {{filename}}, {{count}}, {{date}}, {{planName}}, {{toolName}}, {{variable}}
   - Double braces: {{{{variable}}}}
   - Never translate, modify, or remove these - they are template variables

2. KEEP ALL HTML TAGS INTACT:
   - <strong>, </strong>, <br>, <code>, </code>, etc.
   - Do not translate tag names, only text between tags

3. DO NOT TRANSLATE TECHNICAL TERMS:
   - File formats: PDF, JSON, CSV, XML, HTML, ZIP, DOCX, XLSX, PNG, JPG
   - Protocols: API, OAuth2, SAML2, JWT, SMTP, HTTP, HTTPS, SSL, TLS
   - Technologies: Git, GitHub, Google, PostHog, Scarf, LibreOffice, Ghostscript, Tesseract, OCR
   - Technical keywords: URL, URI, DPI, RGB, CMYK, QR
   - "Stirling PDF" - always keep as-is

4. MAINTAIN CONSISTENT TERMINOLOGY:
   - Use the SAME translation for repeated terms throughout
   - Do not introduce new terminology or synonyms
   - Keep UI action words consistent (e.g., "upload", "download", "compress")

5. PRESERVE SPECIAL KEYWORDS IN CONTEXT:
   - Mathematical expressions: "2n", "2n-1", "3n" (in page selection)
   - Special keywords: "all", "odd", "even" (in page contexts)
   - Code examples and technical patterns

6. JSON STRUCTURE:
   - Translate ONLY the values (text after :), NEVER the keys
   - Return ONLY valid JSON with exact same structure
   - Maintain all quotes, commas, and braces

7. TONE & STYLE:
   - Use appropriate formal/informal tone for {language_name} UI
   - Keep translations concise and user-friendly
   - Maintain the professional but accessible tone of the original

8. DO NOT ADD OR REMOVE TEXT:
   - Do not add explanations, comments, or extra text
   - Do not remove any part of the original meaning
   - Keep the same level of detail

Return ONLY the translated JSON. No markdown, no explanations, just the JSON object."""

    def translate_batch(self, batch_data: dict, target_language: str, language_code: str) -> dict:
        """Translate a batch file using OpenAI API."""
        # Convert batch to compact JSON for API
        input_json = json.dumps(batch_data, ensure_ascii=False, separators=(',', ':'))

        print(f"Translating {len(batch_data)} entries to {target_language}...")
        print(f"Input size: {len(input_json)} characters")

        try:
            # GPT-5 only supports temperature=1, so we don't include it
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {
                        "role": "system",
                        "content": self.get_translation_prompt(target_language, language_code)
                    },
                    {
                        "role": "user",
                        "content": f"Translate this JSON:\n\n{input_json}"
                    }
                ],
            )

            translated_text = response.choices[0].message.content.strip()

            # Remove markdown code fences if present; drop the last line
            # only when it really is a closing fence
            if translated_text.startswith("```"):
                lines = translated_text.split('\n')[1:]
                if lines and lines[-1].strip().startswith("```"):
                    lines = lines[:-1]
                translated_text = '\n'.join(lines)

            # Parse the translated JSON
            translated_data = json.loads(translated_text)

            print(f"✓ Translation complete")
            return translated_data

        except json.JSONDecodeError as e:
            print(f"Error: AI returned invalid JSON: {e}")
            print(f"Response: {translated_text[:500]}...")
            raise
        except Exception as e:
            print(f"Error during translation: {e}")
            raise

    def validate_translation(self, original: dict, translated: dict) -> bool:
        """Validate that translation preserved all placeholders and structure."""
        issues = []

        # Check that all keys are present
        if set(original.keys()) != set(translated.keys()):
            missing = set(original.keys()) - set(translated.keys())
            extra = set(translated.keys()) - set(original.keys())
            if missing:
                issues.append(f"Missing keys: {missing}")
            if extra:
                issues.append(f"Extra keys: {extra}")

        # Check placeholders in each value
        import re
        # Try the double-brace form first; otherwise the single-brace
        # alternative matches "{{variable}" and the second one is never used
        placeholder_pattern = r'\{\{[^}]+\}\}|\{[^}]+\}'

        for key in original.keys():
            if key not in translated:
                continue

            orig_value = str(original[key])
            trans_value = str(translated[key])

            # Find all placeholders in original
            orig_placeholders = set(re.findall(placeholder_pattern, orig_value))
            trans_placeholders = set(re.findall(placeholder_pattern, trans_value))

            if orig_placeholders != trans_placeholders:
                issues.append(f"Placeholder mismatch in '{key}': {orig_placeholders} vs {trans_placeholders}")

        if issues:
            print("\n⚠ Validation warnings:")
            for issue in issues[:10]:  # Show first 10 issues
                print(f" - {issue}")
            if len(issues) > 10:
                print(f" ... and {len(issues) - 10} more issues")
            return False

        print("✓ Validation passed")
        return True


def get_language_info(language_code: str) -> tuple:
    """Get full language name from code."""
    languages = {
        'zh-CN': ('Simplified Chinese', 'zh-CN'),
        'es-ES': ('Spanish', 'es-ES'),
        'it-IT': ('Italian', 'it-IT'),
        'de-DE': ('German', 'de-DE'),
        'ar-AR': ('Arabic', 'ar-AR'),
        'pt-BR': ('Brazilian Portuguese', 'pt-BR'),
        'ru-RU': ('Russian', 'ru-RU'),
        'fr-FR': ('French', 'fr-FR'),
        'ja-JP': ('Japanese', 'ja-JP'),
        'ko-KR': ('Korean', 'ko-KR'),
        'nl-NL': ('Dutch', 'nl-NL'),
        'pl-PL': ('Polish', 'pl-PL'),
        'sv-SE': ('Swedish', 'sv-SE'),
        'da-DK': ('Danish', 'da-DK'),
        'no-NB': ('Norwegian', 'no-NB'),
        'fi-FI': ('Finnish', 'fi-FI'),
        'tr-TR': ('Turkish', 'tr-TR'),
        'vi-VN': ('Vietnamese', 'vi-VN'),
        'th-TH': ('Thai', 'th-TH'),
        'id-ID': ('Indonesian', 'id-ID'),
        'hi-IN': ('Hindi', 'hi-IN'),
        'cs-CZ': ('Czech', 'cs-CZ'),
        'hu-HU': ('Hungarian', 'hu-HU'),
        'ro-RO': ('Romanian', 'ro-RO'),
        'uk-UA': ('Ukrainian', 'uk-UA'),
        'el-GR': ('Greek', 'el-GR'),
        'bg-BG': ('Bulgarian', 'bg-BG'),
        'hr-HR': ('Croatian', 'hr-HR'),
        'sk-SK': ('Slovak', 'sk-SK'),
        'sl-SI': ('Slovenian', 'sl-SI'),
        'ca-CA': ('Catalan', 'ca-CA'),
    }

    return languages.get(language_code, (language_code, language_code))


def main():
    parser = argparse.ArgumentParser(
        description='Translate JSON batch files using OpenAI API',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Translate single batch file
  python batch_translator.py zh_CN_batch_1_of_4.json --api-key YOUR_KEY --language zh-CN

  # Translate all batches for a language (with pattern)
  python batch_translator.py "zh_CN_batch_*_of_*.json" --api-key YOUR_KEY --language zh-CN

  # Use environment variable for API key
  export OPENAI_API_KEY=your_key_here
  python batch_translator.py zh_CN_batch_1_of_4.json --language zh-CN

  # Use different model
  python batch_translator.py file.json --api-key KEY --language es-ES --model gpt-4-turbo
"""
    )

    parser.add_argument('input_files', nargs='+', help='Input batch JSON file(s) or pattern')
    parser.add_argument('--api-key', help='OpenAI API key (or set OPENAI_API_KEY env var)')
    parser.add_argument('--language', '-l', required=True, help='Target language code (e.g., zh-CN, es-ES)')
    parser.add_argument('--model', default='gpt-5', help='OpenAI model to use (default: gpt-5, options: gpt-5-mini, gpt-5-nano)')
    parser.add_argument('--output-suffix', default='_translated', help='Suffix for output files (default: _translated)')
    parser.add_argument('--skip-validation', action='store_true', help='Skip validation checks')
    parser.add_argument('--delay', type=float, default=1.0, help='Delay between API calls in seconds (default: 1.0)')

    args = parser.parse_args()

    # Get API key from args or environment
    import os
    api_key = args.api_key or os.environ.get('OPENAI_API_KEY')
    if not api_key:
        print("Error: OpenAI API key required. Provide via --api-key or OPENAI_API_KEY environment variable")
        sys.exit(1)

    # Get language info
    language_name, language_code = get_language_info(args.language)

    # Expand file patterns
    import glob
    input_files = []
    for pattern in args.input_files:
        matched = glob.glob(pattern)
        if matched:
            input_files.extend(matched)
        else:
            input_files.append(pattern)  # Use as literal filename

    if not input_files:
        print("Error: No input files found")
        sys.exit(1)

    print(f"Batch Translator")
    print(f"Target Language: {language_name} ({language_code})")
    print(f"Model: {args.model}")
    print(f"Files to translate: {len(input_files)}")
    print("=" * 60)

    # Initialize translator
    translator = BatchTranslator(api_key, args.model)

    # Process each file
    successful = 0
    failed = 0

    for i, input_file in enumerate(input_files, 1):
        print(f"\n[{i}/{len(input_files)}] Processing: {input_file}")

        try:
            # Load input file
            with open(input_file, 'r', encoding='utf-8') as f:
                batch_data = json.load(f)

            # Translate
            translated_data = translator.translate_batch(batch_data, language_name, language_code)

            # Validate
            if not args.skip_validation:
                translator.validate_translation(batch_data, translated_data)

            # Save output
            input_path = Path(input_file)
            output_file = input_path.stem + args.output_suffix + input_path.suffix

            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(translated_data, f, ensure_ascii=False, separators=(',', ':'))

            print(f"✓ Saved to: {output_file}")
            successful += 1

            # Delay between API calls to avoid rate limits
            if i < len(input_files):
                time.sleep(args.delay)

        except Exception as e:
            print(f"✗ Failed: {e}")
            failed += 1
            continue

    # Summary
    print("\n" + "=" * 60)
    print(f"Translation complete!")
    print(f"Successful: {successful}/{len(input_files)}")
    if failed > 0:
        print(f"Failed: {failed}/{len(input_files)}")

    sys.exit(0 if failed == 0 else 1)


if __name__ == "__main__":
    import os
    main()