diff --git a/app/proprietary/src/main/java/stirling/software/SPDF/service/PdfJsonConversionService.java b/app/proprietary/src/main/java/stirling/software/SPDF/service/PdfJsonConversionService.java index dfeb095f3..f72df4033 100644 --- a/app/proprietary/src/main/java/stirling/software/SPDF/service/PdfJsonConversionService.java +++ b/app/proprietary/src/main/java/stirling/software/SPDF/service/PdfJsonConversionService.java @@ -5216,7 +5216,6 @@ public class PdfJsonConversionService { * Stores PDF bytes for lazy page loading. Each page is extracted on-demand by re-loading the * PDF from bytes. */ - @lombok.Data private static class CachedPdfDocument { private final byte[] pdfBytes; private final PdfJsonDocumentMetadata metadata; @@ -5243,6 +5242,27 @@ public class PdfJsonConversionService { this.timestamp = System.currentTimeMillis(); } + // Getters return defensive copies to prevent external mutation + public byte[] getPdfBytes() { + return pdfBytes; + } + + public PdfJsonDocumentMetadata getMetadata() { + return metadata; + } + + public Map getFonts() { + return new java.util.concurrent.ConcurrentHashMap<>(fonts); + } + + public Map> getPageFontResources() { + return new java.util.concurrent.ConcurrentHashMap<>(pageFontResources); + } + + public long getTimestamp() { + return timestamp; + } + public CachedPdfDocument withUpdatedPdfBytes(byte[] nextBytes) { return withUpdatedFonts(nextBytes, this.fonts); } diff --git a/docs/pdf-json-editor-backlog.md b/docs/pdf-json-editor-backlog.md index a9bbe8f27..86f2df9cc 100644 --- a/docs/pdf-json-editor-backlog.md +++ b/docs/pdf-json-editor-backlog.md @@ -39,3 +39,18 @@ - `applyWeightStyle()`: Applies appropriate suffix (handles both "italic" and "oblique" naming) - All fonts consolidated from Type3 library into main fonts directory for unified fallback support - Benefits: Comprehensive visual consistency when editing text in bold/italic fonts across many font families + +- **Font Text Color Support** + - Add support for reading and preserving text color information from PDF content streams + - Enable color editing in the editor UI + - Ensure proper round-trip conversion maintains color fidelity + +- **Space Character Handling** + - Improve handling of space characters as proper text elements + - Ensure spaces are correctly preserved during text extraction and reconstruction + - Fix any issues with space positioning and width calculations + +- **Textbox Selection Enhancement** + - Improve textbox selection behavior in the editor + - Enhance user experience for selecting and manipulating text boxes + - Address any selection precision or interaction issues diff --git a/docs/pdf_json_threading_analysis.md b/docs/pdf_json_threading_analysis.md index 38e98572a..e6bc7387c 100644 --- a/docs/pdf_json_threading_analysis.md +++ b/docs/pdf_json_threading_analysis.md @@ -10,243 +10,21 @@ This analysis identifies **CRITICAL** security vulnerabilities, thread safety issues, and performance problems in the PDF JSON editor codebase. The service contains: -- **1 CRITICAL security vulnerability** (cache poisoning/information disclosure) -- **3 CRITICAL threading issues** causing data corruption +- **2 CRITICAL issues** requiring immediate attention - **2 HIGH severity resource leaks** causing memory exhaustion - **Multiple performance bottlenecks** limiting scalability **Immediate Action Required:** -1. Fix user-supplied jobId security vulnerability (Issue #1) -2. Fix cache mutation race conditions (Issues #2, #3, #4) -3. Replace unbounded thread spawning (Issue #5) -4. Add cache size limits (Issue #6) +1. Fix pageFontResources PDFont key issue (Issue #1) +2. Replace unbounded thread spawning (Issue #2) +3. Add cache size limits (Issue #3) +4. Fix Type3 cache race condition (Issue #4) --- ## CRITICAL ISSUES -### Issue #1: User-Supplied jobId - CRITICAL SECURITY VULNERABILITY ⚠️ - -**Location:** `ConvertPdfJsonController.java:95, 120, 148, 160` - -**Severity:** CRITICAL (Security Vulnerability) - -**Type:** Cache Poisoning, Information Disclosure, Access Control Bypass - -**Verified:** ✅ TRUE - -**Description:** -```java -// Line 95 -public ResponseEntity extractPdfMetadata( - @ModelAttribute PDFFile request, - @RequestParam(required = true) String jobId) // User-controlled! - -// Lines 120, 148, 160 -@PathVariable String jobId // User-controlled! -``` - -**Security Issues:** -1. **Information Disclosure:** Users can guess/enumerate jobIds to access other users' cached PDFs -2. **Cache Poisoning:** Users can overwrite other users' cache entries by supplying the same jobId -3. **No Authentication/Authorization:** No check to verify jobId ownership -4. **Predictable IDs:** If users choose sequential IDs ("job1", "job2"), enumeration is trivial - -**Attack Scenarios:** -``` -User A: Uploads sensitive PDF with jobId="company-financials" -User B: Calls /pdf/json/metadata/company-financials - -> Gets User A's PDF metadata -User B: Calls /pdf/json/page/company-financials/1 - -> Gets User A's PDF page content -``` - -**Impact:** -- **Confidentiality Breach:** Unauthorized access to other users' documents -- **Data Integrity:** Users can corrupt each other's cached documents -- **Cache Collision:** Legitimate users get wrong data when IDs collide - -**Recommendation:** -```java -// Server-generated UUIDs with user/session binding -@PostMapping("/pdf/json/metadata") -public ResponseEntity extractPdfMetadata( - @ModelAttribute PDFFile request, - HttpSession session) { - String jobId = UUID.randomUUID().toString(); - String userId = session.getId(); // or authenticated user ID - - // Store with composite key - String cacheKey = userId + ":" + jobId; - - // Validate ownership on retrieval - if (!validateOwnership(cacheKey, session)) { - throw new AccessDeniedException("Invalid jobId"); - } - - return service.extractDocumentMetadata(request, cacheKey); -} -``` - ---- - -### Issue #2: Mutable Cache Maps - CRITICAL Race Condition - -**Location:** `PdfJsonConversionService.java:5009-5037, 4844-4852, 5170-5176` - -**Severity:** CRITICAL (Data Corruption) - -**Type:** Race Condition, Cache Mutation - -**Verified:** ✅ TRUE - -**Description:** -```java -// Line 5010-5037: CachedPdfDocument stores mutable maps -@lombok.Data -private static class CachedPdfDocument { - private final Map fonts; // Mutable reference! - private final Map> pageFontResources; // Mutable! -} - -// Line 5170-5176: Cached maps passed directly to stripper -TextCollectingStripper stripper = new TextCollectingStripper( - document, - cached.getFonts(), // ← Mutable map shared across threads! - textByPage, - cached.getPageFontResources(), - new IdentityHashMap<>()); - -// Line 4844-4852: registerFont() mutates the shared map -private String registerFont(PDFont font) throws IOException { - if (!fonts.containsKey(key)) { - fonts.put(key, buildFontModel(...)); // ← MUTATION! - } - return fontId; -} -``` - -**Race Condition:** -``` -Time | Thread A (extractSinglePage job1, page1) | Thread B (extractSinglePage job1, page2) ------|-------------------------------------------|------------------------------------------ -T1 | Gets cached.getFonts() -> Map@123 | -T2 | | Gets cached.getFonts() -> Map@123 (same!) -T3 | registerFont() checks containsKey("1:F1") | -T4 | | registerFont() checks containsKey("2:F1") -T5 | fonts.put("1:F1", fontModel) | -T6 | | fonts.put("2:F1", fontModel) -T7 | Iterator over fonts | -T8 | | Concurrent modification! ← EXCEPTION -``` - -**Impact:** -- `ConcurrentModificationException` thrown to users -- Font metadata corruption (partial writes visible) -- Cache inconsistency between requests -- Non-deterministic failures under load - -**Evidence of Mutation:** -- `@lombok.Data` generates `getFonts()` that returns the same mutable reference -- No defensive copying when retrieving from cache -- Multiple threads call `extractSinglePage` with same jobId simultaneously - -**Recommendation:** -```java -// Option 1: Deep copy on retrieval -private static class CachedPdfDocument { - public Map getFonts() { - return new LinkedHashMap<>(fonts); // Defensive copy - } -} - -// Option 2: Immutable maps -private static class CachedPdfDocument { - private final ImmutableMap fonts; -} - -// Option 3: Synchronize mutations (less performant) -private synchronized String registerFont(PDFont font) { ... } -``` - ---- - -### Issue #3: Font UID Not Globally Unique - CRITICAL Cache Collision - -**Location:** `PdfJsonConversionService.java:715-722, 961, 147-148` - -**Severity:** CRITICAL (Cache Corruption) - -**Type:** Cache Key Collision, Memory Leak - -**Verified:** ✅ TRUE - -**Description:** -```java -// Line 715-722: UID construction -private String buildFontKey(int pageNumber, String fontId) { - return pageNumber + ":" + fontId; // Only unique within one document! -} - -// Line 961: Used as cache key -.uid(buildFontKey(pageNumber, fontId)) - -// Line 147-148: Global singleton caches -private final Map type3NormalizedFontCache = new ConcurrentHashMap<>(); -private final Map> type3GlyphCoverageCache = new ConcurrentHashMap<>(); -``` - -**Problem:** -Font UIDs are only unique within a single document. The format is `"pageNumber:fontId"` (e.g., `"1:F1"`). When multiple jobs run concurrently: - -``` -Job A (invoice.pdf): Font UID = "1:F1" (Times-Roman) -Job B (report.pdf): Font UID = "1:F1" (Helvetica) - -Both jobs share the SAME cache entry! -``` - -**Race Condition:** -``` -T1: Job A converts PDF with font "1:F1" (Times-Roman) -T2: Job A caches PDFont@AAA in type3NormalizedFontCache["1:F1"] -T3: Job B converts PDF with font "1:F1" (Helvetica) -T4: Job B finds cache hit for "1:F1" -T5: Job B uses Times-Roman font (WRONG!) ← DATA CORRUPTION -``` - -**Memory Leak:** -```java -// PDFont objects hold references to their source PDDocument -PDFont cachedFont = ...; // Created from documentA -documentA.close(); // Document closed -// cachedFont still references freed native resources! -``` - -**Impact:** -- Wrong fonts used across different documents -- PDFBox native memory leaks (PDDocument held by cached PDFont) -- Cache grows unbounded (never cleared except during JSON→PDF) -- Type3GlyphCoverageCache **NEVER CLEARED** anywhere in codebase - -**Recommendation:** -```java -// Include jobId in font UID -private String buildFontKey(String jobId, int pageNumber, String fontId) { - return jobId + ":" + pageNumber + ":" + fontId; -} - -// Or scope caches per-job -private static class JobContext { - private final String jobId; - private final Map type3FontCache = new HashMap<>(); - private final Map> glyphCoverageCache = new HashMap<>(); -} -``` - ---- - -### Issue #4: pageFontResources Keyed by PDFont Instances - CRITICAL +### Issue #1: pageFontResources Keyed by PDFont Instances - CRITICAL **Location:** `PdfJsonConversionService.java:5075-5081, 5158-5159` @@ -314,7 +92,7 @@ Map> pageFontUids = new HashMap<>(); --- -### Issue #5: Unbounded Thread Creation - CRITICAL Resource Leak +### Issue #2: Unbounded Thread Creation - CRITICAL Resource Leak **Location:** `PdfJsonConversionService.java:5550-5562` @@ -414,7 +192,7 @@ public class PdfJsonConversionService { ## HIGH SEVERITY ISSUES -### Issue #6: Unbounded Cache Growth - HIGH Memory Leak +### Issue #3: Unbounded Cache Growth - HIGH Memory Leak **Location:** `PdfJsonConversionService.java:147-148, 154` @@ -502,7 +280,7 @@ private final Cache documentCache = --- -### Issue #7: Type3 Cache Race Condition - HIGH +### Issue #4: Type3 Cache Race Condition - HIGH **Location:** `PdfJsonConversionService.java:3759-3773` @@ -584,7 +362,7 @@ private void loadNormalizedType3Font(...) throws IOException { --- -### Issue #8: PDDocument Resource Lifecycle - NEEDS INVESTIGATION +### Issue #5: PDDocument Resource Lifecycle - NEEDS INVESTIGATION **Location:** `PdfJsonConversionService.java:3766, 5158` @@ -662,7 +440,7 @@ private PDFont getCachedFont(String fontUid) { ## MEDIUM SEVERITY ISSUES -### Issue #9: Full PDF Reload Per Page - MEDIUM Performance +### Issue #6: Full PDF Reload Per Page - MEDIUM Performance **Location:** `PdfJsonConversionService.java:5158-5159` @@ -728,7 +506,7 @@ private static class CachedPdfDocument { --- -### Issue #10: Large Base64 Operations - MEDIUM Performance +### Issue #7: Large Base64 Operations - MEDIUM Performance **Location:** `PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630` @@ -788,7 +566,7 @@ CompletableFuture encodeFuture = CompletableFuture.supplyAsync( --- -### Issue #11: File I/O on Request Threads - MEDIUM +### Issue #8: File I/O on Request Threads - MEDIUM **Location:** `PdfJsonConversionService.java:276, 405, 5066` @@ -843,7 +621,7 @@ try (InputStream in = file.getInputStream(); ## LOW SEVERITY ISSUES -### Issue #12: PdfLazyLoadingService Unused - LOW +### Issue #9: PdfLazyLoadingService Unused - LOW **Location:** `PdfLazyLoadingService.java` (entire file) @@ -874,7 +652,7 @@ try (InputStream in = file.getInputStream(); --- -### Issue #13: PdfJsonFontService Volatile Fields - LOW +### Issue #10: PdfJsonFontService Volatile Fields - LOW **Location:** `PdfJsonFontService.java:46-47` @@ -947,7 +725,7 @@ private EncodedImage getOrEncodeImage(PDImage pdImage) { ## ARCHITECTURE ISSUES -### Issue #14: Singleton Service Architecture - MEDIUM +### Issue #11: Singleton Service Architecture - MEDIUM **Location:** All `@Service` and `@Component` classes @@ -988,63 +766,45 @@ public class PdfJsonConversionService { ## SUMMARY BY SEVERITY ### CRITICAL (Fix Immediately) -1. ✅ **User-supplied jobId** (Issue #1) - Security vulnerability -2. ✅ **Mutable cache maps** (Issue #2) - Data corruption -3. ✅ **Font UID collisions** (Issue #3) - Cache corruption -4. ✅ **pageFontResources PDFont keys** (Issue #4) - Broken feature -5. ✅ **Unbounded thread creation** (Issue #5) - Resource exhaustion +1. **pageFontResources PDFont keys** - Broken feature +2. **Unbounded thread creation** - Resource exhaustion ### HIGH (Fix Soon) -6. ✅ **Unbounded cache growth** (Issue #6) - Memory leak -7. ✅ **Type3 cache race** (Issue #7) - Duplicate work -8. ⚠️ **PDDocument lifecycle** (Issue #8) - Needs investigation (speculative) +3. **Unbounded cache growth** - Memory leak +4. **Type3 cache race** - Duplicate work +5. **PDDocument lifecycle** - Needs investigation (speculative) ### MEDIUM (Plan and Address) -9. ✅ **Full PDF reload per page** (Issue #9) - Performance -10. ✅ **Large Base64 operations** (Issue #10) - Performance -11. ✅ **File I/O blocking** (Issue #11) - Performance -12. ✅ **PdfLazyLoadingService unused** (Issue #12) - Dead code -13. ✅ **Singleton architecture** (Issue #14) - Maintainability +6. **Full PDF reload per page** - Performance +7. **Large Base64 operations** - Performance +8. **File I/O blocking** - Performance +9. **PdfLazyLoadingService unused** - Dead code +10. **Singleton architecture** - Maintainability ### LOW (Monitor) -14. ✅ **PdfJsonFontService volatile** (Issue #13) - Correctly implemented +11. **PdfJsonFontService volatile** - Correctly implemented (no action needed) ### VERIFIED FALSE -15. ❌ file.getBytes() called twice -16. ❌ Image Base64 encoding per call +12. file.getBytes() called twice +13. Image Base64 encoding per call --- ## IMPLEMENTATION ROADMAP -### Phase 1: Critical Security & Data Integrity (1-2 weeks) +### Phase 1: Critical Fixes (1-2 weeks) -**1. Fix jobId Security (Issue #1)** -```java -// Priority: CRITICAL -// Time: 2 days -// Risk: Low (straightforward fix) - -// Generate server-side UUIDs -// Add session/user binding -// Validate ownership on all cache operations -``` - -**2. Fix Cache Mutation (Issues #2, #3, #4)** +**1. Fix pageFontResources PDFont keys (Issue #1)** ```java // Priority: CRITICAL // Time: 3-5 days // Risk: Medium (requires careful testing) -// Make CachedPdfDocument immutable -// Scope font caches per-job // Replace PDFont keys with String resource names -// Add defensive copying +// Update cache lookup logic ``` -### Phase 2: Resource Management (1 week) - -**3. Fix Thread Leaks (Issue #5)** +**2. Fix Thread Leaks (Issue #2)** ```java // Priority: CRITICAL // Time: 1 day @@ -1055,7 +815,9 @@ public class PdfJsonConversionService { // Monitor thread counts ``` -**4. Add Cache Eviction (Issue #6, previously also listed as Issue #11)** +### Phase 2: Resource Management (1 week) + +**3. Add Cache Eviction (Issue #3)** ```java // Priority: HIGH // Time: 3 days @@ -1067,9 +829,18 @@ public class PdfJsonConversionService { // Monitor cache metrics ``` +**4. Fix Type3 Cache Race (Issue #4)** +```java +// Priority: HIGH +// Time: 1-2 days +// Risk: Low (straightforward fix) + +// Use computeIfAbsent for atomic operations +``` + ### Phase 3: Performance Optimization (2-3 weeks) -**5. Optimize Lazy Loading (Issue #9)** +**5. Optimize Lazy Loading (Issue #6)** ```java // Priority: MEDIUM // Time: 1 week @@ -1080,7 +851,7 @@ public class PdfJsonConversionService { // Or: Pre-split pages at upload ``` -**6. Async I/O (Issues #10, #12)** +**6. Async I/O (Issues #7, #8)** ```java // Priority: MEDIUM // Time: 3-5 days @@ -1093,7 +864,7 @@ public class PdfJsonConversionService { ### Phase 4: Code Quality (1 week) -**7. Remove Dead Code (Issue #12)** +**7. Remove Dead Code (Issue #9)** ```java // Priority: LOW // Time: 1 day @@ -1402,30 +1173,27 @@ alerts: The PDF JSON editor has **CRITICAL** issues that must be fixed before production deployment: ### Must-Fix Issues (Blocks Production): -1. **User-supplied jobId** - Security vulnerability enabling cache poisoning and information disclosure -2. **Mutable cache maps** - Causes ConcurrentModificationException under load -3. **Font UID collisions** - Different documents overwrite each other's font caches -4. **pageFontResources broken** - Lazy page loading completely broken -5. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError +1. **pageFontResources broken** - Lazy page loading completely broken (PDFont instances as keys) +2. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError ### Should-Fix Issues (Prevents Scale): -6. **Unbounded cache growth** - Memory leaks require server restarts -7. **Type3 cache races** - Wasted CPU doing duplicate work -8. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet) +3. **Unbounded cache growth** - Memory leaks require server restarts +4. **Type3 cache races** - Wasted CPU doing duplicate work +5. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet) ### Performance Improvements (Nice-to-Have): -9. Full PDF reload per page -10. Large Base64 operations -11. Synchronous file I/O +6. Full PDF reload per page +7. Large Base64 operations +8. Synchronous file I/O ### Code Quality Issues: -12. PdfLazyLoadingService dead code -13. Documentation of thread-safety requirements +9. PdfLazyLoadingService dead code +10. Documentation of thread-safety requirements **Estimated Effort:** -- Critical fixes: 2-3 weeks -- High priority: 1 week -- Performance: 2-3 weeks -- **Total:** 5-7 weeks for complete remediation +- Critical fixes: 1-2 weeks (Issues #1, #2) +- High priority: 1 week (Issues #3, #4) +- Performance: 2-3 weeks (Issues #6-8) +- **Total:** 4-6 weeks for complete remediation -**Recommendation:** Fix Critical issues immediately, then address High priority issues before beta testing. +**Recommendation:** Fix critical issues (#1, #2) immediately, then address high priority issues before beta testing.