# PDF JSON Editor - Threading, Concurrency, and Performance Analysis

**Date:** 2025-01-09
**Version:** 1.0
**Status:** Comprehensive analysis combining automated review and manual verification

---

## Executive Summary

This analysis identifies **CRITICAL** security vulnerabilities, thread safety issues, and performance problems in the PDF JSON editor codebase. The service contains:

- **2 CRITICAL issues** requiring immediate attention
- **2 HIGH severity resource leaks** causing memory exhaustion
- **Multiple performance bottlenecks** limiting scalability

**Immediate Action Required:**

1. Fix pageFontResources PDFont key issue (Issue #1)
2. Replace unbounded thread spawning (Issue #2)
3. Add cache size limits (Issue #3)
4. Fix Type3 cache race condition (Issue #4)

---

## CRITICAL ISSUES

### Issue #1: pageFontResources Keyed by PDFont Instances - CRITICAL

**Location:** `PdfJsonConversionService.java:5075-5081, 5158-5159`
**Severity:** CRITICAL (Broken Functionality)
**Type:** Object Identity Mismatch, Cache Miss
**Verified:** ✅ TRUE

**Description:**

```java
// Lines 5075-5081: Initial metadata extraction (first PDF load)
try (PDDocument document = pdfDocumentFactory.load(pdfBytes, true)) {
    Map<Integer, Map<PDFont, String>> pageFontResources = new HashMap<>();
    for (PDPage page : document.getPages()) {
        Map<PDFont, String> resourceMap = collectFontsForPage(...);
        // resourceMap keys are PDFont instances from this document
        pageFontResources.put(pageNumber, resourceMap);
    }
    // Cache it
    documentCache.put(jobId, new CachedPdfDocument(..., pageFontResources, ...));
}
// PDDocument closed, PDFont instances now reference freed document

// Lines 5158-5159: Lazy page extraction (reloads PDF)
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    // NEW PDFont instances created!
    PDPage page = document.getPage(pageIndex);

    // Try to look up fonts
    Map<PDFont, String> cachedResourceMap = cached.getPageFontResources().get(pageNum);
    // cachedResourceMap keys are OLD PDFont instances from the closed document

    // Lookup using NEW PDFont instances
    String fontId = cachedResourceMap.get(newFont); // Always NULL! ← BUG
}
```

**Why It Fails:**

1. **Object Identity:** `Map<PDFont, String>` uses PDFont object identity as the key
2. **Different Instances:** Each PDF load creates new PDFont instances with different identities
3. **Lookup Fails:** `cachedResourceMap.get(newFont)` returns null because `newFont != oldFont`
4. **Defeats Caching Goal:** Every lazy page request rebuilds font metadata, defeating the cache

**Impact:**

- Lazy page loading doesn't reuse cached font metadata
- CPU wasted rebuilding font info on every page request
- Cache only stores garbage (unusable keys)
- "Consistent font UID" feature completely broken

**Evidence:**

```java
// No code actually uses the cached pageFontResources successfully
// Every extractSinglePage call rebuilds fonts from scratch
```

**Recommendation:**

```java
// Use resource names as keys instead of PDFont objects
Map<Integer, Map<String, String>> pageFontResources = new HashMap<>();
// Key: font resource name (e.g., "F1"), Value: font UID

// Or track the font UIDs per page directly
Map<Integer, Set<String>> pageFontUids = new HashMap<>();
```
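
**Fix sketch (illustrative):** a resource-name-keyed version of the per-page collection could look like the following. PDBox resource names are stable across reloads of the same bytes, while PDFont identities are not; `fontUidFor` is a hypothetical stand-in for the service's existing UID-assignment logic, which is not shown in this report.

```java
// Build a per-page map keyed by the stable font resource name (e.g. "F1")
// instead of the PDFont instance.
private Map<String, String> collectFontUidsByResourceName(PDPage page) throws IOException {
    Map<String, String> uidsByResourceName = new HashMap<>();
    PDResources resources = page.getResources();
    if (resources == null) {
        return uidsByResourceName;
    }
    for (COSName name : resources.getFontNames()) {
        PDFont font = resources.getFont(name);
        if (font != null) {
            // The resource name survives document reloads; the PDFont does not
            uidsByResourceName.put(name.getName(), fontUidFor(font));
        }
    }
    return uidsByResourceName;
}
```

On the lazy-extraction path the lookup then becomes `cachedResourceMap.get(resourceName)`, which keeps working after the PDF is reloaded.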

---

### Issue #2: Unbounded Thread Creation - CRITICAL Resource Leak

**Location:** `PdfJsonConversionService.java:5550-5562`
**Severity:** CRITICAL (Resource Exhaustion)
**Type:** Thread Leak, Memory Leak
**Verified:** ✅ TRUE

**Description:**

```java
private void scheduleDocumentCleanup(String jobId) {
    new Thread(
            () -> {
                try {
                    Thread.sleep(TimeUnit.MINUTES.toMillis(30)); // Sleep 30 minutes!
                    clearCachedDocument(jobId);
                    log.debug("Auto-cleaned cached document for jobId: {}", jobId);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            })
        .start(); // Unmanaged thread!
}
```

**Also in:** `PdfLazyLoadingService.java:256-269` (duplicate implementation)

**Problems:**

1. **One Thread Per Upload:** Each PDF upload spawns a new thread
2. **No Thread Pool:** Unlimited thread creation (no cap)
3. **Long Sleep:** Threads sleep for 30 minutes holding resources
4. **No Cancellation:** Cleanup threads cannot be stopped if a job completes early (a cancellation sketch follows the recommendation below)
5. **Non-Daemon:** Threads prevent JVM shutdown (daemon=false by default)
6. **No Monitoring:** No visibility into active cleanup threads

**Impact Under Load:**

```
100 concurrent uploads → 100 cleanup threads spawned
Each thread: ~1MB stack + overhead
Total: ~100MB+ wasted on sleeping threads

1000 uploads/day → 1000+ threads accumulate
OS thread limit (ulimit -u) exceeded
→ OutOfMemoryError: unable to create new native thread
```

**Production Failure Scenario:**

```
09:00 - Peak traffic: 500 PDFs uploaded
09:00 - 500 cleanup threads spawned (each sleeps 30 min)
09:15 - 400 more uploads → 400 more threads
09:30 - First batch wakes up, cleans, exits
        But new threads keep getting created faster than old ones die
10:00 - Thread count > 2000
10:15 - JVM hits OS thread limit
10:16 - Server crashes: "OutOfMemoryError: unable to create new native thread"
```

**Recommendation:**

```java
@Service
public class PdfJsonConversionService {

    // Fixed-size thread pool for cleanup
    private final ScheduledExecutorService cleanupScheduler =
        Executors.newScheduledThreadPool(
            2, // Only 2 threads needed
            new ThreadFactoryBuilder()
                .setNameFormat("pdf-cache-cleanup-%d")
                .setDaemon(true)
                .build());

    private void scheduleDocumentCleanup(String jobId) {
        cleanupScheduler.schedule(
            () -> clearCachedDocument(jobId),
            30,
            TimeUnit.MINUTES);
    }

    @PreDestroy
    public void shutdown() {
        cleanupScheduler.shutdown();
        try {
            if (!cleanupScheduler.awaitTermination(10, TimeUnit.SECONDS)) {
                cleanupScheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            cleanupScheduler.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```
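
**Cancellation sketch (illustrative):** keeping each job's `ScheduledFuture` addresses problem #4, so cleanup can be cancelled when a job finishes early instead of firing 30 minutes later. The `pendingCleanups` field below is an assumption for this sketch, not existing code.

```java
// Track pending cleanups so they can be cancelled on early job completion
private final Map<String, ScheduledFuture<?>> pendingCleanups = new ConcurrentHashMap<>();

private void scheduleDocumentCleanup(String jobId) {
    ScheduledFuture<?> future = cleanupScheduler.schedule(
        () -> {
            pendingCleanups.remove(jobId);
            clearCachedDocument(jobId);
        },
        30,
        TimeUnit.MINUTES);
    pendingCleanups.put(jobId, future);
}

private void cancelDocumentCleanup(String jobId) {
    ScheduledFuture<?> future = pendingCleanups.remove(jobId);
    if (future != null) {
        future.cancel(false); // no interrupt needed; the task has not started yet
    }
}
```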

---

## HIGH SEVERITY ISSUES

### Issue #3: Unbounded Cache Growth - HIGH Memory Leak

**Location:** `PdfJsonConversionService.java:147-148, 154`
**Severity:** HIGH (Memory Exhaustion)
**Type:** Memory Leak, Missing Eviction Policy
**Verified:** ✅ TRUE

**Description:**

```java
// Line 147: Type3 normalized fonts - cleared only in convertJsonToPdf (line 454)
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();

// Line 148: Type3 glyph coverage - NEVER CLEARED anywhere in codebase!
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();

// Line 154: Document cache - relies on buggy cleanup threads
private final Map<String, CachedPdfDocument> documentCache = new ConcurrentHashMap<>();
```

**Growth Patterns:**

**1. type3NormalizedFontCache:**

- Written at line 3766: `type3NormalizedFontCache.put(fontModel.getUid(), font)`
- Cleared only at line 454: `type3NormalizedFontCache.clear()` (JSON→PDF conversion)
- **NOT cleared during PDF→JSON** conversion (the most common operation)
- Each PDFont holds references to native resources (C++ objects via JNI)
- Grows unbounded during PDF→JSON operations

**2. type3GlyphCoverageCache:**

- Written at line 1122: `type3GlyphCoverageCache.put(fontUid, coverageSet)`
- **NEVER CLEARED** in the entire codebase (verified via grep)
- Accumulates a `Set<Integer>` for every Type3 font ever processed
- Each set can contain thousands of integers (Unicode codepoints)
- Pure memory leak

**3. documentCache:**

- Stores full PDF bytes in memory
- Each entry can be 1MB-100MB+ (document bytes + metadata)
- Relies on cleanup threads (which have the problems described in Issue #2)
- If cleanup fails (exception, server restart), entries stay forever
- No max size check

**Impact:**

```
Long-running server processes 10,000 Type3 fonts:
- type3GlyphCoverageCache: 10,000 entries × ~1KB each = 10MB
- type3NormalizedFontCache: 1,000 cached fonts × ~100KB each = 100MB
- documentCache: 50 active jobs × 10MB each = 500MB

After 1 week: caches grow to 1GB+
After 1 month: OutOfMemoryError, server restart required
```

**Recommendation:**

```java
// Use Caffeine caches with eviction policies
private final Cache<String, PDFont> type3NormalizedFontCache =
    Caffeine.newBuilder()
        .maximumSize(1000) // Max 1000 fonts
        .expireAfterAccess(1, TimeUnit.HOURS) // Expire after 1hr unused
        .removalListener((key, value, cause) -> {
            // Clean up PDFont resources if needed
        })
        .build();

private final Cache<String, Set<Integer>> type3GlyphCoverageCache =
    Caffeine.newBuilder()
        .maximumSize(5000)
        .expireAfterWrite(1, TimeUnit.HOURS)
        .build();

private final Cache<String, CachedPdfDocument> documentCache =
    Caffeine.newBuilder()
        .maximumWeight(500_000_000) // 500MB max
        .weigher((String key, CachedPdfDocument doc) -> doc.getPdfBytes().length)
        .expireAfterWrite(30, TimeUnit.MINUTES)
        .removalListener((key, value, cause) ->
            log.info("Evicted document {} (cause: {})", key, cause))
        .build();
```

---

### Issue #4: Type3 Cache Race Condition - HIGH

**Location:** `PdfJsonConversionService.java:3759-3773`
**Severity:** HIGH (Duplicate Work)
**Type:** Check-Then-Act Race Condition
**Verified:** ✅ TRUE

**Description:**

```java
private void loadNormalizedType3Font(
        PDDocument document,
        PdfJsonFont fontModel,
        List<FontByteSource> candidates,
        String originalFormat)
        throws IOException {
    if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
        return;
    }
    if (type3NormalizedFontCache.containsKey(fontModel.getUid())) { // CHECK
        return;
    }
    for (FontByteSource source : candidates) {
        PDFont font = loadFontFromSource(...); // EXPENSIVE: 10-50ms
        if (font != null) {
            type3NormalizedFontCache.put(fontModel.getUid(), font); // ACT
            log.info("Cached normalized font {} for Type3 {}", ...);
            break;
        }
    }
}
```

**Race Condition:**

```
Thread A: Check cache for "1:F1" → MISS (line 3759)
Thread B: Check cache for "1:F1" → MISS (line 3759)
          [both pass the check!]
Thread A: Load font from bytes (10ms I/O + parsing)
Thread B: Load font from bytes (10ms I/O + parsing) ← DUPLICATE WORK
Thread A: Put font in cache (line 3766)
Thread B: Put font in cache (line 3766) [overwrites A's entry]
```

**Why ConcurrentHashMap Doesn't Help:**

- ConcurrentHashMap prevents **corruption** (map state stays consistent)
- ConcurrentHashMap does NOT prevent **duplicate work** (both threads compute)
- The check (`containsKey`) and the act (`put`) are separate operations

**Impact:**

- Wasted CPU cycles loading the same font twice
- Temporary memory spike (two fonts in the heap simultaneously)
- Font loading is expensive: Base64 decode + PDFBox parsing + font validation
- Under high concurrency, 10+ threads could all load the same font

**Recommendation:**

```java
private void loadNormalizedType3Font(...) throws IOException {
    if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
        return;
    }
    // Atomic compute-if-absent
    type3NormalizedFontCache.computeIfAbsent(fontModel.getUid(), uid -> {
        for (FontByteSource source : candidates) {
            try {
                PDFont font = loadFontFromSource(...);
                if (font != null) {
                    log.info("Cached normalized font {} for Type3 {}", ...);
                    return font;
                }
            } catch (IOException e) {
                log.warn("Failed to load font from {}: {}", source.originLabel(), e.getMessage());
            }
        }
        return null;
    });
}
```
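
To see the difference concretely, the following small standalone program (plain Java, not service code) shows that `computeIfAbsent` invokes the expensive loader at most once per key under contention. Two properties of the fix are worth noting: the mapping function runs while `ConcurrentHashMap` holds a lock for that key's bin, so concurrent requests for the same font wait instead of duplicating work, and a `null` result is not stored, so failed loads are retried on the next call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ComputeIfAbsentDemo {
    public static void main(String[] args) throws Exception {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        AtomicInteger loads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(10);
        CountDownLatch start = new CountDownLatch(1);

        for (int i = 0; i < 10; i++) {
            pool.submit(() -> {
                start.await(); // line all threads up on the same key
                cache.computeIfAbsent("1:F1", uid -> {
                    loads.incrementAndGet(); // counts actual loader invocations
                    return "loaded-font";
                });
                return null;
            });
        }

        start.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("loader ran " + loads.get() + " time(s)"); // prints 1
    }
}
```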

---

### Issue #5: PDDocument Resource Lifecycle - NEEDS INVESTIGATION

**Location:** `PdfJsonConversionService.java:3766, 5158`
**Severity:** UNKNOWN (Requires Investigation)
**Type:** Unclear Resource Ownership
**Verified:** ⚠️ SPECULATIVE (No concrete evidence of failure)

**Description:**

```java
// Line 3766: Cache a PDFont created from a PDDocument
try (PDDocument document = ...) {
    PDFont font = loadFontFromSource(document, fontModel, source, ...);
    type3NormalizedFontCache.put(fontModel.getUid(), font);
} // PDDocument is closed here!

// Later: the cached PDFont is used with a DIFFERENT PDDocument
try (PDDocument newDocument = ...) {
    PDFont cachedFont = type3NormalizedFontCache.get(fontUid);
    // Is cachedFont safe to use after the original document closed?
    // Does it hold references to freed native resources?
}
```

**Theoretical Concerns:**

1. **Native Memory:** PDFBox uses JNI for some operations
2. **Resource Ties:** PDFont may hold references to the source PDDocument
3. **Freed Resources:** Using a PDFont after its document closes could access freed memory
4. **Unclear Contract:** PDFBox documentation doesn't explicitly address font lifecycle

**Current Status:**

- ⚠️ **NO EVIDENCE OF ACTUAL FAILURES** - System appears to work in practice
- ⚠️ **NO CRASHES OBSERVED** - No segmentation faults or memory corruption reported
- ⚠️ **NO MEMORY LEAKS DETECTED** - No profiler data showing leaks
- ⚠️ **PURELY THEORETICAL CONCERN** - Based on API design, not observed behavior

**Why This May Actually Be Safe:**

- PDFBox may create self-contained PDFont objects
- Font data may be copied rather than referenced
- PDFBox may be designed for this use case
- Current code has been running without apparent issues

**Required Investigation:**

1. **PDFBox Source Code Review:** Check whether PDFont copies or references document data
2. **Load Testing:** Create a PDFont, close the document, use the font with a new document (see the sketch after this list)
3. **Memory Profiling:** Monitor for native memory leaks over extended runs
4. **PDFBox Documentation/Forums:** Search for guidance on font lifecycle
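
A minimal probe for investigation step 2, sketched as a JUnit 5 test. It assumes PDFBox 3's `Loader` (on 2.x, `PDDocument.load` would be used instead), and `loadTestPdfBytes()` is a hypothetical helper returning a PDF with at least one embedded font.

```java
@Test
void cachedFontRemainsUsableAfterSourceDocumentCloses() throws IOException {
    byte[] pdfBytes = loadTestPdfBytes(); // hypothetical helper
    PDFont cachedFont;
    try (PDDocument source = Loader.loadPDF(pdfBytes)) {
        PDResources resources = source.getPage(0).getResources();
        COSName firstFontName = resources.getFontNames().iterator().next();
        cachedFont = resources.getFont(firstFontName);
    } // source document closed here

    // If PDFont kept references to freed document state, these calls
    // are where a failure would be expected to surface.
    try (PDDocument target = Loader.loadPDF(pdfBytes)) {
        assertNotNull(cachedFont.getName());
        assertTrue(cachedFont.getStringWidth("test") >= 0);
    }
}
```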

**Recommendation:**

- **Priority: MEDIUM** (needs investigation but not blocking)
- Add monitoring for potential issues
- Test font reuse after document closure explicitly
- If problems are found, cache serialized bytes instead of PDFont objects

```java
// Option 1: Cache font bytes instead of PDFont objects
private void cacheType3FontBytes(String fontUid, byte[] fontBytes) {
    type3FontBytesCache.put(fontUid, fontBytes);
}

// Option 2: Verify the font is safe to use
private PDFont getCachedFont(String fontUid) {
    PDFont cached = type3NormalizedFontCache.get(fontUid);
    if (cached != null && !isFontValid(cached)) {
        log.warn("Cached font {} is invalid, removing", fontUid);
        type3NormalizedFontCache.remove(fontUid);
        return null;
    }
    return cached;
}
```

---

## MEDIUM SEVERITY ISSUES

### Issue #6: Full PDF Reload Per Page - MEDIUM Performance

**Location:** `PdfJsonConversionService.java:5158-5159`
**Severity:** MEDIUM (Performance)
**Type:** Inefficient I/O
**Verified:** ✅ TRUE

**Description:**

```java
// extractSinglePage method
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    // Full PDF loaded from bytes (10-100ms for large PDFs)
    PDPage page = document.getPage(pageIndex);
    // Extract just one page...
}
```

**Problem:** Every page request loads the entire PDF from bytes:

- 100-page PDF = load 10MB, extract 1 page
- 10 page requests = 10× full PDF loads (100MB of I/O)
- No incremental parsing or streaming

**Impact:**

```
100MB PDF, 50 pages requested sequentially:
- Total I/O: 100MB × 50 = 5GB
- Time: 50× parse time (5-10 seconds total)
- Memory: 100MB peak per request

Concurrent page requests for the same PDF:
- 10 threads × 100MB = 1GB temporary memory spike
```

**Why This Exists:** The lazy loading design trades memory (don't cache the full extraction) for CPU (reload on demand). But the tradeoff is poor because:

- PDFBox parsing is expensive
- Repeated decompression of streams
- Could cache extracted page data instead

**Recommendation:**

```java
// Option 1: Cache extracted page data
private static class CachedPage {
    List textElements;
    List imageElements;
    // ... other page data
}
Map<String, Map<Integer, CachedPage>> pageCache = ...;

// Option 2: Keep the PDF open with a RandomAccessFile
private static class CachedPdfDocument {
    private final RandomAccessReadBufferedFile randomAccess;
    private final PDDocument document; // Keep open!
}

// Option 3: Pre-split pages at upload time
// Store each page as a separate lightweight JSON blob
```
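
As a concrete sketch of Option 1, the page cache can reuse the Caffeine dependency already proposed for Issue #3. `CachedPage` and `extractPage(...)` below are illustrative stand-ins for the service's real extraction output, not existing code.

```java
// Bounded, expiring page cache: jobId → (page number → extracted page)
private final Cache<String, ConcurrentMap<Integer, CachedPage>> pageCache =
    Caffeine.newBuilder()
        .maximumSize(100)                        // at most 100 jobs' pages
        .expireAfterAccess(30, TimeUnit.MINUTES) // align with the document TTL
        .build();

public CachedPage getPage(String jobId, int pageNumber) {
    ConcurrentMap<Integer, CachedPage> pages =
        pageCache.get(jobId, id -> new ConcurrentHashMap<>());
    // The full PDF is parsed only on a miss for this specific page
    return pages.computeIfAbsent(pageNumber, p -> extractPage(jobId, p));
}
```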

---

### Issue #7: Large Base64 Operations - MEDIUM Performance

**Location:** `PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630`
**Severity:** MEDIUM (Performance Bottleneck)
**Type:** Synchronous Blocking Operation
**Verified:** ✅ TRUE

**Description:**

```java
// Encode large font programs
String base64 = Base64.getEncoder().encodeToString(fontBytes); // 10MB → 13MB

// Decode large font programs
byte[] bytes = Base64.getDecoder().decode(pdfProgram); // 13MB → 10MB
```

**Problem:**

- Large fonts (embedded TrueType, Type3) can be 5-10MB
- Base64 encoding inflates size by ~33%
- All encoding/decoding is synchronous on request threads
- CPU-intensive operation (20-50ms for 10MB)

**Impact:**

```
100 concurrent requests processing 10MB fonts:
- Each request: 30ms CPU time for Base64
- All threads blocked on encoding simultaneously
- Thread pool saturation (if using a fixed-size pool)
- Other requests starved waiting for threads

Large PDF with 50 fonts:
- 50 × 30ms = 1.5 seconds just for Base64 operations
- User perceives slowness
```

**Recommendation:**

```java
// Option 1: Size limits
private static final int MAX_FONT_SIZE = 10 * 1024 * 1024; // 10MB

if (fontBytes.length > MAX_FONT_SIZE) {
    throw new IllegalArgumentException("Font too large: " + fontBytes.length);
}

// Option 2: Streaming Base64 (for very large files)
OutputStream base64Out = Base64.getEncoder().wrap(outputStream);
inputStream.transferTo(base64Out);

// Option 3: Async processing
CompletableFuture<String> encodeFuture = CompletableFuture.supplyAsync(
    () -> Base64.getEncoder().encodeToString(fontBytes),
    fontEncodingExecutor);
```
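
Option 2 expanded into a self-contained method: streaming through the JDK's wrapping `java.util.Base64` encoder keeps the extra memory constant instead of materializing the ~33%-larger encoded string in the heap. The path arguments and the method name are illustrative.

```java
// Stream a font program into its Base64 form with O(1) extra memory
static void encodeFontToFile(Path fontPath, Path base64Path) throws IOException {
    try (InputStream in = Files.newInputStream(fontPath);
            OutputStream out = Files.newOutputStream(base64Path);
            OutputStream base64Out = Base64.getEncoder().wrap(out)) {
        in.transferTo(base64Out); // copies in fixed-size chunks; closing flushes padding
    }
}
```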

---

### Issue #8: File I/O on Request Threads - MEDIUM

**Location:** `PdfJsonConversionService.java:276, 405, 5066`
**Severity:** MEDIUM (Performance)
**Type:** Blocking I/O
**Verified:** ⚠️ PARTIALLY TRUE

**Description:**

```java
// Line 276: Write upload to disk
file.transferTo(originalFile.getFile());

// Line 405: Read full file into memory
byte[] cachedPdfBytes = Files.readAllBytes(workingPath);

// Line 5066: Get uploaded file bytes
byte[] pdfBytes = file.getBytes();
```

**Clarification:**

- These are in DIFFERENT methods (not double-reads within one operation)
- Each method reads the file once
- Still synchronous blocking I/O

**Impact:**

- Large uploads (100MB) block a request thread for seconds
- No async or streaming support
- Thread pool saturation under high upload volume

**Recommendation:**

```java
// Async file I/O
CompletableFuture<Path> uploadFuture = CompletableFuture.supplyAsync(
    () -> {
        try {
            Path tempPath = Files.createTempFile("pdf-upload", ".pdf");
            file.transferTo(tempPath.toFile());
            return tempPath;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    },
    fileIoExecutor);

// Stream large files
try (InputStream in = file.getInputStream();
        OutputStream out = Files.newOutputStream(targetPath)) {
    in.transferTo(out);
}
```

---

## LOW SEVERITY ISSUES

### Issue #9: PdfLazyLoadingService Unused - LOW

**Location:** `PdfLazyLoadingService.java` (entire file)
**Severity:** LOW (Code Quality)
**Type:** Dead Code
**Verified:** ✅ TRUE

**Description:**

- Complete service implementation exists
- Has its own `documentCache` and cleanup logic
- Duplicates functionality in `PdfJsonConversionService`
- Not wired to any controller
- Not imported by any other class

**Impact:**

- Code maintenance burden
- Confusing for developers
- Potential for accidental use in the future
- Cache divergence if both ever get used

**Recommendation:**

```java
// Delete PdfLazyLoadingService.java entirely
// Or clearly mark it as @Deprecated with an explanation
```

---

### Issue #10: PdfJsonFontService Volatile Fields - LOW

**Location:** `PdfJsonFontService.java:46-47`
**Severity:** LOW (Actually Correct)
**Type:** None (Good Practice)
**Verified:** ✅ TRUE (No issue, correctly implemented)

**Description:**

```java
private volatile boolean pythonCffConverterAvailable;
private volatile boolean fontForgeCffConverterAvailable;

@PostConstruct
private void initialiseCffConverterAvailability() {
    pythonCffConverterAvailable = isCommandAvailable(pythonCommand);
    fontForgeCffConverterAvailable = isCommandAvailable(fontforgeCommand);
}
```

**Why This Is Correct:**

- `volatile` ensures visibility across threads
- Set once at startup
- Read many times (thread-safe)
- No synchronization needed

**Recommendation:** None - this is good practice.

---

## VERIFIED FALSE CLAIMS

### Claim: file.getBytes() Called Twice

**Status:** ❌ FALSE

**Explanation:** The claim stated that `file.getBytes()` is called twice (lines 446, 5065). Investigation shows:

- Line 446: `convertJsonToPdf` method
- Line 5065: `extractDocumentMetadata` method
- These are DIFFERENT methods for DIFFERENT operations
- Each method calls `getBytes()` only once

**Conclusion:** Not a double-read issue.

---

### Claim: Image Base64 Encoding Per Call

**Status:** ❌ FALSE

**Explanation:** The claim stated that images are Base64-encoded on every call. Investigation shows:

```java
// PdfJsonImageService.java:430-450
private EncodedImage getOrEncodeImage(PDImage pdImage) {
    COSBase key = xObject.getCOSObject();
    EncodedImage cached = imageCache.get(key); // Cache check!
    if (cached != null) {
        return cached; // Cache hit
    }
    EncodedImage encoded = encodeImage(pdImage);
    imageCache.put(key, encoded); // Cache miss: encode and store
    return encoded;
}
```

**Conclusion:** Images ARE cached. Only stencil and inline images bypass the cache.

---

## ARCHITECTURE ISSUES

### Issue #11: Singleton Service Architecture - MEDIUM

**Location:** All `@Service` and `@Component` classes
**Severity:** MEDIUM (Maintainability)
**Type:** Architectural Pattern

**Description:** All services use the default singleton scope:

```java
@Service // Defaults to singleton
public class PdfJsonConversionService {
    // Shared instance variables across all requests
    private final Map<String, PDFont> type3NormalizedFontCache = ...;
}
```

**Implications:**

✅ **Good:**

- Most dependencies are stateless and injected
- Caches use ConcurrentHashMap (thread-safe)
- No mutable instance variables beyond caches

⚠️ **Risks:**

- Singleton means shared state across all requests
- Requires careful synchronization
- Easy for future developers to introduce thread-unsafe code
- Difficult to test concurrent scenarios

**Recommendation:**

- Document thread-safety requirements prominently
- Add unit tests for concurrent access
- Consider request-scoped services for mutable state (see the sketch below)
- Add a code-review checklist item for new instance variables
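
If mutable per-request state ever becomes necessary, one option is a request-scoped holder injected as a scoped proxy, which keeps that state out of the shared singleton. A minimal sketch (the class and field names are illustrative, not existing code):

```java
import org.springframework.context.annotation.Scope;
import org.springframework.context.annotation.ScopedProxyMode;
import org.springframework.stereotype.Component;
import org.springframework.web.context.WebApplicationContext;

@Component
@Scope(value = WebApplicationContext.SCOPE_REQUEST,
        proxyMode = ScopedProxyMode.TARGET_CLASS)
public class ConversionRequestState {
    private String currentJobId; // one instance per HTTP request; safe to mutate

    public String getCurrentJobId() { return currentJobId; }
    public void setCurrentJobId(String jobId) { this.currentJobId = jobId; }
}
```

Singleton services can then inject `ConversionRequestState` directly; the proxy resolves to the current request's instance at call time.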

---

## SUMMARY BY SEVERITY

### CRITICAL (Fix Immediately)

1. **pageFontResources PDFont keys** - Broken feature (Issue #1)
2. **Unbounded thread creation** - Resource exhaustion (Issue #2)

### HIGH (Fix Soon)

3. **Unbounded cache growth** - Memory leak (Issue #3)
4. **Type3 cache race** - Duplicate work (Issue #4)
5. **PDDocument lifecycle** - Needs investigation, speculative (Issue #5)

### MEDIUM (Plan and Address)

6. **Full PDF reload per page** - Performance (Issue #6)
7. **Large Base64 operations** - Performance (Issue #7)
8. **File I/O blocking** - Performance (Issue #8)
9. **Singleton architecture** - Maintainability (Issue #11)

### LOW (Monitor)

10. **PdfLazyLoadingService unused** - Dead code (Issue #9)
11. **PdfJsonFontService volatile** - Correctly implemented, no action needed (Issue #10)

### VERIFIED FALSE

12. file.getBytes() called twice
13. Image Base64 encoding per call

---

## IMPLEMENTATION ROADMAP

### Phase 1: Critical Fixes (1-2 weeks)

**1. Fix pageFontResources PDFont keys (Issue #1)**

```java
// Priority: CRITICAL
// Time: 3-5 days
// Risk: Medium (requires careful testing)

// Replace PDFont keys with String resource names
// Update cache lookup logic
```

**2. Fix Thread Leaks (Issue #2)**

```java
// Priority: CRITICAL
// Time: 1 day
// Risk: Low (well-understood solution)

// Replace new Thread() with ScheduledExecutorService
// Add @PreDestroy cleanup
// Monitor thread counts
```

### Phase 2: Resource Management (1 week)

**3. Add Cache Eviction (Issue #3)**

```java
// Priority: HIGH
// Time: 3 days
// Risk: Low (library-based solution)

// Integrate Caffeine cache
// Set size limits, TTL
// Add eviction logging
// Monitor cache metrics
```

**4. Fix Type3 Cache Race (Issue #4)**

```java
// Priority: HIGH
// Time: 1-2 days
// Risk: Low (straightforward fix)

// Use computeIfAbsent for atomic operations
```

### Phase 3: Performance Optimization (2-3 weeks)

**5. Optimize Lazy Loading (Issue #6)**

```java
// Priority: MEDIUM
// Time: 1 week
// Risk: Medium (requires benchmarking)

// Cache extracted page data
// Or: keep the PDDocument open with a RandomAccessFile
// Or: pre-split pages at upload
```

**6. Async I/O (Issues #7, #8)**

```java
// Priority: MEDIUM
// Time: 3-5 days
// Risk: Medium (requires async architecture changes)

// Add dedicated I/O thread pool
// Async file operations
// Stream large files
```

### Phase 4: Code Quality (1 week)

**7. Remove Dead Code (Issue #9)**

```java
// Priority: LOW
// Time: 1 day
// Risk: None

// Delete PdfLazyLoadingService
// Clean up unused imports
```

**8. Documentation & Testing**

```java
// Priority: MEDIUM
// Time: 3-5 days

// Add thread-safety documentation
// Concurrent integration tests
// Load testing scripts
```

---

## TESTING STRATEGY

### 1. Concurrency Tests

```java
@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class ConcurrencyTest {

    @Test
    void testConcurrentCacheAccess() throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(20);
        CountDownLatch latch = new CountDownLatch(100);
        List<Future<?>> futures = new ArrayList<>();

        // 100 requests across 10 jobIds (10 requests per job)
        for (int i = 0; i < 100; i++) {
            String jobId = "job-" + (i % 10);
            int pageNum = (i % 5) + 1;
            futures.add(executor.submit(() -> {
                try {
                    service.extractSinglePage(jobId, pageNum);
                } catch (Exception e) {
                    log.error("Concurrent access failed", e);
                    throw new RuntimeException(e);
                } finally {
                    latch.countDown();
                }
            }));
        }

        // Wait for completion
        assertTrue(latch.await(60, TimeUnit.SECONDS));

        // Check for exceptions
        for (Future<?> future : futures) {
            future.get(); // Throws if any task failed
        }
    }

    @Test
    void testCacheNotCorrupted() throws Exception {
        // Upload document
        String jobId = "test-job";
        service.extractDocumentMetadata(testPdf, jobId);

        // Concurrent page requests
        ExecutorService executor = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            int page = (i % 10) + 1;
            futures.add(executor.submit(() -> service.extractSinglePage(jobId, page)));
        }

        // All should succeed without ConcurrentModificationException
        for (Future<?> future : futures) {
            assertNotNull(future.get());
        }
    }
}
```
### 2. Memory Leak Tests

```java
@Test
void testCacheDoesNotGrowUnbounded() throws Exception {
    long initialHeap = getHeapUsage();

    // Process 10,000 small PDFs with Type3 fonts
    for (int i = 0; i < 10000; i++) {
        service.convertPdfToJson(createTestPdfWithType3Fonts());
    }

    // Force GC
    System.gc();
    Thread.sleep(1000);

    long finalHeap = getHeapUsage();
    long growth = finalHeap - initialHeap;

    // Caches should not grow beyond a reasonable limit
    assertThat(growth).isLessThan(100_000_000); // 100MB max
}

@Test
void testThreadsNotLeaking() {
    int initialThreads = getActiveThreadCount();

    // Upload 100 PDFs (currently spawns 100 cleanup threads)
    for (int i = 0; i < 100; i++) {
        service.extractDocumentMetadata(testPdf, "job-" + i);
    }

    int peakThreads = getActiveThreadCount();

    // Should not create 100+ threads
    assertThat(peakThreads - initialThreads).isLessThan(10);
}

private long getHeapUsage() {
    Runtime runtime = Runtime.getRuntime();
    return runtime.totalMemory() - runtime.freeMemory();
}

private int getActiveThreadCount() {
    return Thread.getAllStackTraces().size();
}
```

### 3. Security Tests

```java
@Test
void testJobIdIsolation() {
    // User A uploads a PDF
    String jobIdA = service.extractDocumentMetadata(userAPdf, sessionA);

    // User B tries to access User A's jobId
    assertThrows(AccessDeniedException.class, () -> {
        service.extractSinglePage(jobIdA, 1, sessionB);
    });
}

@Test
void testJobIdUnpredictable() {
    Set<String> jobIds = new HashSet<>();
    for (int i = 0; i < 1000; i++) {
        String jobId = service.extractDocumentMetadata(testPdf, session);
        jobIds.add(jobId);
    }

    // All jobIds should be unique UUIDs
    assertThat(jobIds).hasSize(1000);

    // Should not be sequential
    List<String> sorted = new ArrayList<>(jobIds);
    Collections.sort(sorted);
    assertThat(sorted).isNotEqualTo(new ArrayList<>(jobIds));
}
```

### 4. Performance Tests

```java
@Test
void testLargeFilePerformance() {
    // 100MB PDF
    byte[] largePdf = createLargePdf(100 * 1024 * 1024);

    long start = System.currentTimeMillis();
    String json = service.convertPdfToJson(largePdf);
    long duration = System.currentTimeMillis() - start;

    // Should complete in a reasonable time
    assertThat(duration).isLessThan(30_000); // 30 seconds
}

@Test
void testConcurrentThroughput() throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(50);
    CountDownLatch latch = new CountDownLatch(500);

    long start = System.currentTimeMillis();
    for (int i = 0; i < 500; i++) {
        executor.submit(() -> {
            try {
                service.convertPdfToJson(testPdf);
            } finally {
                latch.countDown();
            }
        });
    }
    latch.await();
    long duration = System.currentTimeMillis() - start;

    // 500 conversions should complete in a reasonable time
    double throughput = 500.0 / (duration / 1000.0);
    assertThat(throughput).isGreaterThan(10); // At least 10 conversions/sec
}
```

---

## MONITORING & METRICS

### Recommended Metrics (Micrometer)

```java
@Service
public class PdfJsonConversionService {

    private final MeterRegistry meterRegistry;

    // Cache size gauges
    @PostConstruct
    void registerMetrics() {
        Gauge.builder("pdf.cache.document.size", documentCache, Map::size)
            .description("Number of cached documents")
            .register(meterRegistry);

        Gauge.builder("pdf.cache.type3font.size", type3NormalizedFontCache, Map::size)
            .description("Number of cached Type3 fonts")
            .register(meterRegistry);

        Gauge.builder("pdf.cache.coverage.size", type3GlyphCoverageCache, Map::size)
            .description("Number of cached glyph coverage sets")
            .register(meterRegistry);

        Gauge.builder("pdf.threads.cleanup", this::getCleanupThreadCount)
            .description("Active cleanup threads")
            .register(meterRegistry);
    }

    // Operation timers
    public String convertPdfToJson(byte[] pdfBytes) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            String result = doConvertPdfToJson(pdfBytes);
            sample.stop(meterRegistry.timer("pdf.convert.toJson"));
            return result;
        } catch (Exception e) {
            meterRegistry.counter("pdf.convert.errors", "operation", "toJson").increment();
            throw e;
        }
    }

    // Cache hit/miss counters
    private PDFont getCachedType3Font(String uid) {
        PDFont cached = type3NormalizedFontCache.get(uid);
        if (cached != null) {
            meterRegistry.counter("pdf.cache.type3font.hits").increment();
        } else {
            meterRegistry.counter("pdf.cache.type3font.misses").increment();
        }
        return cached;
    }
}
```

### Alerts

```yaml
alerts:
  # Cache growth
  - name: DocumentCacheTooLarge
    condition: pdf_cache_document_size > 100
    severity: warning

  - name: Type3CacheTooLarge
    condition: pdf_cache_type3font_size > 1000
    severity: warning

  # Thread leaks
  - name: TooManyCleanupThreads
    condition: pdf_threads_cleanup > 10
    severity: critical

  # Memory pressure
  - name: HeapUsageHigh
    condition: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
    severity: warning

  # Performance
  - name: SlowConversions
    condition: pdf_convert_toJson{quantile="0.95"} > 10s
    severity: warning

  # Error rate
  - name: HighErrorRate
    condition: rate(pdf_convert_errors[5m]) > 0.1
    severity: critical
```

---

## CONCLUSION

The PDF JSON editor has **CRITICAL** issues that must be fixed before production deployment:

### Must-Fix Issues (Block Production):

1. **pageFontResources broken** - Lazy page loading completely broken (PDFont instances as keys)
2. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError

### Should-Fix Issues (Prevent Scale):

3. **Unbounded cache growth** - Memory leaks require server restarts
4. **Type3 cache races** - Wasted CPU doing duplicate work
5. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet)

### Performance Improvements (Nice-to-Have):

6. Full PDF reload per page
7. Large Base64 operations
8. Synchronous file I/O

### Code Quality Issues:

9. PdfLazyLoadingService dead code
10. Documentation of thread-safety requirements

**Estimated Effort:**

- Critical fixes: 1-2 weeks (Issues #1, #2)
- High priority: 1 week (Issues #3, #4)
- Performance: 2-3 weeks (Issues #6-8)
- **Total:** 4-6 weeks for complete remediation

**Recommendation:** Fix the critical issues (#1, #2) immediately, then address the high-priority issues before beta testing.