# PDF JSON Editor - Threading, Concurrency, and Performance Analysis
**Date:** 2025-01-09
**Version:** 1.0
**Status:** Comprehensive analysis combining automated review and manual verification
---
## Executive Summary
This analysis identifies **CRITICAL** reliability and thread-safety defects and performance problems in the PDF JSON editor codebase. The service contains:
- **2 CRITICAL issues** requiring immediate attention
- **2 HIGH severity issues**: unbounded cache growth (a memory leak) and a cache race condition
- **Multiple performance bottlenecks** limiting scalability
**Immediate Action Required:**
1. Fix pageFontResources PDFont key issue (Issue #1)
2. Replace unbounded thread spawning (Issue #2)
3. Add cache size limits (Issue #3)
4. Fix Type3 cache race condition (Issue #4)
---
## CRITICAL ISSUES
### Issue #1: pageFontResources Keyed by PDFont Instances - CRITICAL
**Location:** `PdfJsonConversionService.java:5075-5081, 5158-5159`
**Severity:** CRITICAL (Broken Functionality)
**Type:** Object Identity Mismatch, Cache Miss
**Verified:** ✅ TRUE
**Description:**
```java
// Lines 5075-5081: initial metadata extraction (first PDF load)
try (PDDocument document = pdfDocumentFactory.load(pdfBytes, true)) {
    Map<Integer, Map<PDFont, String>> pageFontResources = new HashMap<>();
    for (PDPage page : document.getPages()) {
        Map<PDFont, String> resourceMap = collectFontsForPage(...);
        // resourceMap keys are PDFont instances from this document
        pageFontResources.put(pageNumber, resourceMap);
    }
    // Cache it
    documentCache.put(jobId, new CachedPdfDocument(..., pageFontResources, ...));
} // PDDocument closed; PDFont instances now reference the freed document

// Lines 5158-5159: lazy page extraction (reloads the PDF)
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    // NEW PDFont instances created!
    PDPage page = document.getPage(pageIndex);
    // Try to look up fonts
    Map<PDFont, String> cachedResourceMap = cached.getPageFontResources().get(pageNum);
    // cachedResourceMap keys are OLD PDFont instances from the closed document;
    // lookups use NEW PDFont instances
    String fontId = cachedResourceMap.get(newFont); // Always null! ← BUG
}
```
**Why It Fails:**
1. **Object identity:** `PDFont` does not override `equals`/`hashCode`, so `Map<PDFont, String>` compares keys by object identity
2. **Different instances:** each PDF load creates new PDFont instances with different identities
3. **Lookup fails:** `cachedResourceMap.get(newFont)` returns null because `newFont != oldFont`
4. **Caching defeated:** every lazy page request rebuilds font metadata, defeating the cache
**Impact:**
- Lazy page loading doesn't reuse cached font metadata
- CPU wasted rebuilding font info on every page request
- Cache only stores garbage (unusable keys)
- "Consistent font UID" feature completely broken
**Evidence:**
```java
// No code actually uses the cached pageFontResources successfully
// Every extractSinglePage call rebuilds fonts from scratch
```
**Recommendation:**
```java
// Use resource names as keys instead of PDFont objects
Map<Integer, Map<String, String>> pageFontResources = new HashMap<>();
// Key: font resource name (e.g., "F1"), Value: font UID
// Or use font UID directly
Map<Integer, Set<String>> pageFontUids = new HashMap<>();
```
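A minimal sketch of the reload-safe lookup under that change, reusing names from the pseudocode above (`pageIndex`, `pageNum`, and `getPageFontResources` now mapping resource name → font UID):
```java
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    PDPage page = document.getPage(pageIndex);
    PDResources resources = page.getResources();
    Map<String, String> nameToUid = cached.getPageFontResources().get(pageNum);
    for (COSName fontName : resources.getFontNames()) {
        // Resource names like "F1" are stable across reloads,
        // so this lookup hits where the PDFont-keyed one missed
        String fontUid = nameToUid.get(fontName.getName());
        // ... reuse fontUid for consistent font metadata
    }
}
```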
---
### Issue #2: Unbounded Thread Creation - CRITICAL Resource Leak
**Location:** `PdfJsonConversionService.java:5550-5562`
**Severity:** CRITICAL (Resource Exhaustion)
**Type:** Thread Leak, Memory Leak
**Verified:** ✅ TRUE
**Description:**
```java
private void scheduleDocumentCleanup(String jobId) {
    new Thread(
            () -> {
                try {
                    Thread.sleep(TimeUnit.MINUTES.toMillis(30)); // Sleep 30 minutes!
                    clearCachedDocument(jobId);
                    log.debug("Auto-cleaned cached document for jobId: {}", jobId);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            })
            .start(); // Unmanaged thread!
}
```
**Also in:** `PdfLazyLoadingService.java:256-269` (duplicate implementation)
**Problems:**
1. **One Thread Per Upload:** Each PDF upload spawns a new thread
2. **No Thread Pool:** Unlimited thread creation (no cap)
3. **Long Sleep:** Threads sleep for 30 minutes holding resources
4. **No Cancellation:** Cannot stop cleanup threads if job completes early
5. **Non-Daemon:** Threads prevent JVM shutdown (daemon=false by default)
6. **No Monitoring:** No visibility into active cleanup threads
**Impact Under Load:**
```
100 concurrent uploads → 100 cleanup threads spawned
Each thread: ~1MB stack + overhead
Total: ~100MB+ wasted on sleeping threads
1000 uploads/day → 1000+ threads accumulate
OS thread limit (ulimit -u) exceeded → OutOfMemoryError: unable to create new native thread
```
**Production Failure Scenario:**
```
09:00 - Peak traffic: 500 PDFs uploaded
09:00 - 500 cleanup threads spawned (each sleeps 30 min)
09:15 - 400 more uploads → 400 more threads
09:30 - First batch wakes up, cleans, exits
But new threads keep getting created faster than old ones die
10:00 - Thread count > 2000
10:15 - JVM hits OS thread limit
10:16 - Server crashes: "OutOfMemoryError: unable to create new native thread"
```
**Recommendation:**
```java
@Service
public class PdfJsonConversionService {

    // Fixed-size thread pool for cleanup
    private final ScheduledExecutorService cleanupScheduler =
            Executors.newScheduledThreadPool(
                    2, // Only 2 threads needed
                    new ThreadFactoryBuilder()
                            .setNameFormat("pdf-cache-cleanup-%d")
                            .setDaemon(true)
                            .build());

    private void scheduleDocumentCleanup(String jobId) {
        cleanupScheduler.schedule(() -> clearCachedDocument(jobId), 30, TimeUnit.MINUTES);
    }

    @PreDestroy
    public void shutdown() {
        cleanupScheduler.shutdown();
        try {
            if (!cleanupScheduler.awaitTermination(10, TimeUnit.SECONDS)) {
                cleanupScheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            cleanupScheduler.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```
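A further refinement: `schedule()` returns a `ScheduledFuture`, so cleanup can also be cancelled when a job completes early (problem 4 above). A sketch, where the `cleanupTasks` map is an assumed addition rather than existing code:
```java
private final Map<String, ScheduledFuture<?>> cleanupTasks = new ConcurrentHashMap<>();

private void scheduleDocumentCleanup(String jobId) {
    ScheduledFuture<?> task =
            cleanupScheduler.schedule(() -> clearCachedDocument(jobId), 30, TimeUnit.MINUTES);
    cleanupTasks.put(jobId, task);
}

private void cancelDocumentCleanup(String jobId) {
    ScheduledFuture<?> task = cleanupTasks.remove(jobId);
    if (task != null) {
        task.cancel(false); // don't interrupt a cleanup already in progress
    }
}
```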
---
## HIGH SEVERITY ISSUES
### Issue #3: Unbounded Cache Growth - HIGH Memory Leak
**Location:** `PdfJsonConversionService.java:147-148, 154`
**Severity:** HIGH (Memory Exhaustion)
**Type:** Memory Leak, Missing Eviction Policy
**Verified:** ✅ TRUE
**Description:**
```java
// Line 147: Type3 normalized fonts - cleared only in convertJsonToPdf (line 454)
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();
// Line 148: Type3 glyph coverage - NEVER CLEARED anywhere in codebase!
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();
// Line 154: Document cache - relies on buggy cleanup threads
private final Map<String, CachedPdfDocument> documentCache = new ConcurrentHashMap<>();
```
**Growth Patterns:**
**1. type3NormalizedFontCache:**
- Written at line 3766: `type3NormalizedFontCache.put(fontModel.getUid(), font)`
- Cleared only at line 454: `type3NormalizedFontCache.clear()` (JSON→PDF conversion)
- **NOT cleared during PDF→JSON** conversion (most common operation)
- Each PDFont retains its decoded font program and parsing structures on the heap
- Grows unbounded during PDF→JSON operations
**2. type3GlyphCoverageCache:**
- Written at line 1122: `type3GlyphCoverageCache.put(fontUid, coverageSet)`
- **NEVER CLEARED** in entire codebase (verified via grep)
- Accumulates Set<Integer> for every Type3 font ever processed
- Each Set can contain thousands of integers (Unicode codepoints)
- Pure memory leak
**3. documentCache:**
- Stores full PDF bytes in memory
- Each entry can be 1MB-100MB+ (document bytes + metadata)
- Relies on cleanup threads (which have the problems described in Issue #2)
- If cleanup fails (exception, server restart), entries stay forever
- No max size check
**Impact:**
```
Long-running server processes 10,000 Type3 fonts:
- type3GlyphCoverageCache: 10,000 entries × ~1KB each = 10MB
- type3NormalizedFontCache: 1,000 cached fonts × ~100KB each = 100MB
- documentCache: 50 active jobs × 10MB each = 500MB
After 1 week: Caches grow to 1GB+
After 1 month: OutOfMemoryError, server restart required
```
**Recommendation:**
```java
// Use Caffeine caches with eviction policies
private final Cache<String, PDFont> type3NormalizedFontCache =
        Caffeine.newBuilder()
                .maximumSize(1000) // Max 1000 fonts
                .expireAfterAccess(1, TimeUnit.HOURS) // Expire after 1hr unused
                .removalListener((String key, PDFont value, RemovalCause cause) -> {
                    // Cleanup PDFont resources if needed
                })
                .build();

private final Cache<String, Set<Integer>> type3GlyphCoverageCache =
        Caffeine.newBuilder()
                .maximumSize(5000)
                .expireAfterWrite(1, TimeUnit.HOURS)
                .build();

private final Cache<String, CachedPdfDocument> documentCache =
        Caffeine.newBuilder()
                .maximumWeight(500_000_000) // 500MB max
                .weigher((String key, CachedPdfDocument doc) -> doc.getPdfBytes().length)
                .expireAfterWrite(30, TimeUnit.MINUTES)
                .removalListener((String key, CachedPdfDocument value, RemovalCause cause) ->
                        log.info("Evicted document {} (cause: {})", key, cause))
                .build();
```
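Note that Caffeine's `Cache` is not a `java.util.Map`, so the existing call sites change slightly; a sketch of the affected operations:
```java
// was type3NormalizedFontCache.get(fontUid) on the ConcurrentHashMap
PDFont font = type3NormalizedFontCache.getIfPresent(fontUid);

// put(...) keeps the same shape
type3NormalizedFontCache.put(fontUid, font);

// was type3NormalizedFontCache.clear() at line 454
type3NormalizedFontCache.invalidateAll();
```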
---
### Issue #4: Type3 Cache Race Condition - HIGH
**Location:** `PdfJsonConversionService.java:3759-3773`
**Severity:** HIGH (Duplicate Work)
**Type:** Check-Then-Act Race Condition
**Verified:** ✅ TRUE
**Description:**
```java
private void loadNormalizedType3Font(
        PDDocument document,
        PdfJsonFont fontModel,
        List<FontByteSource> candidates,
        String originalFormat) throws IOException {
    if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
        return;
    }
    if (type3NormalizedFontCache.containsKey(fontModel.getUid())) { // CHECK
        return;
    }
    for (FontByteSource source : candidates) {
        PDFont font = loadFontFromSource(...); // EXPENSIVE: 10-50ms
        if (font != null) {
            type3NormalizedFontCache.put(fontModel.getUid(), font); // ACT
            log.info("Cached normalized font {} for Type3 {}", ...);
            break;
        }
    }
}
```
**Race Condition:**
```
Thread A: Check cache for "1:F1" → MISS (line 3759)
Thread B: Check cache for "1:F1" → MISS (line 3759) [both pass check!]
Thread A: Load font from bytes (10ms I/O + parsing)
Thread B: Load font from bytes (10ms I/O + parsing) ← DUPLICATE WORK
Thread A: Put font in cache (line 3766)
Thread B: Put font in cache (line 3766) [overwrites A's entry]
```
**Why ConcurrentHashMap Doesn't Help:**
- ConcurrentHashMap prevents **corruption** (map state stays consistent)
- ConcurrentHashMap does NOT prevent **duplicate work** (both threads compute)
- The check (`containsKey`) and act (`put`) are separate operations
**Impact:**
- Wasted CPU cycles loading same font twice
- Temporary memory spike (two fonts in heap simultaneously)
- Font loading is expensive: Base64 decode + PDFBox parsing + font validation
- Under high concurrency, 10+ threads could all load the same font
**Recommendation:**
```java
private void loadNormalizedType3Font(...) throws IOException {
    if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
        return;
    }
    // Atomic compute-if-absent: only one thread runs the loader per UID
    type3NormalizedFontCache.computeIfAbsent(fontModel.getUid(), uid -> {
        for (FontByteSource source : candidates) {
            try {
                PDFont font = loadFontFromSource(...);
                if (font != null) {
                    log.info("Cached normalized font {} for Type3 {}", ...);
                    return font;
                }
            } catch (IOException e) {
                log.warn("Failed to load font from {}: {}", source.originLabel(), e.getMessage());
            }
        }
        return null; // no mapping is stored when the loader returns null
    });
}
```
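If the Caffeine migration from Issue #3 lands first, the same per-key atomicity comes from `Cache.get(key, mappingFunction)`, which runs the loader at most once per key; a sketch with `loadFirstAvailableFont` as a hypothetical helper wrapping the candidate loop above:
```java
PDFont font = type3NormalizedFontCache.get(
        fontModel.getUid(),
        uid -> loadFirstAvailableFont(candidates)); // returning null stores nothing
```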
---
### Issue #5: PDDocument Resource Lifecycle - NEEDS INVESTIGATION
**Location:** `PdfJsonConversionService.java:3766, 5158`
**Severity:** UNKNOWN (Requires Investigation)
**Type:** Unclear Resource Ownership
**Verified:** ⚠️ SPECULATIVE (No concrete evidence of failure)
**Description:**
```java
// Line 3766: cache a PDFont created from a PDDocument
try (PDDocument document = ...) {
    PDFont font = loadFontFromSource(document, fontModel, source, ...);
    type3NormalizedFontCache.put(fontModel.getUid(), font);
} // PDDocument is closed here!

// Later: the cached PDFont is used with a DIFFERENT PDDocument
try (PDDocument newDocument = ...) {
    PDFont cachedFont = type3NormalizedFontCache.get(fontUid);
    // Is cachedFont safe to use after the original document closed?
    // Does it hold references to resources freed when the document closed?
}
```
**Theoretical Concerns:**
1. **Resource ties:** PDFont may hold references to its source PDDocument's COS objects and streams
2. **Closed resources:** using a PDFont after its document closes could read from streams that are no longer open
3. **Unclear contract:** PDFBox documentation doesn't explicitly address font lifecycle across documents
**Current Status:**
- ⚠️ **NO EVIDENCE OF ACTUAL FAILURES** - System appears to work in practice
- ⚠️ **NO CRASHES OBSERVED** - No segmentation faults or memory corruption reported
- ⚠️ **NO MEMORY LEAKS DETECTED** - No profiler data showing leaks
- ⚠️ **PURELY THEORETICAL CONCERN** - Based on API design, not observed behavior
**Why This May Actually Be Safe:**
- PDFBox may create self-contained PDFont objects
- Font data may be copied rather than referenced
- PDFBox may be designed for this use case
- Current code has been running without apparent issues
**Required Investigation:**
1. **PDFBox Source Code Review:** Check if PDFont copies or references document data
2. **Load Testing:** Create PDFont, close document, use font in new document
3. **Memory Profiling:** Monitor heap usage for leaks over extended runs
4. **PDFBox Documentation/Forums:** Search for guidance on font lifecycle
**Recommendation:**
- **Priority: MEDIUM** (needs investigation but not blocking)
- Add monitoring for potential issues
- Test font reuse after document closure explicitly
- If problems found, cache serialized bytes instead of PDFont objects
```java
// Option 1: cache font bytes instead of PDFont objects
private void cacheType3FontBytes(String fontUid, byte[] fontBytes) {
    type3FontBytesCache.put(fontUid, fontBytes);
}

// Option 2: verify a cached font before using it
private PDFont getCachedFont(String fontUid) {
    PDFont cached = type3NormalizedFontCache.get(fontUid);
    if (cached != null && !isFontValid(cached)) {
        log.warn("Cached font {} is invalid, removing", fontUid);
        type3NormalizedFontCache.remove(fontUid);
        return null;
    }
    return cached;
}
```
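For investigation item 2, a test sketch along these lines would settle the question empirically (assumes PDFBox 2.x-style loading and a test PDF whose first page has a font resource named `F1`; names are illustrative):
```java
@Test
void fontUsableAfterSourceDocumentCloses() throws Exception {
    PDFont font;
    try (PDDocument source = PDDocument.load(testPdfBytes)) {
        font = source.getPage(0).getResources().getFont(COSName.getPDFName("F1"));
    } // source document closed here

    // If PDFont depends on the closed document's streams,
    // these calls are where a failure would surface
    assertNotNull(font.getName());
    assertTrue(font.getStringWidth("test") > 0);
}
```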
---
## MEDIUM SEVERITY ISSUES
### Issue #6: Full PDF Reload Per Page - MEDIUM Performance
**Location:** `PdfJsonConversionService.java:5158-5159`
**Severity:** MEDIUM (Performance)
**Type:** Inefficient I/O
**Verified:** ✅ TRUE
**Description:**
```java
// extractSinglePage method
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    // Full PDF loaded from bytes (10-100ms for large PDFs)
    PDPage page = document.getPage(pageIndex);
    // Extract just one page...
}
```
**Problem:**
Every page request loads the entire PDF from bytes:
- 100-page PDF = Load 10MB, extract 1 page
- 10 page requests = 10× full PDF loads (100MB I/O)
- No incremental parsing or streaming
**Impact:**
```
100MB PDF, 50 pages requested sequentially:
- Total I/O: 100MB × 50 = 5GB
- Time: 50× parse time (5-10 seconds total)
- Memory: 100MB peak per request
Concurrent page requests for same PDF:
- 10 threads × 100MB = 1GB temporary memory spike
```
**Why This Exists:**
Lazy loading design trades memory (don't cache full extraction) for CPU (reload on demand). But the tradeoff is poor because:
- PDFBox parsing is expensive
- Repeated decompression of streams
- Could cache extracted page data instead
**Recommendation:**
```java
// Option 1: cache extracted page data
private static class CachedPage {
    List<PdfJsonTextElement> textElements;
    List<PdfJsonImageElement> imageElements;
    // ... other page data
}
Map<String, Map<Integer, CachedPage>> pageCache = ...;

// Option 2: keep the PDF open with random access
private static class CachedPdfDocument {
    private final RandomAccessReadBufferedFile randomAccess;
    private final PDDocument document; // Keep open!
}

// Option 3: pre-split pages at upload time;
// store each page as a separate lightweight JSON blob
```
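A minimal sketch of Option 1, with `CachedPage` from above and `extractPageData` as a hypothetical helper wrapping the existing single-page extraction:
```java
private final Map<String, Map<Integer, CachedPage>> pageCache = new ConcurrentHashMap<>();

private CachedPage getPage(String jobId, int pageIndex) {
    return pageCache
            .computeIfAbsent(jobId, id -> new ConcurrentHashMap<>())
            // the expensive reload + extract runs once per (job, page);
            // subsequent requests are plain map reads
            .computeIfAbsent(pageIndex, idx -> extractPageData(jobId, idx));
}
```
The inner per-job map must be invalidated together with the corresponding documentCache entry so extracted pages do not outlive their document.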
---
### Issue #7: Large Base64 Operations - MEDIUM Performance
**Location:** `PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630`
**Severity:** MEDIUM (Performance Bottleneck)
**Type:** Synchronous Blocking Operation
**Verified:** ✅ TRUE
**Description:**
```java
// Encode large font programs
String base64 = Base64.getEncoder().encodeToString(fontBytes); // 10MB → 13MB
// Decode large font programs
byte[] bytes = Base64.getDecoder().decode(pdfProgram); // 13MB → 10MB
```
**Problem:**
- Large fonts (embedded TrueType, Type3) can be 5-10MB
- Base64 encoding inflates size by ~33%
- All encoding/decoding is synchronous on request threads
- CPU-intensive operation (20-50ms for 10MB)
**Impact:**
```
100 concurrent requests processing 10MB fonts:
- Each request: 30ms CPU time for Base64
- All threads blocked on encoding simultaneously
- Thread pool saturation (if using fixed-size pool)
- Other requests starved waiting for threads
Large PDF with 50 fonts:
- 50 × 30ms = 1.5 seconds just for Base64 operations
- User perceives slowness
```
**Recommendation:**
```java
// Option 1: size limits
private static final int MAX_FONT_SIZE = 10 * 1024 * 1024; // 10MB

if (fontBytes.length > MAX_FONT_SIZE) {
    throw new IllegalArgumentException("Font too large: " + fontBytes.length);
}

// Option 2: streaming Base64 (for very large payloads)
OutputStream base64Out = Base64.getEncoder().wrap(outputStream);
inputStream.transferTo(base64Out);

// Option 3: async processing on a dedicated executor
CompletableFuture<String> encodeFuture = CompletableFuture.supplyAsync(
        () -> Base64.getEncoder().encodeToString(fontBytes),
        fontEncodingExecutor);
```
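If the JSON layer uses Jackson's streaming API (an assumption about this codebase, with `jsonGenerator` as an illustrative name), Option 2 can go further: `JsonGenerator.writeBinary` Base64-encodes straight from an `InputStream`, so the inflated intermediate `String` never materializes:
```java
try (InputStream fontIn = new ByteArrayInputStream(fontBytes)) {
    jsonGenerator.writeFieldName("pdfProgram");
    jsonGenerator.writeBinary(fontIn, fontBytes.length); // streams Base64 to the output
}
```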
---
### Issue #8: File I/O on Request Threads - MEDIUM
**Location:** `PdfJsonConversionService.java:276, 405, 5066`
**Severity:** MEDIUM (Performance)
**Type:** Blocking I/O
**Verified:** ⚠️ PARTIALLY TRUE
**Description:**
```java
// Line 276: Write upload to disk
file.transferTo(originalFile.getFile());
// Line 405: Read full file into memory
byte[] cachedPdfBytes = Files.readAllBytes(workingPath);
// Line 5066: Get uploaded file bytes
byte[] pdfBytes = file.getBytes();
```
**Clarification:**
- These are in DIFFERENT methods (not double-reads within one operation)
- Each method reads the file once
- Still synchronous blocking I/O
**Impact:**
- Large uploads (100MB) block request thread for seconds
- No async or streaming support
- Thread pool saturation under high upload volume
**Recommendation:**
```java
// Async file I/O on a dedicated executor
CompletableFuture<Path> uploadFuture = CompletableFuture.supplyAsync(
        () -> {
            try {
                Path tempPath = Files.createTempFile("pdf-upload", ".pdf");
                file.transferTo(tempPath.toFile());
                return tempPath;
            } catch (IOException e) {
                // Supplier cannot throw checked exceptions
                throw new UncheckedIOException(e);
            }
        },
        fileIoExecutor);

// Stream large files instead of buffering them in memory
try (InputStream in = file.getInputStream();
        OutputStream out = Files.newOutputStream(targetPath)) {
    in.transferTo(out);
}
```
---
## LOW SEVERITY ISSUES
### Issue #9: PdfLazyLoadingService Unused - LOW
**Location:** `PdfLazyLoadingService.java` (entire file)
**Severity:** LOW (Code Quality)
**Type:** Dead Code
**Verified:** ✅ TRUE
**Description:**
- Complete service implementation exists
- Has its own `documentCache` and cleanup logic
- Duplicates functionality in `PdfJsonConversionService`
- Not wired to any controller
- Not imported by any other class
**Impact:**
- Code maintenance burden
- Confusing for developers
- Potential for accidental use in future
- Cache divergence if both ever get used
**Recommendation:**
```java
// Delete PdfLazyLoadingService.java entirely
// Or clearly mark as @Deprecated with explanation
```
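If deletion is deferred, a sketch of an explicit deprecation marker (Javadoc text is illustrative):
```java
/**
 * @deprecated Unused duplicate of the caching logic in
 *     {@link PdfJsonConversionService}; scheduled for removal.
 */
@Deprecated(forRemoval = true) // forRemoval requires Java 9+
@Service
public class PdfLazyLoadingService { ... }
```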
---
### Issue #10: PdfJsonFontService Volatile Fields - LOW
**Location:** `PdfJsonFontService.java:46-47`
**Severity:** LOW (Actually Correct)
**Type:** None (Good Practice)
**Verified:** ✅ TRUE (No issue, correctly implemented)
**Description:**
```java
private volatile boolean pythonCffConverterAvailable;
private volatile boolean fontForgeCffConverterAvailable;

@PostConstruct
private void initialiseCffConverterAvailability() {
    pythonCffConverterAvailable = isCommandAvailable(pythonCommand);
    fontForgeCffConverterAvailable = isCommandAvailable(fontforgeCommand);
}
```
**Why This Is Correct:**
- `volatile` ensures visibility across threads
- Set once at startup
- Read many times (thread-safe)
- No synchronization needed
**Recommendation:** None - this is good practice.
---
## VERIFIED FALSE CLAIMS
### Claim: file.getBytes() Called Twice
**Status:** ❌ FALSE
**Explanation:** The claim stated that `file.getBytes()` is called twice (lines 446, 5065). Investigation shows:
- Line 446: `convertJsonToPdf` method
- Line 5065: `extractDocumentMetadata` method
- These are DIFFERENT methods for DIFFERENT operations
- Each method calls `getBytes()` only once
**Conclusion:** Not a double-read issue.
---
### Claim: Image Base64 Encoding Per Call
**Status:** ❌ FALSE
**Explanation:** The claim stated images are Base64-encoded on every call. Investigation shows:
```java
// PdfJsonImageService.java:430-450
private EncodedImage getOrEncodeImage(PDImage pdImage) {
    COSBase key = xObject.getCOSObject();
    EncodedImage cached = imageCache.get(key); // Cache check!
    if (cached != null) {
        return cached; // Cache hit
    }
    EncodedImage encoded = encodeImage(pdImage);
    imageCache.put(key, encoded); // Cache miss: encode and store
    return encoded;
}
```
**Conclusion:** Images ARE cached. Only stencil and inline images bypass the cache.
---
## ARCHITECTURE ISSUES
### Issue #11: Singleton Service Architecture - MEDIUM
**Location:** All `@Service` and `@Component` classes
**Severity:** MEDIUM (Maintainability)
**Type:** Architectural Pattern
**Description:**
All services use default singleton scope:
```java
@Service // Defaults to singleton scope
public class PdfJsonConversionService {
    // Shared instance variables across all requests
    private final Map<String, PDFont> type3NormalizedFontCache = ...;
}
```
```
**Implications:**
✅ **Good:**
- Most dependencies are stateless and injected
- Caches use ConcurrentHashMap (thread-safe)
- No mutable instance variables beyond caches
⚠️ **Risks:**
- Singleton means shared state across all requests
- Requires careful synchronization
- Easy for future developers to introduce thread-unsafe code
- Difficult to test concurrent scenarios
**Recommendation:**
- Document thread-safety requirements prominently
- Add unit tests for concurrent access
- Consider request-scoped services for mutable state (see the sketch below)
- Code review checklist for new instance variables
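A sketch of the request-scoped option (names are illustrative; `proxyMode` lets the per-request bean be injected into singleton services):
```java
@Component
@Scope(value = WebApplicationContext.SCOPE_REQUEST, proxyMode = ScopedProxyMode.TARGET_CLASS)
public class PdfJsonRequestState {
    // One instance per HTTP request: no cross-request sharing,
    // so no synchronization is needed for this state
    private final Map<String, PDFont> fontsForThisRequest = new HashMap<>();
}
```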
---
## SUMMARY BY SEVERITY
### CRITICAL (Fix Immediately)
1. **pageFontResources PDFont keys** - Broken feature
2. **Unbounded thread creation** - Resource exhaustion
### HIGH (Fix Soon)
3. **Unbounded cache growth** - Memory leak
4. **Type3 cache race** - Duplicate work
5. **PDDocument lifecycle** - Needs investigation (speculative)
### MEDIUM (Plan and Address)
6. **Full PDF reload per page** - Performance
7. **Large Base64 operations** - Performance
8. **File I/O blocking** - Performance
11. **Singleton architecture** - Maintainability
### LOW (Monitor)
9. **PdfLazyLoadingService unused** - Dead code
10. **PdfJsonFontService volatile** - Correctly implemented (no action needed)
### VERIFIED FALSE
12. file.getBytes() called twice
13. Image Base64 encoding per call
---
## IMPLEMENTATION ROADMAP
### Phase 1: Critical Fixes (1-2 weeks)
**1. Fix pageFontResources PDFont keys (Issue #1)**
```java
// Priority: CRITICAL
// Time: 3-5 days
// Risk: Medium (requires careful testing)
// Replace PDFont keys with String resource names
// Update cache lookup logic
```
**2. Fix Thread Leaks (Issue #2)**
```java
// Priority: CRITICAL
// Time: 1 day
// Risk: Low (well-understood solution)
// Replace new Thread() with ScheduledExecutorService
// Add @PreDestroy cleanup
// Monitor thread counts
```
### Phase 2: Resource Management (1 week)
**3. Add Cache Eviction (Issue #3)**
```java
// Priority: HIGH
// Time: 3 days
// Risk: Low (library-based solution)
// Integrate Caffeine cache
// Set size limits, TTL
// Add eviction logging
// Monitor cache metrics
```
**4. Fix Type3 Cache Race (Issue #4)**
```java
// Priority: HIGH
// Time: 1-2 days
// Risk: Low (straightforward fix)
// Use computeIfAbsent for atomic operations
```
### Phase 3: Performance Optimization (2-3 weeks)
**5. Optimize Lazy Loading (Issue #6)**
```java
// Priority: MEDIUM
// Time: 1 week
// Risk: Medium (requires benchmarking)
// Cache extracted page data
// Or: Keep PDDocument open with RandomAccessFile
// Or: Pre-split pages at upload
```
**6. Async I/O (Issues #7, #8)**
```java
// Priority: MEDIUM
// Time: 3-5 days
// Risk: Medium (requires async architecture changes)
// Add dedicated I/O thread pool
// Async file operations
// Stream large files
```
### Phase 4: Code Quality (1 week)
**7. Remove Dead Code (Issue #9)**
```java
// Priority: LOW
// Time: 1 day
// Risk: None
// Delete PdfLazyLoadingService
// Clean up unused imports
```
**8. Documentation & Testing**
```java
// Priority: MEDIUM
// Time: 3-5 days
// Add thread-safety documentation
// Concurrent integration tests
// Load testing scripts
```
---
## TESTING STRATEGY
### 1. Concurrency Tests
```java
@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class ConcurrencyTest {

    @Test
    void testConcurrentCacheAccess() throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(20);
        CountDownLatch latch = new CountDownLatch(100);
        List<Future<?>> futures = new ArrayList<>();

        // 100 requests across 10 jobIds (10 requests per job)
        for (int i = 0; i < 100; i++) {
            String jobId = "job-" + (i % 10);
            int pageNum = (i % 5) + 1;
            futures.add(executor.submit(() -> {
                try {
                    service.extractSinglePage(jobId, pageNum);
                } catch (Exception e) {
                    log.error("Concurrent access failed", e);
                    // Wrap so the Runnable lambda compiles even if the
                    // service method declares checked exceptions
                    throw new RuntimeException(e);
                } finally {
                    latch.countDown();
                }
            }));
        }

        // Wait for completion
        assertTrue(latch.await(60, TimeUnit.SECONDS));

        // Check for exceptions
        for (Future<?> future : futures) {
            future.get(); // Throws ExecutionException if any task failed
        }
    }

    @Test
    void testCacheNotCorrupted() throws Exception {
        // Upload document
        String jobId = "test-job";
        service.extractDocumentMetadata(testPdf, jobId);

        // Concurrent page requests
        ExecutorService executor = Executors.newFixedThreadPool(10);
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            int page = (i % 10) + 1;
            futures.add(executor.submit(() -> service.extractSinglePage(jobId, page)));
        }

        // All should succeed without ConcurrentModificationException
        for (Future<byte[]> future : futures) {
            assertNotNull(future.get());
        }
    }
}
```
### 2. Memory Leak Tests
```java
@Test
void testCacheDoesNotGrowUnbounded() throws Exception {
    long initialHeap = getHeapUsage();

    // Process 10,000 small PDFs with Type3 fonts
    for (int i = 0; i < 10_000; i++) {
        service.convertPdfToJson(createTestPdfWithType3Fonts());
    }

    // Encourage a GC before measuring
    System.gc();
    Thread.sleep(1000);

    long finalHeap = getHeapUsage();
    long growth = finalHeap - initialHeap;

    // Caches should not grow beyond a reasonable limit
    assertThat(growth).isLessThan(100_000_000); // 100MB max
}

@Test
void testThreadsNotLeaking() {
    int initialThreads = getActiveThreadCount();

    // Upload 100 PDFs (currently spawns 100 cleanup threads)
    for (int i = 0; i < 100; i++) {
        service.extractDocumentMetadata(testPdf, "job-" + i);
    }

    int peakThreads = getActiveThreadCount();

    // Should not create 100+ threads
    assertThat(peakThreads - initialThreads).isLessThan(10);
}

private long getHeapUsage() {
    Runtime runtime = Runtime.getRuntime();
    return runtime.totalMemory() - runtime.freeMemory();
}

private int getActiveThreadCount() {
    return Thread.getAllStackTraces().size();
}
```
### 3. Security Tests
```java
@Test
void testJobIdIsolation() {
    // User A uploads a PDF
    String jobIdA = service.extractDocumentMetadata(userAPdf, sessionA);

    // User B tries to access User A's jobId
    assertThrows(AccessDeniedException.class, () ->
            service.extractSinglePage(jobIdA, 1, sessionB));
}

@Test
void testJobIdUnpredictable() {
    List<String> jobIds = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
        jobIds.add(service.extractDocumentMetadata(testPdf, session));
    }

    // All 1000 jobIds should be unique UUIDs
    assertThat(new HashSet<>(jobIds)).hasSize(1000);

    // Should not be sequential: issuance order must not already be sorted
    List<String> sorted = new ArrayList<>(jobIds);
    Collections.sort(sorted);
    assertThat(sorted).isNotEqualTo(jobIds);
}
```
### 4. Performance Tests
```java
@Test
void testLargeFilePerformance() {
    // 100MB PDF
    byte[] largePdf = createLargePdf(100 * 1024 * 1024);

    long start = System.currentTimeMillis();
    String json = service.convertPdfToJson(largePdf);
    long duration = System.currentTimeMillis() - start;

    // Should complete in reasonable time
    assertThat(duration).isLessThan(30_000); // 30 seconds
}

@Test
void testConcurrentThroughput() throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(50);
    CountDownLatch latch = new CountDownLatch(500);

    long start = System.currentTimeMillis();
    for (int i = 0; i < 500; i++) {
        executor.submit(() -> {
            try {
                service.convertPdfToJson(testPdf);
            } finally {
                latch.countDown();
            }
        });
    }
    latch.await();
    long duration = System.currentTimeMillis() - start;

    // 500 conversions should complete in reasonable time
    double throughput = 500.0 / (duration / 1000.0);
    assertThat(throughput).isGreaterThan(10); // At least 10 conversions/sec
}
```
---
## MONITORING & METRICS
### Recommended Metrics (Micrometer)
```java
@Service
public class PdfJsonConversionService {

    private final MeterRegistry meterRegistry;

    // Cache size gauges
    @PostConstruct
    void registerMetrics() {
        Gauge.builder("pdf.cache.document.size", documentCache, Map::size)
                .description("Number of cached documents")
                .register(meterRegistry);
        Gauge.builder("pdf.cache.type3font.size", type3NormalizedFontCache, Map::size)
                .description("Number of cached Type3 fonts")
                .register(meterRegistry);
        Gauge.builder("pdf.cache.coverage.size", type3GlyphCoverageCache, Map::size)
                .description("Number of cached glyph coverage sets")
                .register(meterRegistry);
        Gauge.builder("pdf.threads.cleanup", this::getCleanupThreadCount)
                .description("Active cleanup threads")
                .register(meterRegistry);
    }

    // Operation timers
    public String convertPdfToJson(byte[] pdfBytes) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            String result = doConvertPdfToJson(pdfBytes);
            sample.stop(meterRegistry.timer("pdf.convert.toJson"));
            return result;
        } catch (Exception e) {
            meterRegistry.counter("pdf.convert.errors", "operation", "toJson").increment();
            throw e;
        }
    }

    // Cache hit/miss counters
    private PDFont getCachedType3Font(String uid) {
        PDFont cached = type3NormalizedFontCache.get(uid);
        if (cached != null) {
            meterRegistry.counter("pdf.cache.type3font.hits").increment();
        } else {
            meterRegistry.counter("pdf.cache.type3font.misses").increment();
        }
        return cached;
    }
}
```
### Alerts
```yaml
alerts:
  # Cache growth
  - name: DocumentCacheTooLarge
    condition: pdf_cache_document_size > 100
    severity: warning
  - name: Type3CacheTooLarge
    condition: pdf_cache_type3font_size > 1000
    severity: warning

  # Thread leaks
  - name: TooManyCleanupThreads
    condition: pdf_threads_cleanup > 10
    severity: critical

  # Memory pressure
  - name: HeapUsageHigh
    condition: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
    severity: warning

  # Performance
  - name: SlowConversions
    condition: pdf_convert_toJson{quantile="0.95"} > 10s
    severity: warning

  # Error rate
  - name: HighErrorRate
    condition: rate(pdf_convert_errors[5m]) > 0.1
    severity: critical
```
---
## CONCLUSION
The PDF JSON editor has **CRITICAL** issues that must be fixed before production deployment:
### Must-Fix Issues (Blocks Production):
1. **pageFontResources broken** - Lazy page loading completely broken (PDFont instances as keys)
2. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError
### Should-Fix Issues (Prevents Scale):
3. **Unbounded cache growth** - Memory leaks require server restarts
4. **Type3 cache races** - Wasted CPU doing duplicate work
5. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet)
### Performance Improvements (Nice-to-Have):
6. Full PDF reload per page
7. Large Base64 operations
8. Synchronous file I/O
### Code Quality Issues:
9. PdfLazyLoadingService dead code
11. Singleton architecture - document thread-safety requirements
**Estimated Effort:**
- Critical fixes: 1-2 weeks (Issues #1, #2)
- High priority: 1 week (Issues #3, #4)
- Performance: 2-3 weeks (Issues #6-8)
- **Total:** 4-6 weeks for complete remediation
**Recommendation:** Fix critical issues (#1, #2) immediately, then address high priority issues before beta testing.