# PDF JSON Editor - Threading, Concurrency, and Performance Analysis

**Date:** 2025-01-09
**Version:** 1.0
**Status:** Comprehensive analysis combining automated review and manual verification

---

## Executive Summary

This analysis identifies **CRITICAL** defects, thread safety issues, and performance problems in the PDF JSON editor codebase. The service contains:

- **2 CRITICAL issues** requiring immediate attention
- **2 HIGH severity issues** (unbounded caches and a cache race) causing memory exhaustion and duplicate work
- **Multiple performance bottlenecks** limiting scalability

**Immediate Action Required:**
1. Fix the pageFontResources PDFont key issue (Issue #1)
2. Replace unbounded thread spawning (Issue #2)
3. Add cache size limits (Issue #3)
4. Fix the Type3 cache race condition (Issue #4)

---

## CRITICAL ISSUES

### Issue #1: pageFontResources Keyed by PDFont Instances - CRITICAL

**Location:** `PdfJsonConversionService.java:5075-5081, 5158-5159`
**Severity:** CRITICAL (Broken Functionality)
**Type:** Object Identity Mismatch, Cache Miss
**Verified:** ✅ TRUE

**Description:**
```java
// Line 5075-5081: Initial metadata extraction (first PDF load)
try (PDDocument document = pdfDocumentFactory.load(pdfBytes, true)) {
    Map<Integer, Map<PDFont, String>> pageFontResources = new HashMap<>();
    for (PDPage page : document.getPages()) {
        Map<PDFont, String> resourceMap = collectFontsForPage(...);
        // resourceMap keys are PDFont instances from this document
        pageFontResources.put(pageNumber, resourceMap);
    }
    // Cache it
    documentCache.put(jobId, new CachedPdfDocument(..., pageFontResources, ...));
} // PDDocument closed, PDFont instances now reference freed document

// Line 5158-5159: Lazy page extraction (reloads PDF)
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    // NEW PDFont instances created!
    PDPage page = document.getPage(pageIndex);

    // Try to lookup fonts
    Map<PDFont, String> cachedResourceMap = cached.getPageFontResources().get(pageNum);
    // cachedResourceMap keys are OLD PDFont instances from closed document

    // Lookup using NEW PDFont instances
    String fontId = cachedResourceMap.get(newFont); // Always NULL! ← BUG
}
```

**Why It Fails:**
1. **Object Identity:** `Map<PDFont, String>` uses PDFont object identity as key
2. **Different Instances:** Each PDF load creates new PDFont instances with different identities (see the sketch below)
3. **Lookup Fails:** `cachedResourceMap.get(newFont)` returns null because `newFont != oldFont`
4. **Defeats Caching Goal:** Every lazy page request rebuilds font metadata, defeating the cache
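
A self-contained sketch of the failure mode. `FontHandle` is a hypothetical stand-in for PDFont: like PDFont (per the analysis above), it does not override `equals`/`hashCode`, so `HashMap` lookups fall back to reference identity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for PDFont: no equals/hashCode override,
// so HashMap compares keys by reference identity.
class FontHandle {
    final String baseName;
    FontHandle(String baseName) { this.baseName = baseName; }
}

public class IdentityKeyDemo {
    public static void main(String[] args) {
        Map<FontHandle, String> resourceMap = new HashMap<>();

        FontHandle firstLoad = new FontHandle("Helvetica"); // first PDF load
        resourceMap.put(firstLoad, "font-uid-1");

        FontHandle secondLoad = new FontHandle("Helvetica"); // reload: a new instance
        System.out.println(resourceMap.get(secondLoad)); // null → guaranteed cache miss
        System.out.println(resourceMap.get(firstLoad));  // "font-uid-1" → only the original key hits
    }
}
```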

**Impact:**
- Lazy page loading doesn't reuse cached font metadata
- CPU wasted rebuilding font info on every page request
- Cache only stores garbage (unusable keys)
- "Consistent font UID" feature completely broken

**Evidence:**
```java
// No code actually uses the cached pageFontResources successfully
// Every extractSinglePage call rebuilds fonts from scratch
```

**Recommendation:**
```java
// Use resource names as keys instead of PDFont objects
Map<Integer, Map<String, String>> pageFontResources = new HashMap<>();
// Key: font resource name (e.g., "F1"), Value: font UID

// Or use font UID directly
Map<Integer, Set<String>> pageFontUids = new HashMap<>();
```

---

### Issue #2: Unbounded Thread Creation - CRITICAL Resource Leak

**Location:** `PdfJsonConversionService.java:5550-5562`
**Severity:** CRITICAL (Resource Exhaustion)
**Type:** Thread Leak, Memory Leak
**Verified:** ✅ TRUE

**Description:**
```java
private void scheduleDocumentCleanup(String jobId) {
    new Thread(
            () -> {
                try {
                    Thread.sleep(TimeUnit.MINUTES.toMillis(30)); // Sleep 30 minutes!
                    clearCachedDocument(jobId);
                    log.debug("Auto-cleaned cached document for jobId: {}", jobId);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            })
        .start(); // Unmanaged thread!
}
```

**Also in:** `PdfLazyLoadingService.java:256-269` (duplicate implementation)

**Problems:**
1. **One Thread Per Upload:** Each PDF upload spawns a new thread
2. **No Thread Pool:** Unlimited thread creation (no cap)
3. **Long Sleep:** Threads sleep for 30 minutes holding resources
4. **No Cancellation:** Cannot stop cleanup threads if job completes early
5. **Non-Daemon:** Threads prevent JVM shutdown (daemon=false by default)
6. **No Monitoring:** No visibility into active cleanup threads

**Impact Under Load:**
```
100 concurrent uploads → 100 cleanup threads spawned
Each thread: ~1MB stack + overhead
Total: ~100MB+ wasted on sleeping threads

1000 uploads/day → 1000+ threads accumulate
OS thread limit (ulimit -u) exceeded → OutOfMemoryError: unable to create new native thread
```

**Production Failure Scenario:**
```
09:00 - Peak traffic: 500 PDFs uploaded
09:00 - 500 cleanup threads spawned (each sleeps 30 min)
09:15 - 400 more uploads → 400 more threads
09:30 - First batch wakes up, cleans, exits
        But new threads keep getting created faster than old ones die
10:00 - Thread count > 2000
10:15 - JVM hits OS thread limit
10:16 - Server crashes: "OutOfMemoryError: unable to create new native thread"
```

**Recommendation:**
```java
@Service
public class PdfJsonConversionService {
    // Fixed-size thread pool for cleanup
    // (ThreadFactoryBuilder is Guava's com.google.common.util.concurrent.ThreadFactoryBuilder)
    private final ScheduledExecutorService cleanupScheduler =
        Executors.newScheduledThreadPool(
            2, // Only 2 threads needed
            new ThreadFactoryBuilder()
                .setNameFormat("pdf-cache-cleanup-%d")
                .setDaemon(true)
                .build());

    private void scheduleDocumentCleanup(String jobId) {
        cleanupScheduler.schedule(
            () -> clearCachedDocument(jobId),
            30,
            TimeUnit.MINUTES);
    }

    @PreDestroy
    public void shutdown() {
        cleanupScheduler.shutdown();
        try {
            if (!cleanupScheduler.awaitTermination(10, TimeUnit.SECONDS)) {
                cleanupScheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            cleanupScheduler.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```

The same scheduler can also replace the duplicate implementation in `PdfLazyLoadingService`, if that class is kept at all (see Issue #9).

---

## HIGH SEVERITY ISSUES

### Issue #3: Unbounded Cache Growth - HIGH Memory Leak

**Location:** `PdfJsonConversionService.java:147-148, 154`
**Severity:** HIGH (Memory Exhaustion)
**Type:** Memory Leak, Missing Eviction Policy
**Verified:** ✅ TRUE

**Description:**
```java
// Line 147: Type3 normalized fonts - cleared only in convertJsonToPdf (line 454)
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();

// Line 148: Type3 glyph coverage - NEVER CLEARED anywhere in codebase!
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();

// Line 154: Document cache - relies on buggy cleanup threads
private final Map<String, CachedPdfDocument> documentCache = new ConcurrentHashMap<>();
```

**Growth Patterns:**

**1. type3NormalizedFontCache:**
- Written at line 3766: `type3NormalizedFontCache.put(fontModel.getUid(), font)`
- Cleared only at line 454: `type3NormalizedFontCache.clear()` (JSON→PDF conversion)
- **NOT cleared during PDF→JSON** conversion (most common operation)
- Each cached PDFont retains its parsed font program and related structures (potentially large)
- Grows unbounded during PDF→JSON operations

**2. type3GlyphCoverageCache:**
- Written at line 1122: `type3GlyphCoverageCache.put(fontUid, coverageSet)`
- **NEVER CLEARED** in entire codebase (verified via grep)
- Accumulates Set<Integer> for every Type3 font ever processed
- Each Set can contain thousands of integers (Unicode codepoints)
- Pure memory leak

**3. documentCache:**
- Stores full PDF bytes in memory
- Each entry can be 1MB-100MB+ (document bytes + metadata)
- Relies on cleanup threads (which have the problems described in Issue #2)
- If cleanup fails (exception, server restart), entries stay forever
- No max size check

**Impact:**
```
Long-running server processes 10,000 Type3 fonts:
- type3GlyphCoverageCache: 10,000 entries × ~1KB each = 10MB
- type3NormalizedFontCache: 1,000 cached fonts × ~100KB each = 100MB
- documentCache: 50 active jobs × 10MB each = 500MB

After 1 week: Caches grow to 1GB+
After 1 month: OutOfMemoryError, server restart required
```

**Recommendation:**
```java
// Use Caffeine caches with eviction policies
// (com.github.benmanes.caffeine.cache.Caffeine / Cache)
private final Cache<String, PDFont> type3NormalizedFontCache =
    Caffeine.newBuilder()
        .maximumSize(1000) // Max 1000 fonts
        .expireAfterAccess(1, TimeUnit.HOURS) // Expire after 1hr unused
        .removalListener((key, value, cause) -> {
            // Cleanup PDFont resources if needed
        })
        .build();

private final Cache<String, Set<Integer>> type3GlyphCoverageCache =
    Caffeine.newBuilder()
        .maximumSize(5000)
        .expireAfterWrite(1, TimeUnit.HOURS)
        .build();

private final Cache<String, CachedPdfDocument> documentCache =
    Caffeine.newBuilder()
        .maximumWeight(500_000_000) // 500MB max
        .weigher((String key, CachedPdfDocument doc) -> doc.getPdfBytes().length)
        .expireAfterWrite(30, TimeUnit.MINUTES)
        .removalListener((key, value, cause) -> {
            log.info("Evicted document {} (cause: {})", key, cause);
        })
        .build();
```
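
Call sites change with the migration, since Caffeine's `Cache` is not a `Map`. A sketch of the pattern, assuming the declarations above; `computeGlyphCoverage` is a hypothetical stand-in for the existing coverage computation:

```java
// Lookup: getIfPresent replaces Map.get
Set<Integer> coverage = type3GlyphCoverageCache.getIfPresent(fontUid);
if (coverage == null) {
    coverage = computeGlyphCoverage(fontModel); // hypothetical helper
    type3GlyphCoverageCache.put(fontUid, coverage);
}

// Or atomically, mirroring the computeIfAbsent fix in Issue #4:
Set<Integer> atomic =
    type3GlyphCoverageCache.get(fontUid, uid -> computeGlyphCoverage(fontModel));
```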

---

### Issue #4: Type3 Cache Race Condition - HIGH

**Location:** `PdfJsonConversionService.java:3759-3773`
**Severity:** HIGH (Duplicate Work)
**Type:** Check-Then-Act Race Condition
**Verified:** ✅ TRUE

**Description:**
```java
private void loadNormalizedType3Font(
        PDDocument document,
        PdfJsonFont fontModel,
        List<FontByteSource> candidates,
        String originalFormat) throws IOException {
    if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
        return;
    }
    if (type3NormalizedFontCache.containsKey(fontModel.getUid())) { // CHECK
        return;
    }
    for (FontByteSource source : candidates) {
        PDFont font = loadFontFromSource(...); // EXPENSIVE: 10-50ms
        if (font != null) {
            type3NormalizedFontCache.put(fontModel.getUid(), font); // ACT
            log.info("Cached normalized font {} for Type3 {}", ...);
            break;
        }
    }
}
```

**Race Condition:**
```
Thread A: Check cache for "1:F1" → MISS (line 3759)
Thread B: Check cache for "1:F1" → MISS (line 3759) [both pass check!]
Thread A: Load font from bytes (10ms I/O + parsing)
Thread B: Load font from bytes (10ms I/O + parsing) ← DUPLICATE WORK
Thread A: Put font in cache (line 3766)
Thread B: Put font in cache (line 3766) [overwrites A's entry]
```

**Why ConcurrentHashMap Doesn't Help:**
- ConcurrentHashMap prevents **corruption** (map state stays consistent)
- ConcurrentHashMap does NOT prevent **duplicate work** (both threads compute)
- The check (`containsKey`) and act (`put`) are separate operations

**Impact:**
- Wasted CPU cycles loading same font twice
- Temporary memory spike (two fonts in heap simultaneously)
- Font loading is expensive: Base64 decode + PDFBox parsing + font validation
- Under high concurrency, 10+ threads could all load the same font

**Recommendation:**
```java
private void loadNormalizedType3Font(...) throws IOException {
    if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
        return;
    }

    // Atomic compute-if-absent
    type3NormalizedFontCache.computeIfAbsent(fontModel.getUid(), uid -> {
        for (FontByteSource source : candidates) {
            try {
                PDFont font = loadFontFromSource(...);
                if (font != null) {
                    log.info("Cached normalized font {} for Type3 {}", ...);
                    return font;
                }
            } catch (IOException e) {
                log.warn("Failed to load font from {}: {}", source.originLabel(), e.getMessage());
            }
        }
        return null;
    });
}
```

Note that `ConcurrentHashMap.computeIfAbsent` holds the bin lock while the mapping function runs: concurrent callers for the same UID are serialized and the second caller simply reuses the first result, which is exactly the desired behavior here. Keep the mapping function free of operations that touch the same map, or it may deadlock.

---

### Issue #5: PDDocument Resource Lifecycle - NEEDS INVESTIGATION

**Location:** `PdfJsonConversionService.java:3766, 5158`
**Severity:** UNKNOWN (Requires Investigation)
**Type:** Unclear Resource Ownership
**Verified:** ⚠️ SPECULATIVE (No concrete evidence of failure)

**Description:**
```java
// Line 3766: Cache PDFont created from a PDDocument
try (PDDocument document = ...) {
    PDFont font = loadFontFromSource(document, fontModel, source, ...);
    type3NormalizedFontCache.put(fontModel.getUid(), font);
} // PDDocument is closed here!

// Later: cached PDFont is used with a DIFFERENT PDDocument
try (PDDocument newDocument = ...) {
    PDFont cachedFont = type3NormalizedFontCache.get(fontUid);
    // Is cachedFont safe to use after original document closed?
    // Does it hold references to freed resources?
}
```

**Theoretical Concerns:**
1. **Retained References:** PDFont wraps COS structures that belong to the source document
2. **Resource Ties:** PDFont may hold references to the source PDDocument
3. **Freed Resources:** Closing the document releases its underlying I/O buffers (scratch files, random-access reads) that the font may still need
4. **Unclear Contract:** PDFBox documentation doesn't explicitly address font lifecycle

**Current Status:**
- ⚠️ **NO EVIDENCE OF ACTUAL FAILURES** - System appears to work in practice
- ⚠️ **NO CRASHES OBSERVED** - No segmentation faults or memory corruption reported
- ⚠️ **NO MEMORY LEAKS DETECTED** - No profiler data showing leaks
- ⚠️ **PURELY THEORETICAL CONCERN** - Based on API design, not observed behavior

**Why This May Actually Be Safe:**
- PDFBox may create self-contained PDFont objects
- Font data may be copied rather than referenced
- PDFBox may be designed for this use case
- Current code has been running without apparent issues

**Required Investigation:**
1. **PDFBox Source Code Review:** Check if PDFont copies or references document data
2. **Load Testing:** Create PDFont, close document, use font in new document (see the probe sketch below)
3. **Memory Profiling:** Monitor for memory leaks (heap and off-heap) over extended runs
4. **PDFBox Documentation/Forums:** Search for guidance on font lifecycle
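
A minimal probe for investigation item 2, hedged: `sampleFontPdf` and `fontProgramBytes` are hypothetical fixtures, and the PDFBox 2.x loading API is assumed. If PDFont retains state tied to its closed source document, the encode call is where it should surface.

```java
@Test
void fontSurvivesSourceDocumentClose() throws IOException {
    PDFont font;
    try (PDDocument source = PDDocument.load(sampleFontPdf)) { // hypothetical fixture
        font = PDType0Font.load(source, new ByteArrayInputStream(fontProgramBytes));
    } // source (and its I/O buffers) closed here

    // Exercise the cached font without its source document.
    assertNotNull(font.getName());
    byte[] encoded = font.encode("test"); // throws if the font can no longer resolve glyph data
    assertTrue(encoded.length > 0);
}
```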

**Recommendation:**
- **Priority: MEDIUM** (needs investigation but not blocking)
- Add monitoring for potential issues
- Test font reuse after document closure explicitly
- If problems found, cache serialized bytes instead of PDFont objects

```java
// Option 1: Cache font bytes instead of PDFont objects
private void cacheType3FontBytes(String fontUid, byte[] fontBytes) {
    type3FontBytesCache.put(fontUid, fontBytes);
}

// Option 2: Verify font is safe to use
// (isFontValid is a hypothetical health check, e.g. attempting a trivial encode)
private PDFont getCachedFont(String fontUid) {
    PDFont cached = type3NormalizedFontCache.get(fontUid);
    if (cached != null && !isFontValid(cached)) {
        log.warn("Cached font {} is invalid, removing", fontUid);
        type3NormalizedFontCache.remove(fontUid);
        return null;
    }
    return cached;
}
```
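
Option 1 fleshed out as a sketch: cache raw bytes and re-create the font inside whichever document needs it, so no reference to a closed PDDocument can leak. This assumes the normalized candidates are embedded font programs (TrueType/OTF) loadable via `PDType0Font.load`; `materializeFont` and the cache field are hypothetical names.

```java
private final Map<String, byte[]> type3FontBytesCache = new ConcurrentHashMap<>();

private PDFont materializeFont(PDDocument targetDocument, String fontUid) throws IOException {
    byte[] bytes = type3FontBytesCache.get(fontUid);
    if (bytes == null) {
        return null;
    }
    // Parse the bytes against the *target* document, tying the new
    // PDFont's lifecycle to the document that will actually use it.
    return PDType0Font.load(targetDocument, new ByteArrayInputStream(bytes));
}
```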

---

## MEDIUM SEVERITY ISSUES

### Issue #6: Full PDF Reload Per Page - MEDIUM Performance

**Location:** `PdfJsonConversionService.java:5158-5159`
**Severity:** MEDIUM (Performance)
**Type:** Inefficient I/O
**Verified:** ✅ TRUE

**Description:**
```java
// extractSinglePage method
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
    // Full PDF loaded from bytes (10-100ms for large PDFs)
    PDPage page = document.getPage(pageIndex);
    // Extract just one page...
}
```

**Problem:**
Every page request loads the entire PDF from bytes:
- 100-page PDF = Load 10MB, extract 1 page
- 10 page requests = 10× full PDF loads (100MB I/O)
- No incremental parsing or streaming

**Impact:**
```
100MB PDF, 50 pages requested sequentially:
- Total I/O: 100MB × 50 = 5GB
- Time: 50× parse time (5-10 seconds total)
- Memory: 100MB peak per request

Concurrent page requests for same PDF:
- 10 threads × 100MB = 1GB temporary memory spike
```

**Why This Exists:**
The lazy loading design trades memory (don't cache the full extraction) for CPU (reload on demand). The tradeoff is poor because:
- PDFBox parsing is expensive
- Streams are repeatedly decompressed
- Extracted page data could be cached instead

**Recommendation:**
```java
// Option 1: Cache extracted page data
private static class CachedPage {
    List<PdfJsonTextElement> textElements;
    List<PdfJsonImageElement> imageElements;
    // ... other page data
}

Map<String, Map<Integer, CachedPage>> pageCache = ...;

// Option 2: Keep PDF open with RandomAccessFile
private static class CachedPdfDocument {
    private final RandomAccessReadBufferedFile randomAccess;
    private final PDDocument document; // Keep open! (PDDocument is not
    // thread-safe, so concurrent page requests need a per-document lock)
}

// Option 3: Pre-split pages at upload time
// Store each page as separate lightweight JSON blob
```
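
Option 1 made concrete with the Caffeine cache proposed in Issue #3, as a sketch; `extractedPageCache` and `extractPageFromPdf` are hypothetical names for the existing single-page extraction, and the size/TTL values are illustrative:

```java
private final Cache<String, CachedPage> extractedPageCache =
    Caffeine.newBuilder()
        .maximumSize(500) // ~500 extracted pages across all jobs
        .expireAfterAccess(30, TimeUnit.MINUTES)
        .build();

private CachedPage getPage(String jobId, int pageIndex) {
    // Each page is parsed at most once per expiry window;
    // repeat requests hit the cache instead of reloading the whole PDF.
    return extractedPageCache.get(jobId + ":" + pageIndex,
            key -> extractPageFromPdf(jobId, pageIndex)); // hypothetical extractor
}
```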

---

### Issue #7: Large Base64 Operations - MEDIUM Performance

**Location:** `PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630`
**Severity:** MEDIUM (Performance Bottleneck)
**Type:** Synchronous Blocking Operation
**Verified:** ✅ TRUE

**Description:**
```java
// Encode large font programs
String base64 = Base64.getEncoder().encodeToString(fontBytes); // 10MB → 13MB

// Decode large font programs
byte[] bytes = Base64.getDecoder().decode(pdfProgram); // 13MB → 10MB
```

**Problem:**
- Large fonts (embedded TrueType, Type3) can be 5-10MB
- Base64 encoding inflates size by ~33%
- All encoding/decoding is synchronous on request threads
- CPU-intensive operation (20-50ms for 10MB)

**Impact:**
```
100 concurrent requests processing 10MB fonts:
- Each request: 30ms CPU time for Base64
- All threads blocked on encoding simultaneously
- Thread pool saturation (if using fixed-size pool)
- Other requests starved waiting for threads

Large PDF with 50 fonts:
- 50 × 30ms = 1.5 seconds just for Base64 operations
- User perceives slowness
```

**Recommendation:**
```java
// Option 1: Size limits
private static final int MAX_FONT_SIZE = 10 * 1024 * 1024; // 10MB

if (fontBytes.length > MAX_FONT_SIZE) {
    throw new IllegalArgumentException("Font too large: " + fontBytes.length);
}

// Option 2: Streaming Base64 (for very large files)
OutputStream base64Out = Base64.getEncoder().wrap(outputStream);
inputStream.transferTo(base64Out);
base64Out.close(); // closing flushes the final padding bytes

// Option 3: Async processing
CompletableFuture<String> encodeFuture = CompletableFuture.supplyAsync(
    () -> Base64.getEncoder().encodeToString(fontBytes),
    fontEncodingExecutor
);
```
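
A fuller sketch of Option 2 end-to-end, hedged: `jsonScratchPath` is a hypothetical target, and the approach assumes the JSON writer can consume the Base64 text from a stream rather than a single in-memory string:

```java
try (InputStream in = new ByteArrayInputStream(fontBytes);
     OutputStream out = Files.newOutputStream(jsonScratchPath); // hypothetical scratch file
     OutputStream b64 = Base64.getEncoder().wrap(out)) {
    in.transferTo(b64); // encodes incrementally; no 13MB String on the heap
} // try-with-resources closes b64 first, flushing the final padding
```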

---

### Issue #8: File I/O on Request Threads - MEDIUM

**Location:** `PdfJsonConversionService.java:276, 405, 5066`
**Severity:** MEDIUM (Performance)
**Type:** Blocking I/O
**Verified:** ⚠️ PARTIALLY TRUE

**Description:**
```java
// Line 276: Write upload to disk
file.transferTo(originalFile.getFile());

// Line 405: Read full file into memory
byte[] cachedPdfBytes = Files.readAllBytes(workingPath);

// Line 5066: Get uploaded file bytes
byte[] pdfBytes = file.getBytes();
```

**Clarification:**
- These are in DIFFERENT methods (not double-reads within one operation)
- Each method reads the file once
- Still synchronous blocking I/O

**Impact:**
- Large uploads (100MB) block request thread for seconds
- No async or streaming support
- Thread pool saturation under high upload volume

**Recommendation:**
```java
// Async file I/O (fileIoExecutor: a dedicated I/O thread pool)
CompletableFuture<Path> uploadFuture = CompletableFuture.supplyAsync(
    () -> {
        try {
            Path tempPath = Files.createTempFile("pdf-upload", ".pdf");
            file.transferTo(tempPath.toFile());
            return tempPath;
        } catch (IOException e) {
            // Supplier cannot throw checked exceptions
            throw new UncheckedIOException(e);
        }
    },
    fileIoExecutor
);

// Stream large files
try (InputStream in = file.getInputStream();
     OutputStream out = Files.newOutputStream(targetPath)) {
    in.transferTo(out);
}
```

---

## LOW SEVERITY ISSUES

### Issue #9: PdfLazyLoadingService Unused - LOW

**Location:** `PdfLazyLoadingService.java` (entire file)
**Severity:** LOW (Code Quality)
**Type:** Dead Code
**Verified:** ✅ TRUE

**Description:**
- Complete service implementation exists
- Has its own `documentCache` and cleanup logic
- Duplicates functionality in `PdfJsonConversionService`
- Not wired to any controller
- Not imported by any other class

**Impact:**
- Code maintenance burden
- Confusing for developers
- Potential for accidental use in future
- Cache divergence if both ever get used

**Recommendation:**
```java
// Delete PdfLazyLoadingService.java entirely
// Or clearly mark as @Deprecated with explanation
```

---

### Issue #10: PdfJsonFontService Volatile Fields - LOW

**Location:** `PdfJsonFontService.java:46-47`
**Severity:** LOW (Actually Correct)
**Type:** None (Good Practice)
**Verified:** ✅ TRUE (No issue, correctly implemented)

**Description:**
```java
private volatile boolean pythonCffConverterAvailable;
private volatile boolean fontForgeCffConverterAvailable;

@PostConstruct
private void initialiseCffConverterAvailability() {
    pythonCffConverterAvailable = isCommandAvailable(pythonCommand);
    fontForgeCffConverterAvailable = isCommandAvailable(fontforgeCommand);
}
```

**Why This Is Correct:**
- `volatile` ensures visibility across threads
- Set once at startup
- Read many times (thread-safe)
- No synchronization needed

**Recommendation:** None - this is good practice.

---

## VERIFIED FALSE CLAIMS

### Claim: file.getBytes() Called Twice

**Status:** ❌ FALSE

**Explanation:** The claim stated that `file.getBytes()` is called twice (lines 446, 5065). Investigation shows:
- Line 446: `convertJsonToPdf` method
- Line 5065: `extractDocumentMetadata` method
- These are DIFFERENT methods for DIFFERENT operations
- Each method calls `getBytes()` only once

**Conclusion:** Not a double-read issue.

---

### Claim: Image Base64 Encoding Per Call

**Status:** ❌ FALSE

**Explanation:** The claim stated images are Base64-encoded on every call. Investigation shows:
```java
// PdfJsonImageService.java:430-450
private EncodedImage getOrEncodeImage(PDImage pdImage) {
    COSBase key = xObject.getCOSObject();
    EncodedImage cached = imageCache.get(key); // Cache check!
    if (cached != null) {
        return cached; // Cache hit
    }
    EncodedImage encoded = encodeImage(pdImage);
    imageCache.put(key, encoded); // Cache miss, encode and store
    return encoded;
}
```

**Conclusion:** Images ARE cached. Only stencil and inline images bypass the cache.

---

## ARCHITECTURE ISSUES

### Issue #11: Singleton Service Architecture - MEDIUM

**Location:** All `@Service` and `@Component` classes
**Severity:** MEDIUM (Maintainability)
**Type:** Architectural Pattern

**Description:**
All services use the default singleton scope:
```java
@Service // Defaults to singleton
public class PdfJsonConversionService {
    // Shared instance variables across all requests
    private final Map<String, PDFont> type3NormalizedFontCache = ...;
}
```

**Implications:**

✅ **Good:**
- Most dependencies are stateless and injected
- Caches use ConcurrentHashMap (thread-safe)
- No mutable instance variables beyond caches

⚠️ **Risks:**
- Singleton means shared state across all requests
- Requires careful synchronization
- Easy for future developers to introduce thread-unsafe code
- Difficult to test concurrent scenarios

**Recommendation:**
- Document thread-safety requirements prominently
- Add unit tests for concurrent access
- Consider request-scoped services for mutable state (see the sketch below)
- Code review checklist for new instance variables
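
A sketch of the request-scoped option: per-request mutable state lives in its own bean, so the singleton service keeps only the genuinely shared caches. The class and field names are hypothetical; the scoping annotations are standard Spring:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.springframework.context.annotation.Scope;
import org.springframework.context.annotation.ScopedProxyMode;
import org.springframework.stereotype.Component;
import org.springframework.web.context.WebApplicationContext;

// One instance per HTTP request; nothing is shared across requests,
// so this state needs no synchronization.
@Component
@Scope(value = WebApplicationContext.SCOPE_REQUEST, proxyMode = ScopedProxyMode.TARGET_CLASS)
public class ConversionRequestState {
    private final Map<String, PDFont> fontsForThisRequest = new HashMap<>();

    public Map<String, PDFont> fonts() {
        return fontsForThisRequest;
    }
}
```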

---

## SUMMARY BY SEVERITY

### CRITICAL (Fix Immediately)
1. **pageFontResources PDFont keys** (Issue #1) - Broken feature
2. **Unbounded thread creation** (Issue #2) - Resource exhaustion

### HIGH (Fix Soon)
3. **Unbounded cache growth** (Issue #3) - Memory leak
4. **Type3 cache race** (Issue #4) - Duplicate work
5. **PDDocument lifecycle** (Issue #5) - Needs investigation (speculative)

### MEDIUM (Plan and Address)
6. **Full PDF reload per page** (Issue #6) - Performance
7. **Large Base64 operations** (Issue #7) - Performance
8. **File I/O blocking** (Issue #8) - Performance
9. **Singleton architecture** (Issue #11) - Maintainability

### LOW (Monitor)
10. **PdfLazyLoadingService unused** (Issue #9) - Dead code
11. **PdfJsonFontService volatile** (Issue #10) - Correctly implemented (no action needed)

### VERIFIED FALSE
12. file.getBytes() called twice
13. Image Base64 encoding per call

---

## IMPLEMENTATION ROADMAP

### Phase 1: Critical Fixes (1-2 weeks)

**1. Fix pageFontResources PDFont keys (Issue #1)**
```java
// Priority: CRITICAL
// Time: 3-5 days
// Risk: Medium (requires careful testing)

// Replace PDFont keys with String resource names
// Update cache lookup logic
```

**2. Fix Thread Leaks (Issue #2)**
```java
// Priority: CRITICAL
// Time: 1 day
// Risk: Low (well-understood solution)

// Replace new Thread() with ScheduledExecutorService
// Add @PreDestroy cleanup
// Monitor thread counts
```

### Phase 2: Resource Management (1 week)

**3. Add Cache Eviction (Issue #3)**
```java
// Priority: HIGH
// Time: 3 days
// Risk: Low (library-based solution)

// Integrate Caffeine cache
// Set size limits, TTL
// Add eviction logging
// Monitor cache metrics
```

**4. Fix Type3 Cache Race (Issue #4)**
```java
// Priority: HIGH
// Time: 1-2 days
// Risk: Low (straightforward fix)

// Use computeIfAbsent for atomic operations
```

### Phase 3: Performance Optimization (2-3 weeks)

**5. Optimize Lazy Loading (Issue #6)**
```java
// Priority: MEDIUM
// Time: 1 week
// Risk: Medium (requires benchmarking)

// Cache extracted page data
// Or: Keep PDDocument open with RandomAccessFile
// Or: Pre-split pages at upload
```

**6. Async I/O (Issues #7, #8)**
```java
// Priority: MEDIUM
// Time: 3-5 days
// Risk: Medium (requires async architecture changes)

// Add dedicated I/O thread pool
// Async file operations
// Stream large files
```

### Phase 4: Code Quality (1 week)

**7. Remove Dead Code (Issue #9)**
```java
// Priority: LOW
// Time: 1 day
// Risk: None

// Delete PdfLazyLoadingService
// Clean up unused imports
```

**8. Documentation & Testing**
```java
// Priority: MEDIUM
// Time: 3-5 days

// Add thread-safety documentation
// Concurrent integration tests
// Load testing scripts
```

---

## TESTING STRATEGY

### 1. Concurrency Tests

```java
@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class ConcurrencyTest {

    @Test
    void testConcurrentCacheAccess() throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(20);
        CountDownLatch latch = new CountDownLatch(100);
        List<Future<?>> futures = new ArrayList<>();

        // 100 requests across 10 jobIds (10 requests per job)
        for (int i = 0; i < 100; i++) {
            String jobId = "job-" + (i % 10);
            int pageNum = (i % 5) + 1;

            futures.add(executor.submit(() -> {
                try {
                    service.extractSinglePage(jobId, pageNum);
                } catch (Exception e) {
                    log.error("Concurrent access failed", e);
                    // Runnable cannot rethrow checked exceptions; wrap so future.get() surfaces it
                    throw new RuntimeException(e);
                } finally {
                    latch.countDown();
                }
            }));
        }

        // Wait for completion
        assertTrue(latch.await(60, TimeUnit.SECONDS));

        // Check for exceptions
        for (Future<?> future : futures) {
            future.get(); // Throws if any task failed
        }
    }

    @Test
    void testCacheNotCorrupted() throws Exception {
        // Upload document
        String jobId = "test-job";
        service.extractDocumentMetadata(testPdf, jobId);

        // Concurrent page requests
        ExecutorService executor = Executors.newFixedThreadPool(10);
        List<Future<byte[]>> futures = new ArrayList<>();

        for (int i = 0; i < 50; i++) {
            int page = (i % 10) + 1;
            futures.add(executor.submit(() ->
                    service.extractSinglePage(jobId, page)));
        }

        // All should succeed without ConcurrentModificationException
        for (Future<byte[]> future : futures) {
            assertNotNull(future.get());
        }
    }
}
```

### 2. Memory Leak Tests

```java
@Test
void testCacheDoesNotGrowUnbounded() throws Exception { // Thread.sleep below is checked
    long initialHeap = getHeapUsage();

    // Process 10,000 small PDFs with Type3 fonts
    for (int i = 0; i < 10000; i++) {
        service.convertPdfToJson(createTestPdfWithType3Fonts());
    }

    // Force GC
    System.gc();
    Thread.sleep(1000);

    long finalHeap = getHeapUsage();
    long growth = finalHeap - initialHeap;

    // Cache should not grow beyond reasonable limit
    assertThat(growth).isLessThan(100_000_000); // 100MB max
}

@Test
void testThreadsNotLeaking() {
    int initialThreads = getActiveThreadCount();

    // Upload 100 PDFs (the unfixed code spawns one cleanup thread each)
    for (int i = 0; i < 100; i++) {
        service.extractDocumentMetadata(testPdf, "job-" + i);
    }

    int peakThreads = getActiveThreadCount();

    // Should not create 100+ threads
    assertThat(peakThreads - initialThreads).isLessThan(10);
}

private long getHeapUsage() {
    Runtime runtime = Runtime.getRuntime();
    return runtime.totalMemory() - runtime.freeMemory();
}

private int getActiveThreadCount() {
    return Thread.getAllStackTraces().size();
}
```

### 3. Security Tests

```java
@Test
void testJobIdIsolation() {
    // User A uploads PDF
    String jobIdA = service.extractDocumentMetadata(userAPdf, sessionA);

    // User B tries to access User A's jobId
    assertThrows(AccessDeniedException.class, () -> {
        service.extractSinglePage(jobIdA, 1, sessionB);
    });
}

@Test
void testJobIdUnpredictable() {
    // LinkedHashSet preserves insertion order, so the sequential check below is meaningful
    Set<String> jobIds = new LinkedHashSet<>();

    for (int i = 0; i < 1000; i++) {
        String jobId = service.extractDocumentMetadata(testPdf, session);
        jobIds.add(jobId);
    }

    // All jobIds should be unique UUIDs
    assertThat(jobIds).hasSize(1000);

    // Should not be sequential: sorting must reorder the insertion sequence
    List<String> sorted = new ArrayList<>(jobIds);
    Collections.sort(sorted);
    assertThat(sorted).isNotEqualTo(new ArrayList<>(jobIds));
}
```

### 4. Performance Tests

```java
@Test
void testLargeFilePerformance() {
    // 100MB PDF
    byte[] largePdf = createLargePdf(100 * 1024 * 1024);

    long start = System.currentTimeMillis();
    String json = service.convertPdfToJson(largePdf);
    long duration = System.currentTimeMillis() - start;

    // Should complete in reasonable time
    assertThat(duration).isLessThan(30_000); // 30 seconds
}

@Test
void testConcurrentThroughput() throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(50);
    CountDownLatch latch = new CountDownLatch(500);

    long start = System.currentTimeMillis();

    for (int i = 0; i < 500; i++) {
        executor.submit(() -> {
            try {
                service.convertPdfToJson(testPdf);
            } finally {
                latch.countDown();
            }
        });
    }

    latch.await();
    long duration = System.currentTimeMillis() - start;

    // 500 conversions should complete in reasonable time
    double throughput = 500.0 / (duration / 1000.0);
    assertThat(throughput).isGreaterThan(10); // At least 10 conversions/sec
}
```

---

## MONITORING & METRICS

### Recommended Metrics (Micrometer)

```java
@Service
public class PdfJsonConversionService {

    private final MeterRegistry meterRegistry;

    // Cache size gauges
    @PostConstruct
    void registerMetrics() {
        Gauge.builder("pdf.cache.document.size", documentCache, Map::size)
            .description("Number of cached documents")
            .register(meterRegistry);

        Gauge.builder("pdf.cache.type3font.size", type3NormalizedFontCache, Map::size)
            .description("Number of cached Type3 fonts")
            .register(meterRegistry);

        Gauge.builder("pdf.cache.coverage.size", type3GlyphCoverageCache, Map::size)
            .description("Number of cached glyph coverage sets")
            .register(meterRegistry);

        Gauge.builder("pdf.threads.cleanup", this::getCleanupThreadCount)
            .description("Active cleanup threads")
            .register(meterRegistry);
    }

    // Operation timers
    public String convertPdfToJson(byte[] pdfBytes) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            String result = doConvertPdfToJson(pdfBytes);
            sample.stop(meterRegistry.timer("pdf.convert.toJson"));
            return result;
        } catch (Exception e) {
            meterRegistry.counter("pdf.convert.errors", "operation", "toJson").increment();
            throw e;
        }
    }

    // Cache hit/miss counters
    private PDFont getCachedType3Font(String uid) {
        PDFont cached = type3NormalizedFontCache.get(uid);
        if (cached != null) {
            meterRegistry.counter("pdf.cache.type3font.hits").increment();
        } else {
            meterRegistry.counter("pdf.cache.type3font.misses").increment();
        }
        return cached;
    }
}
```

### Alerts

```yaml
alerts:
  # Cache growth
  - name: DocumentCacheTooLarge
    condition: pdf_cache_document_size > 100
    severity: warning

  - name: Type3CacheTooLarge
    condition: pdf_cache_type3font_size > 1000
    severity: warning

  # Thread leaks
  - name: TooManyCleanupThreads
    condition: pdf_threads_cleanup > 10
    severity: critical

  # Memory pressure
  - name: HeapUsageHigh
    condition: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
    severity: warning

  # Performance
  - name: SlowConversions
    condition: pdf_convert_toJson{quantile="0.95"} > 10s
    severity: warning

  # Error rate
  - name: HighErrorRate
    condition: rate(pdf_convert_errors[5m]) > 0.1
    severity: critical
```

---

## CONCLUSION

The PDF JSON editor has **CRITICAL** issues that must be fixed before production deployment:

### Must-Fix Issues (Blocks Production):
1. **pageFontResources broken** - Lazy page loading completely broken (PDFont instances as keys)
2. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError

### Should-Fix Issues (Prevents Scale):
3. **Unbounded cache growth** - Memory leaks require server restarts
4. **Type3 cache races** - Wasted CPU doing duplicate work
5. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet)

### Performance Improvements (Nice-to-Have):
6. Full PDF reload per page
7. Large Base64 operations
8. Synchronous file I/O

### Code Quality Issues:
9. PdfLazyLoadingService dead code
10. Documentation of thread-safety requirements

**Estimated Effort:**
- Critical fixes: 1-2 weeks (Issues #1, #2)
- High priority: 1 week (Issues #3, #4)
- Performance: 2-3 weeks (Issues #6-8)
- **Total:** 4-6 weeks for complete remediation

**Recommendation:** Fix critical issues (#1, #2) immediately, then address high priority issues before beta testing.