mirror of
https://github.com/Frooodle/Stirling-PDF.git
synced 2025-11-16 01:21:16 +01:00
1432 lines
42 KiB
Markdown
1432 lines
42 KiB
Markdown
# PDF JSON Editor - Threading, Concurrency, and Performance Analysis
|
||
|
||
**Date:** 2025-01-09
|
||
**Version:** 1.0
|
||
**Status:** Comprehensive analysis combining automated review and manual verification
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
This analysis identifies **CRITICAL** security vulnerabilities, thread safety issues, and performance problems in the PDF JSON editor codebase. The service contains:
|
||
|
||
- **1 CRITICAL security vulnerability** (cache poisoning/information disclosure)
|
||
- **3 CRITICAL threading issues** causing data corruption
|
||
- **2 HIGH severity resource leaks** causing memory exhaustion
|
||
- **Multiple performance bottlenecks** limiting scalability
|
||
|
||
**Immediate Action Required:**
|
||
1. Fix user-supplied jobId security vulnerability (Issue #1)
|
||
2. Fix cache mutation race conditions (Issues #2, #3, #4)
|
||
3. Replace unbounded thread spawning (Issue #5)
|
||
4. Add cache size limits (Issue #6)
|
||
|
||
---
|
||
|
||
## CRITICAL ISSUES
|
||
|
||
### Issue #1: User-Supplied jobId - CRITICAL SECURITY VULNERABILITY ⚠️
|
||
|
||
**Location:** `ConvertPdfJsonController.java:95, 120, 148, 160`
|
||
|
||
**Severity:** CRITICAL (Security Vulnerability)
|
||
|
||
**Type:** Cache Poisoning, Information Disclosure, Access Control Bypass
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 95
|
||
public ResponseEntity<byte[]> extractPdfMetadata(
|
||
@ModelAttribute PDFFile request,
|
||
@RequestParam(required = true) String jobId) // User-controlled!
|
||
|
||
// Lines 120, 148, 160
|
||
@PathVariable String jobId // User-controlled!
|
||
```
|
||
|
||
**Security Issues:**
|
||
1. **Information Disclosure:** Users can guess/enumerate jobIds to access other users' cached PDFs
|
||
2. **Cache Poisoning:** Users can overwrite other users' cache entries by supplying the same jobId
|
||
3. **No Authentication/Authorization:** No check to verify jobId ownership
|
||
4. **Predictable IDs:** If users choose sequential IDs ("job1", "job2"), enumeration is trivial
|
||
|
||
**Attack Scenarios:**
|
||
```
|
||
User A: Uploads sensitive PDF with jobId="company-financials"
|
||
User B: Calls /pdf/json/metadata/company-financials
|
||
-> Gets User A's PDF metadata
|
||
User B: Calls /pdf/json/page/company-financials/1
|
||
-> Gets User A's PDF page content
|
||
```
|
||
|
||
**Impact:**
|
||
- **Confidentiality Breach:** Unauthorized access to other users' documents
|
||
- **Data Integrity:** Users can corrupt each other's cached documents
|
||
- **Cache Collision:** Legitimate users get wrong data when IDs collide
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Server-generated UUIDs with user/session binding
|
||
@PostMapping("/pdf/json/metadata")
|
||
public ResponseEntity<byte[]> extractPdfMetadata(
|
||
@ModelAttribute PDFFile request,
|
||
HttpSession session) {
|
||
String jobId = UUID.randomUUID().toString();
|
||
String userId = session.getId(); // or authenticated user ID
|
||
|
||
// Store with composite key
|
||
String cacheKey = userId + ":" + jobId;
|
||
|
||
// Validate ownership on retrieval
|
||
if (!validateOwnership(cacheKey, session)) {
|
||
throw new AccessDeniedException("Invalid jobId");
|
||
}
|
||
|
||
return service.extractDocumentMetadata(request, cacheKey);
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #2: Mutable Cache Maps - CRITICAL Race Condition
|
||
|
||
**Location:** `PdfJsonConversionService.java:5009-5037, 4844-4852, 5170-5176`
|
||
|
||
**Severity:** CRITICAL (Data Corruption)
|
||
|
||
**Type:** Race Condition, Cache Mutation
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 5010-5037: CachedPdfDocument stores mutable maps
|
||
@lombok.Data
|
||
private static class CachedPdfDocument {
|
||
private final Map<String, PdfJsonFont> fonts; // Mutable reference!
|
||
private final Map<Integer, Map<PDFont, String>> pageFontResources; // Mutable!
|
||
}
|
||
|
||
// Line 5170-5176: Cached maps passed directly to stripper
|
||
TextCollectingStripper stripper = new TextCollectingStripper(
|
||
document,
|
||
cached.getFonts(), // ← Mutable map shared across threads!
|
||
textByPage,
|
||
cached.getPageFontResources(),
|
||
new IdentityHashMap<>());
|
||
|
||
// Line 4844-4852: registerFont() mutates the shared map
|
||
private String registerFont(PDFont font) throws IOException {
|
||
if (!fonts.containsKey(key)) {
|
||
fonts.put(key, buildFontModel(...)); // ← MUTATION!
|
||
}
|
||
return fontId;
|
||
}
|
||
```
|
||
|
||
**Race Condition:**
|
||
```
|
||
Time | Thread A (extractSinglePage job1, page1) | Thread B (extractSinglePage job1, page2)
|
||
-----|-------------------------------------------|------------------------------------------
|
||
T1 | Gets cached.getFonts() -> Map@123 |
|
||
T2 | | Gets cached.getFonts() -> Map@123 (same!)
|
||
T3 | registerFont() checks containsKey("1:F1") |
|
||
T4 | | registerFont() checks containsKey("2:F1")
|
||
T5 | fonts.put("1:F1", fontModel) |
|
||
T6 | | fonts.put("2:F1", fontModel)
|
||
T7 | Iterator over fonts |
|
||
T8 | | Concurrent modification! ← EXCEPTION
|
||
```
|
||
|
||
**Impact:**
|
||
- `ConcurrentModificationException` thrown to users
|
||
- Font metadata corruption (partial writes visible)
|
||
- Cache inconsistency between requests
|
||
- Non-deterministic failures under load
|
||
|
||
**Evidence of Mutation:**
|
||
- `@lombok.Data` generates `getFonts()` that returns the same mutable reference
|
||
- No defensive copying when retrieving from cache
|
||
- Multiple threads call `extractSinglePage` with same jobId simultaneously
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Option 1: Deep copy on retrieval
|
||
private static class CachedPdfDocument {
|
||
public Map<String, PdfJsonFont> getFonts() {
|
||
return new LinkedHashMap<>(fonts); // Defensive copy
|
||
}
|
||
}
|
||
|
||
// Option 2: Immutable maps
|
||
private static class CachedPdfDocument {
|
||
private final ImmutableMap<String, PdfJsonFont> fonts;
|
||
}
|
||
|
||
// Option 3: Synchronize mutations (less performant)
|
||
private synchronized String registerFont(PDFont font) { ... }
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #3: Font UID Not Globally Unique - CRITICAL Cache Collision
|
||
|
||
**Location:** `PdfJsonConversionService.java:715-722, 961, 147-148`
|
||
|
||
**Severity:** CRITICAL (Cache Corruption)
|
||
|
||
**Type:** Cache Key Collision, Memory Leak
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 715-722: UID construction
|
||
private String buildFontKey(int pageNumber, String fontId) {
|
||
return pageNumber + ":" + fontId; // Only unique within one document!
|
||
}
|
||
|
||
// Line 961: Used as cache key
|
||
.uid(buildFontKey(pageNumber, fontId))
|
||
|
||
// Line 147-148: Global singleton caches
|
||
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();
|
||
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();
|
||
```
|
||
|
||
**Problem:**
|
||
Font UIDs are only unique within a single document. The format is `"pageNumber:fontId"` (e.g., `"1:F1"`). When multiple jobs run concurrently:
|
||
|
||
```
|
||
Job A (invoice.pdf): Font UID = "1:F1" (Times-Roman)
|
||
Job B (report.pdf): Font UID = "1:F1" (Helvetica)
|
||
|
||
Both jobs share the SAME cache entry!
|
||
```
|
||
|
||
**Race Condition:**
|
||
```
|
||
T1: Job A converts PDF with font "1:F1" (Times-Roman)
|
||
T2: Job A caches PDFont@AAA in type3NormalizedFontCache["1:F1"]
|
||
T3: Job B converts PDF with font "1:F1" (Helvetica)
|
||
T4: Job B finds cache hit for "1:F1"
|
||
T5: Job B uses Times-Roman font (WRONG!) ← DATA CORRUPTION
|
||
```
|
||
|
||
**Memory Leak:**
|
||
```java
|
||
// PDFont objects hold references to their source PDDocument
|
||
PDFont cachedFont = ...; // Created from documentA
|
||
documentA.close(); // Document closed
|
||
// cachedFont still references freed native resources!
|
||
```
|
||
|
||
**Impact:**
|
||
- Wrong fonts used across different documents
|
||
- PDFBox native memory leaks (PDDocument held by cached PDFont)
|
||
- Cache grows unbounded (never cleared except during JSON→PDF)
|
||
- Type3GlyphCoverageCache **NEVER CLEARED** anywhere in codebase
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Include jobId in font UID
|
||
private String buildFontKey(String jobId, int pageNumber, String fontId) {
|
||
return jobId + ":" + pageNumber + ":" + fontId;
|
||
}
|
||
|
||
// Or scope caches per-job
|
||
private static class JobContext {
|
||
private final String jobId;
|
||
private final Map<String, PDFont> type3FontCache = new HashMap<>();
|
||
private final Map<String, Set<Integer>> glyphCoverageCache = new HashMap<>();
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #4: pageFontResources Keyed by PDFont Instances - CRITICAL
|
||
|
||
**Location:** `PdfJsonConversionService.java:5075-5081, 5158-5159`
|
||
|
||
**Severity:** CRITICAL (Broken Functionality)
|
||
|
||
**Type:** Object Identity Mismatch, Cache Miss
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 5075-5081: Initial metadata extraction (first PDF load)
|
||
try (PDDocument document = pdfDocumentFactory.load(pdfBytes, true)) {
|
||
Map<Integer, Map<PDFont, String>> pageFontResources = new HashMap<>();
|
||
for (PDPage page : document.getPages()) {
|
||
Map<PDFont, String> resourceMap = collectFontsForPage(...);
|
||
// resourceMap keys are PDFont instances from this document
|
||
pageFontResources.put(pageNumber, resourceMap);
|
||
}
|
||
// Cache it
|
||
documentCache.put(jobId, new CachedPdfDocument(..., pageFontResources, ...));
|
||
} // PDDocument closed, PDFont instances now reference freed document
|
||
|
||
// Line 5158-5159: Lazy page extraction (reloads PDF)
|
||
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
|
||
// NEW PDFont instances created!
|
||
PDPage page = document.getPage(pageIndex);
|
||
|
||
// Try to lookup fonts
|
||
Map<PDFont, String> cachedResourceMap = cached.getPageFontResources().get(pageNum);
|
||
// cachedResourceMap keys are OLD PDFont instances from closed document
|
||
|
||
// Lookup using NEW PDFont instances
|
||
String fontId = cachedResourceMap.get(newFont); // Always NULL! ← BUG
|
||
}
|
||
```
|
||
|
||
**Why It Fails:**
|
||
1. **Object Identity:** `Map<PDFont, String>` uses PDFont object identity as key
|
||
2. **Different Instances:** Each PDF load creates new PDFont instances with different identities
|
||
3. **Lookup Fails:** `cachedResourceMap.get(newFont)` returns null because `newFont != oldFont`
|
||
4. **Defeat Caching Goal:** Every lazy page request rebuilds font metadata, defeating the cache
|
||
|
||
**Impact:**
|
||
- Lazy page loading doesn't reuse cached font metadata
|
||
- CPU wasted rebuilding font info on every page request
|
||
- Cache only stores garbage (unusable keys)
|
||
- "Consistent font UID" feature completely broken
|
||
|
||
**Evidence:**
|
||
```java
|
||
// No code actually uses the cached pageFontResources successfully
|
||
// Every extractSinglePage call rebuilds fonts from scratch
|
||
```
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Use resource names as keys instead of PDFont objects
|
||
Map<Integer, Map<String, String>> pageFontResources = new HashMap<>();
|
||
// Key: font resource name (e.g., "F1"), Value: font UID
|
||
|
||
// Or use font UID directly
|
||
Map<Integer, Set<String>> pageFontUids = new HashMap<>();
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #5: Unbounded Thread Creation - CRITICAL Resource Leak
|
||
|
||
**Location:** `PdfJsonConversionService.java:5550-5562`
|
||
|
||
**Severity:** CRITICAL (Resource Exhaustion)
|
||
|
||
**Type:** Thread Leak, Memory Leak
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
private void scheduleDocumentCleanup(String jobId) {
|
||
new Thread(
|
||
() -> {
|
||
try {
|
||
Thread.sleep(TimeUnit.MINUTES.toMillis(30)); // Sleep 30 minutes!
|
||
clearCachedDocument(jobId);
|
||
log.debug("Auto-cleaned cached document for jobId: {}", jobId);
|
||
} catch (InterruptedException e) {
|
||
Thread.currentThread().interrupt();
|
||
}
|
||
})
|
||
.start(); // Unmanaged thread!
|
||
}
|
||
```
|
||
|
||
**Also in:** `PdfLazyLoadingService.java:256-269` (duplicate implementation)
|
||
|
||
**Problems:**
|
||
1. **One Thread Per Upload:** Each PDF upload spawns a new thread
|
||
2. **No Thread Pool:** Unlimited thread creation (no cap)
|
||
3. **Long Sleep:** Threads sleep for 30 minutes holding resources
|
||
4. **No Cancellation:** Cannot stop cleanup threads if job completes early
|
||
5. **Non-Daemon:** Threads prevent JVM shutdown (daemon=false by default)
|
||
6. **No Monitoring:** No visibility into active cleanup threads
|
||
|
||
**Impact Under Load:**
|
||
```
|
||
100 concurrent uploads → 100 cleanup threads spawned
|
||
Each thread: ~1MB stack + overhead
|
||
Total: ~100MB+ wasted on sleeping threads
|
||
|
||
1000 uploads/day → 1000+ threads accumulate
|
||
OS thread limit (ulimit -u) exceeded → OutOfMemoryError: unable to create new native thread
|
||
```
|
||
|
||
**Production Failure Scenario:**
|
||
```
|
||
09:00 - Peak traffic: 500 PDFs uploaded
|
||
09:00 - 500 cleanup threads spawned (each sleeps 30 min)
|
||
09:15 - 400 more uploads → 400 more threads
|
||
09:30 - First batch wakes up, cleans, exits
|
||
But new threads keep getting created faster than old ones die
|
||
10:00 - Thread count > 2000
|
||
10:15 - JVM hits OS thread limit
|
||
10:16 - Server crashes: "OutOfMemoryError: unable to create new native thread"
|
||
```
|
||
|
||
**Recommendation:**
|
||
```java
|
||
@Service
|
||
public class PdfJsonConversionService {
|
||
// Fixed-size thread pool for cleanup
|
||
private final ScheduledExecutorService cleanupScheduler =
|
||
Executors.newScheduledThreadPool(
|
||
2, // Only 2 threads needed
|
||
new ThreadFactoryBuilder()
|
||
.setNameFormat("pdf-cache-cleanup-%d")
|
||
.setDaemon(true)
|
||
.build()
|
||
);
|
||
|
||
private void scheduleDocumentCleanup(String jobId) {
|
||
cleanupScheduler.schedule(
|
||
() -> clearCachedDocument(jobId),
|
||
30,
|
||
TimeUnit.MINUTES
|
||
);
|
||
}
|
||
|
||
@PreDestroy
|
||
public void shutdown() {
|
||
cleanupScheduler.shutdown();
|
||
try {
|
||
if (!cleanupScheduler.awaitTermination(10, TimeUnit.SECONDS)) {
|
||
cleanupScheduler.shutdownNow();
|
||
}
|
||
} catch (InterruptedException e) {
|
||
cleanupScheduler.shutdownNow();
|
||
Thread.currentThread().interrupt();
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## HIGH SEVERITY ISSUES
|
||
|
||
### Issue #6: Unbounded Cache Growth - HIGH Memory Leak
|
||
|
||
**Location:** `PdfJsonConversionService.java:147-148, 154`
|
||
|
||
**Severity:** HIGH (Memory Exhaustion)
|
||
|
||
**Type:** Memory Leak, Missing Eviction Policy
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 147: Type3 normalized fonts - cleared only in convertJsonToPdf (line 454)
|
||
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();
|
||
|
||
// Line 148: Type3 glyph coverage - NEVER CLEARED anywhere in codebase!
|
||
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();
|
||
|
||
// Line 154: Document cache - relies on buggy cleanup threads
|
||
private final Map<String, CachedPdfDocument> documentCache = new ConcurrentHashMap<>();
|
||
```
|
||
|
||
**Growth Patterns:**
|
||
|
||
**1. type3NormalizedFontCache:**
|
||
- Written at line 3766: `type3NormalizedFontCache.put(fontModel.getUid(), font)`
|
||
- Cleared only at line 454: `type3NormalizedFontCache.clear()` (JSON→PDF conversion)
|
||
- **NOT cleared during PDF→JSON** conversion (most common operation)
|
||
- Each PDFont holds references to native resources (C++ objects via JNI)
|
||
- Grows unbounded during PDF→JSON operations
|
||
|
||
**2. type3GlyphCoverageCache:**
|
||
- Written at line 1122: `type3GlyphCoverageCache.put(fontUid, coverageSet)`
|
||
- **NEVER CLEARED** in entire codebase (verified via grep)
|
||
- Accumulates Set<Integer> for every Type3 font ever processed
|
||
- Each Set can contain thousands of integers (Unicode codepoints)
|
||
- Pure memory leak
|
||
|
||
**3. documentCache:**
|
||
- Stores full PDF bytes in memory
|
||
- Each entry can be 1MB-100MB+ (document bytes + metadata)
|
||
- Relies on cleanup threads (which have issues from Issue #5)
|
||
- If cleanup fails (exception, server restart), entries stay forever
|
||
- No max size check
|
||
|
||
**Impact:**
|
||
```
|
||
Long-running server processes 10,000 Type3 fonts:
|
||
- type3GlyphCoverageCache: 10,000 entries × ~1KB each = 10MB
|
||
- type3NormalizedFontCache: 1,000 cached fonts × ~100KB each = 100MB
|
||
- documentCache: 50 active jobs × 10MB each = 500MB
|
||
|
||
After 1 week: Caches grow to 1GB+
|
||
After 1 month: OutOfMemoryError, server restart required
|
||
```
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Use Caffeine cache with eviction policies
|
||
private final Cache<String, PDFont> type3NormalizedFontCache =
|
||
Caffeine.newBuilder()
|
||
.maximumSize(1000) // Max 1000 fonts
|
||
.expireAfterAccess(1, TimeUnit.HOURS) // Expire after 1hr unused
|
||
.removalListener((key, value, cause) -> {
|
||
// Cleanup PDFont resources if needed
|
||
})
|
||
.build();
|
||
|
||
private final Cache<String, Set<Integer>> type3GlyphCoverageCache =
|
||
Caffeine.newBuilder()
|
||
.maximumSize(5000)
|
||
.expireAfterWrite(1, TimeUnit.HOURS)
|
||
.build();
|
||
|
||
private final Cache<String, CachedPdfDocument> documentCache =
|
||
Caffeine.newBuilder()
|
||
.maximumWeight(500_000_000) // 500MB max
|
||
.weigher((String key, CachedPdfDocument doc) ->
|
||
doc.getPdfBytes().length)
|
||
.expireAfterWrite(30, TimeUnit.MINUTES)
|
||
.removalListener((key, value, cause) -> {
|
||
log.info("Evicted document {} (cause: {})", key, cause);
|
||
})
|
||
.build();
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #7: Type3 Cache Race Condition - HIGH
|
||
|
||
**Location:** `PdfJsonConversionService.java:3759-3773`
|
||
|
||
**Severity:** HIGH (Duplicate Work)
|
||
|
||
**Type:** Check-Then-Act Race Condition
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
private void loadNormalizedType3Font(
|
||
PDDocument document,
|
||
PdfJsonFont fontModel,
|
||
List<FontByteSource> candidates,
|
||
String originalFormat) throws IOException {
|
||
if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
|
||
return;
|
||
}
|
||
if (type3NormalizedFontCache.containsKey(fontModel.getUid())) { // CHECK
|
||
return;
|
||
}
|
||
for (FontByteSource source : candidates) {
|
||
PDFont font = loadFontFromSource(...); // EXPENSIVE: 10-50ms
|
||
if (font != null) {
|
||
type3NormalizedFontCache.put(fontModel.getUid(), font); // ACT
|
||
log.info("Cached normalized font {} for Type3 {}", ...);
|
||
break;
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
**Race Condition:**
|
||
```
|
||
Thread A: Check cache for "1:F1" → MISS (line 3759)
|
||
Thread B: Check cache for "1:F1" → MISS (line 3759) [both pass check!]
|
||
Thread A: Load font from bytes (10ms I/O + parsing)
|
||
Thread B: Load font from bytes (10ms I/O + parsing) ← DUPLICATE WORK
|
||
Thread A: Put font in cache (line 3766)
|
||
Thread B: Put font in cache (line 3766) [overwrites A's entry]
|
||
```
|
||
|
||
**Why ConcurrentHashMap Doesn't Help:**
|
||
- ConcurrentHashMap prevents **corruption** (map state stays consistent)
|
||
- ConcurrentHashMap does NOT prevent **duplicate work** (both threads compute)
|
||
- The check (`containsKey`) and act (`put`) are separate operations
|
||
|
||
**Impact:**
|
||
- Wasted CPU cycles loading same font twice
|
||
- Temporary memory spike (two fonts in heap simultaneously)
|
||
- Font loading is expensive: Base64 decode + PDFBox parsing + font validation
|
||
- Under high concurrency, 10+ threads could all load the same font
|
||
|
||
**Recommendation:**
|
||
```java
|
||
private void loadNormalizedType3Font(...) throws IOException {
|
||
if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
|
||
return;
|
||
}
|
||
|
||
// Atomic compute-if-absent
|
||
type3NormalizedFontCache.computeIfAbsent(fontModel.getUid(), uid -> {
|
||
for (FontByteSource source : candidates) {
|
||
try {
|
||
PDFont font = loadFontFromSource(...);
|
||
if (font != null) {
|
||
log.info("Cached normalized font {} for Type3 {}", ...);
|
||
return font;
|
||
}
|
||
} catch (IOException e) {
|
||
log.warn("Failed to load font from {}: {}", source.originLabel(), e.getMessage());
|
||
}
|
||
}
|
||
return null;
|
||
});
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #8: PDDocument Resource Lifecycle - NEEDS INVESTIGATION
|
||
|
||
**Location:** `PdfJsonConversionService.java:3766, 5158`
|
||
|
||
**Severity:** UNKNOWN (Requires Investigation)
|
||
|
||
**Type:** Unclear Resource Ownership
|
||
|
||
**Verified:** ⚠️ SPECULATIVE (No concrete evidence of failure)
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 3766: Cache PDFont created from a PDDocument
|
||
try (PDDocument document = ...) {
|
||
PDFont font = loadFontFromSource(document, fontModel, source, ...);
|
||
type3NormalizedFontCache.put(fontModel.getUid(), font);
|
||
} // PDDocument is closed here!
|
||
|
||
// Later: cached PDFont is used with a DIFFERENT PDDocument
|
||
try (PDDocument newDocument = ...) {
|
||
PDFont cachedFont = type3NormalizedFontCache.get(fontUid);
|
||
// Is cachedFont safe to use after original document closed?
|
||
// Does it hold references to freed native resources?
|
||
}
|
||
```
|
||
|
||
**Theoretical Concerns:**
|
||
1. **Native Memory:** PDFBox uses JNI for some operations
|
||
2. **Resource Ties:** PDFont may hold references to the source PDDocument
|
||
3. **Freed Resources:** Using PDFont after document closes could access freed memory
|
||
4. **Unclear Contract:** PDFBox documentation doesn't explicitly address font lifecycle
|
||
|
||
**Current Status:**
|
||
- ⚠️ **NO EVIDENCE OF ACTUAL FAILURES** - System appears to work in practice
|
||
- ⚠️ **NO CRASHES OBSERVED** - No segmentation faults or memory corruption reported
|
||
- ⚠️ **NO MEMORY LEAKS DETECTED** - No profiler data showing leaks
|
||
- ⚠️ **PURELY THEORETICAL CONCERN** - Based on API design, not observed behavior
|
||
|
||
**Why This May Actually Be Safe:**
|
||
- PDFBox may create self-contained PDFont objects
|
||
- Font data may be copied rather than referenced
|
||
- PDFBox may be designed for this use case
|
||
- Current code has been running without apparent issues
|
||
|
||
**Required Investigation:**
|
||
1. **PDFBox Source Code Review:** Check if PDFont copies or references document data
|
||
2. **Load Testing:** Create PDFont, close document, use font in new document
|
||
3. **Memory Profiling:** Monitor for native memory leaks over extended runs
|
||
4. **PDFBox Documentation/Forums:** Search for guidance on font lifecycle
|
||
|
||
**Recommendation:**
|
||
- **Priority: MEDIUM** (needs investigation but not blocking)
|
||
- Add monitoring for potential issues
|
||
- Test font reuse after document closure explicitly
|
||
- If problems found, cache serialized bytes instead of PDFont objects
|
||
|
||
```java
|
||
// Option 1: Cache font bytes instead of PDFont objects
|
||
private void cacheType3FontBytes(String fontUid, byte[] fontBytes) {
|
||
type3FontBytesCache.put(fontUid, fontBytes);
|
||
}
|
||
|
||
// Option 2: Verify font is safe to use
|
||
private PDFont getCachedFont(String fontUid) {
|
||
PDFont cached = type3NormalizedFontCache.get(fontUid);
|
||
if (cached != null && !isFontValid(cached)) {
|
||
log.warn("Cached font {} is invalid, removing", fontUid);
|
||
type3NormalizedFontCache.remove(fontUid);
|
||
return null;
|
||
}
|
||
return cached;
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## MEDIUM SEVERITY ISSUES
|
||
|
||
### Issue #9: Full PDF Reload Per Page - MEDIUM Performance
|
||
|
||
**Location:** `PdfJsonConversionService.java:5158-5159`
|
||
|
||
**Severity:** MEDIUM (Performance)
|
||
|
||
**Type:** Inefficient I/O
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// extractSinglePage method
|
||
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
|
||
// Full PDF loaded from bytes (10-100ms for large PDFs)
|
||
PDPage page = document.getPage(pageIndex);
|
||
// Extract just one page...
|
||
}
|
||
```
|
||
|
||
**Problem:**
|
||
Every page request loads the entire PDF from bytes:
|
||
- 100-page PDF = Load 10MB, extract 1 page
|
||
- 10 page requests = 10× full PDF loads (100MB I/O)
|
||
- No incremental parsing or streaming
|
||
|
||
**Impact:**
|
||
```
|
||
100MB PDF, 50 pages requested sequentially:
|
||
- Total I/O: 100MB × 50 = 5GB
|
||
- Time: 50× parse time (5-10 seconds total)
|
||
- Memory: 100MB peak per request
|
||
|
||
Concurrent page requests for same PDF:
|
||
- 10 threads × 100MB = 1GB temporary memory spike
|
||
```
|
||
|
||
**Why This Exists:**
|
||
Lazy loading design trades memory (don't cache full extraction) for CPU (reload on demand). But the tradeoff is poor because:
|
||
- PDFBox parsing is expensive
|
||
- Repeated decompression of streams
|
||
- Could cache extracted page data instead
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Option 1: Cache extracted page data
|
||
private static class CachedPage {
|
||
List<PdfJsonTextElement> textElements;
|
||
List<PdfJsonImageElement> imageElements;
|
||
// ... other page data
|
||
}
|
||
|
||
Map<String, Map<Integer, CachedPage>> pageCache = ...;
|
||
|
||
// Option 2: Keep PDF open with RandomAccessFile
|
||
private static class CachedPdfDocument {
|
||
private final RandomAccessReadBufferedFile randomAccess;
|
||
private final PDDocument document; // Keep open!
|
||
}
|
||
|
||
// Option 3: Pre-split pages at upload time
|
||
// Store each page as separate lightweight JSON blob
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #10: Large Base64 Operations - MEDIUM Performance
|
||
|
||
**Location:** `PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630`
|
||
|
||
**Severity:** MEDIUM (Performance Bottleneck)
|
||
|
||
**Type:** Synchronous Blocking Operation
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Encode large font programs
|
||
String base64 = Base64.getEncoder().encodeToString(fontBytes); // 10MB → 13MB
|
||
|
||
// Decode large font programs
|
||
byte[] bytes = Base64.getDecoder().decode(pdfProgram); // 13MB → 10MB
|
||
```
|
||
|
||
**Problem:**
|
||
- Large fonts (embedded TrueType, Type3) can be 5-10MB
|
||
- Base64 encoding inflates size by ~33%
|
||
- All encoding/decoding is synchronous on request threads
|
||
- CPU-intensive operation (20-50ms for 10MB)
|
||
|
||
**Impact:**
|
||
```
|
||
100 concurrent requests processing 10MB fonts:
|
||
- Each request: 30ms CPU time for Base64
|
||
- All threads blocked on encoding simultaneously
|
||
- Thread pool saturation (if using fixed-size pool)
|
||
- Other requests starved waiting for threads
|
||
|
||
Large PDF with 50 fonts:
|
||
- 50 × 30ms = 1.5 seconds just for Base64 operations
|
||
- User perceives slowness
|
||
```
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Option 1: Size limits
|
||
private static final int MAX_FONT_SIZE = 10 * 1024 * 1024; // 10MB
|
||
|
||
if (fontBytes.length > MAX_FONT_SIZE) {
|
||
throw new IllegalArgumentException("Font too large: " + fontBytes.length);
|
||
}
|
||
|
||
// Option 2: Streaming Base64 (for very large files)
|
||
OutputStream base64Out = Base64.getEncoder().wrap(outputStream);
|
||
inputStream.transferTo(base64Out);
|
||
|
||
// Option 3: Async processing
|
||
CompletableFuture<String> encodeFuture = CompletableFuture.supplyAsync(
|
||
() -> Base64.getEncoder().encodeToString(fontBytes),
|
||
fontEncodingExecutor
|
||
);
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #11: File I/O on Request Threads - MEDIUM
|
||
|
||
**Location:** `PdfJsonConversionService.java:276, 405, 5066`
|
||
|
||
**Severity:** MEDIUM (Performance)
|
||
|
||
**Type:** Blocking I/O
|
||
|
||
**Verified:** ❌ PARTIALLY TRUE
|
||
|
||
**Description:**
|
||
```java
|
||
// Line 276: Write upload to disk
|
||
file.transferTo(originalFile.getFile());
|
||
|
||
// Line 405: Read full file into memory
|
||
byte[] cachedPdfBytes = Files.readAllBytes(workingPath);
|
||
|
||
// Line 5066: Get uploaded file bytes
|
||
byte[] pdfBytes = file.getBytes();
|
||
```
|
||
|
||
**Clarification:**
|
||
- These are in DIFFERENT methods (not double-reads within one operation)
|
||
- Each method reads the file once
|
||
- Still synchronous blocking I/O
|
||
|
||
**Impact:**
|
||
- Large uploads (100MB) block request thread for seconds
|
||
- No async or streaming support
|
||
- Thread pool saturation under high upload volume
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Async file I/O
|
||
CompletableFuture<Path> uploadFuture = CompletableFuture.supplyAsync(
|
||
() -> {
|
||
Path tempPath = Files.createTempFile("pdf-upload", ".pdf");
|
||
file.transferTo(tempPath.toFile());
|
||
return tempPath;
|
||
},
|
||
fileIoExecutor
|
||
);
|
||
|
||
// Stream large files
|
||
try (InputStream in = file.getInputStream();
|
||
OutputStream out = Files.newOutputStream(targetPath)) {
|
||
in.transferTo(out);
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## LOW SEVERITY ISSUES
|
||
|
||
### Issue #12: PdfLazyLoadingService Unused - LOW
|
||
|
||
**Location:** `PdfLazyLoadingService.java` (entire file)
|
||
|
||
**Severity:** LOW (Code Quality)
|
||
|
||
**Type:** Dead Code
|
||
|
||
**Verified:** ✅ TRUE
|
||
|
||
**Description:**
|
||
- Complete service implementation exists
|
||
- Has its own `documentCache` and cleanup logic
|
||
- Duplicates functionality in `PdfJsonConversionService`
|
||
- Not wired to any controller
|
||
- Not imported by any other class
|
||
|
||
**Impact:**
|
||
- Code maintenance burden
|
||
- Confusing for developers
|
||
- Potential for accidental use in future
|
||
- Cache divergence if both ever get used
|
||
|
||
**Recommendation:**
|
||
```java
|
||
// Delete PdfLazyLoadingService.java entirely
|
||
// Or clearly mark as @Deprecated with explanation
|
||
```
|
||
|
||
---
|
||
|
||
### Issue #13: PdfJsonFontService Volatile Fields - LOW
|
||
|
||
**Location:** `PdfJsonFontService.java:46-47`
|
||
|
||
**Severity:** LOW (Actually Correct)
|
||
|
||
**Type:** None (Good Practice)
|
||
|
||
**Verified:** ✅ TRUE (No issue, correctly implemented)
|
||
|
||
**Description:**
|
||
```java
|
||
private volatile boolean pythonCffConverterAvailable;
|
||
private volatile boolean fontForgeCffConverterAvailable;
|
||
|
||
@PostConstruct
|
||
private void initialiseCffConverterAvailability() {
|
||
pythonCffConverterAvailable = isCommandAvailable(pythonCommand);
|
||
fontForgeCffConverterAvailable = isCommandAvailable(fontforgeCommand);
|
||
}
|
||
```
|
||
|
||
**Why This Is Correct:**
|
||
- `volatile` ensures visibility across threads
|
||
- Set once at startup
|
||
- Read many times (thread-safe)
|
||
- No synchronization needed
|
||
|
||
**Recommendation:** None - this is good practice.
|
||
|
||
---
|
||
|
||
## VERIFIED FALSE CLAIMS
|
||
|
||
### Claim: file.getBytes() Called Twice
|
||
|
||
**Status:** ❌ FALSE
|
||
|
||
**Explanation:** The claim stated that `file.getBytes()` is called twice (lines 446, 5065). Investigation shows:
|
||
- Line 446: `convertJsonToPdf` method
|
||
- Line 5065: `extractDocumentMetadata` method
|
||
- These are DIFFERENT methods for DIFFERENT operations
|
||
- Each method calls `getBytes()` only once
|
||
|
||
**Conclusion:** Not a double-read issue.
|
||
|
||
---
|
||
|
||
### Claim: Image Base64 Encoding Per Call
|
||
|
||
**Status:** ❌ FALSE
|
||
|
||
**Explanation:** The claim stated images are Base64-encoded on every call. Investigation shows:
|
||
```java
|
||
// PdfJsonImageService.java:430-450
|
||
private EncodedImage getOrEncodeImage(PDImage pdImage) {
|
||
COSBase key = xObject.getCOSObject();
|
||
EncodedImage cached = imageCache.get(key); // Cache check!
|
||
if (cached != null) {
|
||
return cached; // Cache hit
|
||
}
|
||
EncodedImage encoded = encodeImage(pdImage);
|
||
imageCache.put(key, encoded); // Cache miss, encode and store
|
||
return encoded;
|
||
}
|
||
```
|
||
|
||
**Conclusion:** Images ARE cached. Only stencil and inline images bypass cache.
|
||
|
||
---
|
||
|
||
## ARCHITECTURE ISSUES
|
||
|
||
### Issue #14: Singleton Service Architecture - MEDIUM
|
||
|
||
**Location:** All `@Service` and `@Component` classes
|
||
|
||
**Severity:** MEDIUM (Maintainability)
|
||
|
||
**Type:** Architectural Pattern
|
||
|
||
**Description:**
|
||
All services use default singleton scope:
|
||
```java
|
||
@Service // Defaults to singleton
|
||
public class PdfJsonConversionService {
|
||
// Shared instance variables across all requests
|
||
private final Map<String, PDFont> type3NormalizedFontCache = ...;
|
||
}
|
||
```
|
||
|
||
**Implications:**
|
||
✅ **Good:**
|
||
- Most dependencies are stateless and injected
|
||
- Caches use ConcurrentHashMap (thread-safe)
|
||
- No mutable instance variables beyond caches
|
||
|
||
⚠️ **Risks:**
|
||
- Singleton means shared state across all requests
|
||
- Requires careful synchronization
|
||
- Easy for future developers to introduce thread-unsafe code
|
||
- Difficult to test concurrent scenarios
|
||
|
||
**Recommendation:**
|
||
- Document thread-safety requirements prominently
|
||
- Add unit tests for concurrent access
|
||
- Consider request-scoped services for mutable state
|
||
- Code review checklist for new instance variables
|
||
|
||
---
|
||
|
||
## SUMMARY BY SEVERITY
|
||
|
||
### CRITICAL (Fix Immediately)
|
||
1. ✅ **User-supplied jobId** (Issue #1) - Security vulnerability
|
||
2. ✅ **Mutable cache maps** (Issue #2) - Data corruption
|
||
3. ✅ **Font UID collisions** (Issue #3) - Cache corruption
|
||
4. ✅ **pageFontResources PDFont keys** (Issue #4) - Broken feature
|
||
5. ✅ **Unbounded thread creation** (Issue #5) - Resource exhaustion
|
||
|
||
### HIGH (Fix Soon)
|
||
6. ✅ **Unbounded cache growth** (Issue #6) - Memory leak
|
||
7. ✅ **Type3 cache race** (Issue #7) - Duplicate work
|
||
8. ⚠️ **PDDocument lifecycle** (Issue #8) - Needs investigation (speculative)
|
||
|
||
### MEDIUM (Plan and Address)
|
||
9. ✅ **Full PDF reload per page** (Issue #9) - Performance
|
||
10. ✅ **Large Base64 operations** (Issue #10) - Performance
|
||
11. ✅ **File I/O blocking** (Issue #11) - Performance
|
||
12. ✅ **PdfLazyLoadingService unused** (Issue #12) - Dead code
|
||
13. ✅ **Singleton architecture** (Issue #14) - Maintainability
|
||
|
||
### LOW (Monitor)
|
||
14. ✅ **PdfJsonFontService volatile** (Issue #13) - Correctly implemented
|
||
|
||
### VERIFIED FALSE
|
||
15. ❌ file.getBytes() called twice
|
||
16. ❌ Image Base64 encoding per call
|
||
|
||
---
|
||
|
||
## IMPLEMENTATION ROADMAP
|
||
|
||
### Phase 1: Critical Security & Data Integrity (1-2 weeks)
|
||
|
||
**1. Fix jobId Security (Issue #1)**
|
||
```java
|
||
// Priority: CRITICAL
|
||
// Time: 2 days
|
||
// Risk: Low (straightforward fix)
|
||
|
||
// Generate server-side UUIDs
|
||
// Add session/user binding
|
||
// Validate ownership on all cache operations
|
||
```
|
||
|
||
**2. Fix Cache Mutation (Issues #2, #3, #4)**
|
||
```java
|
||
// Priority: CRITICAL
|
||
// Time: 3-5 days
|
||
// Risk: Medium (requires careful testing)
|
||
|
||
// Make CachedPdfDocument immutable
|
||
// Scope font caches per-job
|
||
// Replace PDFont keys with String resource names
|
||
// Add defensive copying
|
||
```
|
||
|
||
### Phase 2: Resource Management (1 week)
|
||
|
||
**3. Fix Thread Leaks (Issue #5)**
|
||
```java
|
||
// Priority: CRITICAL
|
||
// Time: 1 day
|
||
// Risk: Low (well-understood solution)
|
||
|
||
// Replace new Thread() with ScheduledExecutorService
|
||
// Add @PreDestroy cleanup
|
||
// Monitor thread counts
|
||
```
|
||
|
||
**4. Add Cache Eviction (Issue #6, previously also listed as Issue #11)**
|
||
```java
|
||
// Priority: HIGH
|
||
// Time: 3 days
|
||
// Risk: Low (library-based solution)
|
||
|
||
// Integrate Caffeine cache
|
||
// Set size limits, TTL
|
||
// Add eviction logging
|
||
// Monitor cache metrics
|
||
```
|
||
|
||
### Phase 3: Performance Optimization (2-3 weeks)
|
||
|
||
**5. Optimize Lazy Loading (Issue #9)**
|
||
```java
|
||
// Priority: MEDIUM
|
||
// Time: 1 week
|
||
// Risk: Medium (requires benchmarking)
|
||
|
||
// Cache extracted page data
|
||
// Or: Keep PDDocument open with RandomAccessFile
|
||
// Or: Pre-split pages at upload
|
||
```
|
||
|
||
**6. Async I/O (Issues #10, #12)**
|
||
```java
|
||
// Priority: MEDIUM
|
||
// Time: 3-5 days
|
||
// Risk: Medium (requires async architecture changes)
|
||
|
||
// Add dedicated I/O thread pool
|
||
// Async file operations
|
||
// Stream large files
|
||
```
|
||
|
||
### Phase 4: Code Quality (1 week)
|
||
|
||
**7. Remove Dead Code (Issue #12)**
|
||
```java
|
||
// Priority: LOW
|
||
// Time: 1 day
|
||
// Risk: None
|
||
|
||
// Delete PdfLazyLoadingService
|
||
// Clean up unused imports
|
||
```
|
||
|
||
**8. Documentation & Testing**
|
||
```java
|
||
// Priority: MEDIUM
|
||
// Time: 3-5 days
|
||
|
||
// Add thread-safety documentation
|
||
// Concurrent integration tests
|
||
// Load testing scripts
|
||
```
|
||
|
||
---
|
||
|
||
## TESTING STRATEGY
|
||
|
||
### 1. Concurrency Tests
|
||
|
||
```java
|
||
@SpringBootTest
|
||
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
|
||
class ConcurrencyTest {
|
||
|
||
@Test
|
||
void testConcurrentCacheAccess() throws Exception {
|
||
ExecutorService executor = Executors.newFixedThreadPool(20);
|
||
CountDownLatch latch = new CountDownLatch(100);
|
||
List<Future<?>> futures = new ArrayList<>();
|
||
|
||
// 100 requests across 10 jobIds (10 requests per job)
|
||
for (int i = 0; i < 100; i++) {
|
||
String jobId = "job-" + (i % 10);
|
||
int pageNum = (i % 5) + 1;
|
||
|
||
futures.add(executor.submit(() -> {
|
||
try {
|
||
service.extractSinglePage(jobId, pageNum);
|
||
} catch (Exception e) {
|
||
log.error("Concurrent access failed", e);
|
||
throw e;
|
||
} finally {
|
||
latch.countDown();
|
||
}
|
||
}));
|
||
}
|
||
|
||
// Wait for completion
|
||
assertTrue(latch.await(60, TimeUnit.SECONDS));
|
||
|
||
// Check for exceptions
|
||
for (Future<?> future : futures) {
|
||
future.get(); // Throws if any task failed
|
||
}
|
||
}
|
||
|
||
@Test
|
||
void testCacheNotCorrupted() throws Exception {
|
||
// Upload document
|
||
String jobId = "test-job";
|
||
service.extractDocumentMetadata(testPdf, jobId);
|
||
|
||
// Concurrent page requests
|
||
ExecutorService executor = Executors.newFixedThreadPool(10);
|
||
List<Future<byte[]>> futures = new ArrayList<>();
|
||
|
||
for (int i = 0; i < 50; i++) {
|
||
int page = (i % 10) + 1;
|
||
futures.add(executor.submit(() ->
|
||
service.extractSinglePage(jobId, page)));
|
||
}
|
||
|
||
// All should succeed without ConcurrentModificationException
|
||
for (Future<byte[]> future : futures) {
|
||
assertNotNull(future.get());
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
### 2. Memory Leak Tests
|
||
|
||
```java
|
||
@Test
|
||
void testCacheDoesNotGrowUnbounded() {
|
||
long initialHeap = getHeapUsage();
|
||
|
||
// Process 10,000 small PDFs with Type3 fonts
|
||
for (int i = 0; i < 10000; i++) {
|
||
service.convertPdfToJson(createTestPdfWithType3Fonts());
|
||
}
|
||
|
||
// Force GC
|
||
System.gc();
|
||
Thread.sleep(1000);
|
||
|
||
long finalHeap = getHeapUsage();
|
||
long growth = finalHeap - initialHeap;
|
||
|
||
// Cache should not grow beyond reasonable limit
|
||
assertThat(growth).isLessThan(100_000_000); // 100MB max
|
||
}
|
||
|
||
@Test
|
||
void testThreadsNotLeaking() {
|
||
int initialThreads = getActiveThreadCount();
|
||
|
||
// Upload 100 PDFs (spawns 100 cleanup threads)
|
||
for (int i = 0; i < 100; i++) {
|
||
service.extractDocumentMetadata(testPdf, "job-" + i);
|
||
}
|
||
|
||
int peakThreads = getActiveThreadCount();
|
||
|
||
// Should not create 100+ threads
|
||
assertThat(peakThreads - initialThreads).isLessThan(10);
|
||
}
|
||
|
||
private long getHeapUsage() {
|
||
Runtime runtime = Runtime.getRuntime();
|
||
return runtime.totalMemory() - runtime.freeMemory();
|
||
}
|
||
|
||
private int getActiveThreadCount() {
|
||
return Thread.getAllStackTraces().size();
|
||
}
|
||
```
|
||
|
||
### 3. Security Tests
|
||
|
||
```java
|
||
@Test
|
||
void testJobIdIsolation() {
|
||
// User A uploads PDF
|
||
String jobIdA = service.extractDocumentMetadata(userAPdf, sessionA);
|
||
|
||
// User B tries to access User A's jobId
|
||
assertThrows(AccessDeniedException.class, () -> {
|
||
service.extractSinglePage(jobIdA, 1, sessionB);
|
||
});
|
||
}
|
||
|
||
@Test
|
||
void testJobIdUnpredictable() {
|
||
Set<String> jobIds = new HashSet<>();
|
||
|
||
for (int i = 0; i < 1000; i++) {
|
||
String jobId = service.extractDocumentMetadata(testPdf, session);
|
||
jobIds.add(jobId);
|
||
}
|
||
|
||
// All jobIds should be unique UUIDs
|
||
assertThat(jobIds).hasSize(1000);
|
||
|
||
// Should not be sequential
|
||
List<String> sorted = new ArrayList<>(jobIds);
|
||
Collections.sort(sorted);
|
||
assertThat(sorted).isNotEqualTo(new ArrayList<>(jobIds));
|
||
}
|
||
```
|
||
|
||
### 4. Performance Tests
|
||
|
||
```java
|
||
@Test
|
||
void testLargeFilePerformance() {
|
||
// 100MB PDF
|
||
byte[] largePdf = createLargePdf(100 * 1024 * 1024);
|
||
|
||
long start = System.currentTimeMillis();
|
||
String json = service.convertPdfToJson(largePdf);
|
||
long duration = System.currentTimeMillis() - start;
|
||
|
||
// Should complete in reasonable time
|
||
assertThat(duration).isLessThan(30_000); // 30 seconds
|
||
}
|
||
|
||
@Test
|
||
void testConcurrentThroughput() throws Exception {
|
||
ExecutorService executor = Executors.newFixedThreadPool(50);
|
||
CountDownLatch latch = new CountDownLatch(500);
|
||
|
||
long start = System.currentTimeMillis();
|
||
|
||
for (int i = 0; i < 500; i++) {
|
||
executor.submit(() -> {
|
||
try {
|
||
service.convertPdfToJson(testPdf);
|
||
} finally {
|
||
latch.countDown();
|
||
}
|
||
});
|
||
}
|
||
|
||
latch.await();
|
||
long duration = System.currentTimeMillis() - start;
|
||
|
||
// 500 conversions should complete in reasonable time
|
||
double throughput = 500.0 / (duration / 1000.0);
|
||
assertThat(throughput).isGreaterThan(10); // At least 10 conversions/sec
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## MONITORING & METRICS
|
||
|
||
### Recommended Metrics (Micrometer)
|
||
|
||
```java
|
||
@Service
|
||
public class PdfJsonConversionService {
|
||
|
||
private final MeterRegistry meterRegistry;
|
||
|
||
// Cache size gauges
|
||
@PostConstruct
|
||
void registerMetrics() {
|
||
Gauge.builder("pdf.cache.document.size", documentCache, Map::size)
|
||
.description("Number of cached documents")
|
||
.register(meterRegistry);
|
||
|
||
Gauge.builder("pdf.cache.type3font.size", type3NormalizedFontCache, Map::size)
|
||
.description("Number of cached Type3 fonts")
|
||
.register(meterRegistry);
|
||
|
||
Gauge.builder("pdf.cache.coverage.size", type3GlyphCoverageCache, Map::size)
|
||
.description("Number of cached glyph coverage sets")
|
||
.register(meterRegistry);
|
||
|
||
Gauge.builder("pdf.threads.cleanup", this::getCleanupThreadCount)
|
||
.description("Active cleanup threads")
|
||
.register(meterRegistry);
|
||
}
|
||
|
||
// Operation timers
|
||
public String convertPdfToJson(byte[] pdfBytes) {
|
||
Timer.Sample sample = Timer.start(meterRegistry);
|
||
try {
|
||
String result = doConvertPdfToJson(pdfBytes);
|
||
sample.stop(meterRegistry.timer("pdf.convert.toJson"));
|
||
return result;
|
||
} catch (Exception e) {
|
||
meterRegistry.counter("pdf.convert.errors", "operation", "toJson").increment();
|
||
throw e;
|
||
}
|
||
}
|
||
|
||
// Cache hit/miss counters
|
||
private PDFont getCachedType3Font(String uid) {
|
||
PDFont cached = type3NormalizedFontCache.get(uid);
|
||
if (cached != null) {
|
||
meterRegistry.counter("pdf.cache.type3font.hits").increment();
|
||
} else {
|
||
meterRegistry.counter("pdf.cache.type3font.misses").increment();
|
||
}
|
||
return cached;
|
||
}
|
||
}
|
||
```
|
||
|
||
### Alerts
|
||
|
||
```yaml
|
||
alerts:
|
||
# Cache growth
|
||
- name: DocumentCacheTooLarge
|
||
condition: pdf_cache_document_size > 100
|
||
severity: warning
|
||
|
||
- name: Type3CacheTooLarge
|
||
condition: pdf_cache_type3font_size > 1000
|
||
severity: warning
|
||
|
||
# Thread leaks
|
||
- name: TooManyCleanupThreads
|
||
condition: pdf_threads_cleanup > 10
|
||
severity: critical
|
||
|
||
# Memory pressure
|
||
- name: HeapUsageHigh
|
||
condition: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
|
||
severity: warning
|
||
|
||
# Performance
|
||
- name: SlowConversions
|
||
condition: pdf_convert_toJson{quantile="0.95"} > 10s
|
||
severity: warning
|
||
|
||
# Error rate
|
||
- name: HighErrorRate
|
||
condition: rate(pdf_convert_errors[5m]) > 0.1
|
||
severity: critical
|
||
```
|
||
|
||
---
|
||
|
||
## CONCLUSION
|
||
|
||
The PDF JSON editor has **CRITICAL** issues that must be fixed before production deployment:
|
||
|
||
### Must-Fix Issues (Blocks Production):
|
||
1. **User-supplied jobId** - Security vulnerability enabling cache poisoning and information disclosure
|
||
2. **Mutable cache maps** - Causes ConcurrentModificationException under load
|
||
3. **Font UID collisions** - Different documents overwrite each other's font caches
|
||
4. **pageFontResources broken** - Lazy page loading completely broken
|
||
5. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError
|
||
|
||
### Should-Fix Issues (Prevents Scale):
|
||
6. **Unbounded cache growth** - Memory leaks require server restarts
|
||
7. **Type3 cache races** - Wasted CPU doing duplicate work
|
||
8. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet)
|
||
|
||
### Performance Improvements (Nice-to-Have):
|
||
9. Full PDF reload per page
|
||
10. Large Base64 operations
|
||
11. Synchronous file I/O
|
||
|
||
### Code Quality Issues:
|
||
12. PdfLazyLoadingService dead code
|
||
13. Documentation of thread-safety requirements
|
||
|
||
**Estimated Effort:**
|
||
- Critical fixes: 2-3 weeks
|
||
- High priority: 1 week
|
||
- Performance: 2-3 weeks
|
||
- **Total:** 5-7 weeks for complete remediation
|
||
|
||
**Recommendation:** Fix Critical issues immediately, then address High priority issues before beta testing.
|