PDF JSON Editor - Threading, Concurrency, and Performance Analysis
Date: 2025-01-09
Version: 1.0
Status: Comprehensive analysis combining automated review and manual verification
Executive Summary
This analysis identifies CRITICAL security vulnerabilities, thread safety issues, and performance problems in the PDF JSON editor codebase. The service contains:
- 1 CRITICAL security vulnerability (cache poisoning/information disclosure)
- 3 CRITICAL threading issues causing data corruption
- 2 severe resource leaks (unbounded threads and caches) causing memory exhaustion
- Multiple performance bottlenecks limiting scalability
Immediate Action Required:
- Fix user-supplied jobId security vulnerability (Issue #1)
- Fix cache mutation race conditions (Issues #2, #3, #4)
- Replace unbounded thread spawning (Issue #5)
- Add cache size limits (Issue #6)
CRITICAL ISSUES
Issue #1: User-Supplied jobId - CRITICAL SECURITY VULNERABILITY ⚠️
Location: ConvertPdfJsonController.java:95, 120, 148, 160
Severity: CRITICAL (Security Vulnerability)
Type: Cache Poisoning, Information Disclosure, Access Control Bypass
Verified: ✅ TRUE
Description:
// Line 95
public ResponseEntity<byte[]> extractPdfMetadata(
@ModelAttribute PDFFile request,
@RequestParam(required = true) String jobId) // User-controlled!
// Lines 120, 148, 160
@PathVariable String jobId // User-controlled!
Security Issues:
- Information Disclosure: Users can guess/enumerate jobIds to access other users' cached PDFs
- Cache Poisoning: Users can overwrite other users' cache entries by supplying the same jobId
- No Authentication/Authorization: No check to verify jobId ownership
- Predictable IDs: If users choose sequential IDs ("job1", "job2"), enumeration is trivial
Attack Scenarios:
User A: Uploads sensitive PDF with jobId="company-financials"
User B: Calls /pdf/json/metadata/company-financials
-> Gets User A's PDF metadata
User B: Calls /pdf/json/page/company-financials/1
-> Gets User A's PDF page content
Impact:
- Confidentiality Breach: Unauthorized access to other users' documents
- Data Integrity: Users can corrupt each other's cached documents
- Cache Collision: Legitimate users get wrong data when IDs collide
Recommendation:
// Server-generated UUIDs with user/session binding
@PostMapping("/pdf/json/metadata")
public ResponseEntity<byte[]> extractPdfMetadata(
        @ModelAttribute PDFFile request,
        HttpSession session) {
    String jobId = UUID.randomUUID().toString(); // unguessable, server-generated
    String userId = session.getId(); // or authenticated user ID
    // Store under a composite key so entries are scoped to this session
    String cacheKey = userId + ":" + jobId;
    // Return jobId to the client; retrieval endpoints validate ownership (sketch below)
    return service.extractDocumentMetadata(request, cacheKey);
}
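A hedged sketch of the matching retrieval side (the path mirrors the existing /pdf/json/page endpoint; hasCachedDocument is an assumed helper on the service). Rebuilding the composite key from the caller's own session means a guessed or stolen jobId from another session simply misses the cache instead of leaking data:
@GetMapping("/pdf/json/page/{jobId}/{pageNumber}")
public ResponseEntity<byte[]> extractPdfPage(
        @PathVariable String jobId,
        @PathVariable int pageNumber,
        HttpSession session) {
    // Rebuild the composite key from the caller's session, never from user input
    String cacheKey = session.getId() + ":" + jobId;
    if (!service.hasCachedDocument(cacheKey)) { // assumed helper
        throw new AccessDeniedException("Unknown jobId for this session");
    }
    return ResponseEntity.ok(service.extractSinglePage(cacheKey, pageNumber));
}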
Issue #2: Mutable Cache Maps - CRITICAL Race Condition
Location: PdfJsonConversionService.java:5009-5037, 4844-4852, 5170-5176
Severity: CRITICAL (Data Corruption)
Type: Race Condition, Cache Mutation
Verified: ✅ TRUE
Description:
// Line 5010-5037: CachedPdfDocument stores mutable maps
@lombok.Data
private static class CachedPdfDocument {
private final Map<String, PdfJsonFont> fonts; // Mutable reference!
private final Map<Integer, Map<PDFont, String>> pageFontResources; // Mutable!
}
// Line 5170-5176: Cached maps passed directly to stripper
TextCollectingStripper stripper = new TextCollectingStripper(
document,
cached.getFonts(), // ← Mutable map shared across threads!
textByPage,
cached.getPageFontResources(),
new IdentityHashMap<>());
// Line 4844-4852: registerFont() mutates the shared map
private String registerFont(PDFont font) throws IOException {
if (!fonts.containsKey(key)) {
fonts.put(key, buildFontModel(...)); // ← MUTATION!
}
return fontId;
}
Race Condition:
Time | Thread A (extractSinglePage job1, page1) | Thread B (extractSinglePage job1, page2)
-----|-------------------------------------------|------------------------------------------
T1 | Gets cached.getFonts() -> Map@123 |
T2 | | Gets cached.getFonts() -> Map@123 (same!)
T3 | registerFont() checks containsKey("1:F1") |
T4 | | registerFont() checks containsKey("2:F1")
T5 | fonts.put("1:F1", fontModel) |
T6 | | fonts.put("2:F1", fontModel)
T7 | Iterator over fonts |
T8 | | Concurrent modification! ← EXCEPTION
Impact:
- ConcurrentModificationException thrown to users
- Font metadata corruption (partial writes visible)
- Cache inconsistency between requests
- Non-deterministic failures under load
Evidence of Mutation:
- @lombok.Data generates getFonts() that returns the same mutable reference
- No defensive copying when retrieving from cache
- Multiple threads call extractSinglePage with the same jobId simultaneously
Recommendation:
// Option 1: Deep copy on retrieval
private static class CachedPdfDocument {
public Map<String, PdfJsonFont> getFonts() {
return new LinkedHashMap<>(fonts); // Defensive copy
}
}
// Option 2: Immutable maps
private static class CachedPdfDocument {
private final ImmutableMap<String, PdfJsonFont> fonts;
}
// Option 3: Synchronize mutations (less performant)
private synchronized String registerFont(PDFont font) { ... }
Issue #3: Font UID Not Globally Unique - CRITICAL Cache Collision
Location: PdfJsonConversionService.java:715-722, 961, 147-148
Severity: CRITICAL (Cache Corruption)
Type: Cache Key Collision, Memory Leak
Verified: ✅ TRUE
Description:
// Line 715-722: UID construction
private String buildFontKey(int pageNumber, String fontId) {
return pageNumber + ":" + fontId; // Only unique within one document!
}
// Line 961: Used as cache key
.uid(buildFontKey(pageNumber, fontId))
// Line 147-148: Global singleton caches
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();
Problem:
Font UIDs are only unique within a single document. The format is "pageNumber:fontId" (e.g., "1:F1"). When multiple jobs run concurrently:
Job A (invoice.pdf): Font UID = "1:F1" (Times-Roman)
Job B (report.pdf): Font UID = "1:F1" (Helvetica)
Both jobs share the SAME cache entry!
Race Condition:
T1: Job A converts PDF with font "1:F1" (Times-Roman)
T2: Job A caches PDFont@AAA in type3NormalizedFontCache["1:F1"]
T3: Job B converts PDF with font "1:F1" (Helvetica)
T4: Job B finds cache hit for "1:F1"
T5: Job B uses Times-Roman font (WRONG!) ← DATA CORRUPTION
Memory Leak:
// PDFont objects hold references to their source PDDocument
PDFont cachedFont = ...; // Created from documentA
documentA.close(); // Document closed
// cachedFont still references the closed document's underlying streams and buffers!
Impact:
- Wrong fonts used across different documents
- PDFBox resource leaks (buffers of the closed PDDocument held by cached PDFont)
- Cache grows unbounded (never cleared except during JSON→PDF)
- Type3GlyphCoverageCache NEVER CLEARED anywhere in codebase
Recommendation:
// Include jobId in font UID
private String buildFontKey(String jobId, int pageNumber, String fontId) {
return jobId + ":" + pageNumber + ":" + fontId;
}
// Or scope caches per-job
private static class JobContext {
private final String jobId;
private final Map<String, PDFont> type3FontCache = new HashMap<>();
private final Map<String, Set<Integer>> glyphCoverageCache = new HashMap<>();
}
Issue #4: pageFontResources Keyed by PDFont Instances - CRITICAL
Location: PdfJsonConversionService.java:5075-5081, 5158-5159
Severity: CRITICAL (Broken Functionality)
Type: Object Identity Mismatch, Cache Miss
Verified: ✅ TRUE
Description:
// Line 5075-5081: Initial metadata extraction (first PDF load)
try (PDDocument document = pdfDocumentFactory.load(pdfBytes, true)) {
Map<Integer, Map<PDFont, String>> pageFontResources = new HashMap<>();
for (PDPage page : document.getPages()) {
Map<PDFont, String> resourceMap = collectFontsForPage(...);
// resourceMap keys are PDFont instances from this document
pageFontResources.put(pageNumber, resourceMap);
}
// Cache it
documentCache.put(jobId, new CachedPdfDocument(..., pageFontResources, ...));
} // PDDocument closed, PDFont instances now reference freed document
// Line 5158-5159: Lazy page extraction (reloads PDF)
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
// NEW PDFont instances created!
PDPage page = document.getPage(pageIndex);
// Try to lookup fonts
Map<PDFont, String> cachedResourceMap = cached.getPageFontResources().get(pageNum);
// cachedResourceMap keys are OLD PDFont instances from closed document
// Lookup using NEW PDFont instances
String fontId = cachedResourceMap.get(newFont); // Always NULL! ← BUG
}
Why It Fails:
- Object Identity: Map<PDFont, String> uses PDFont object identity as the key
- Different Instances: Each PDF load creates new PDFont instances with different identities
- Lookup Fails: cachedResourceMap.get(newFont) returns null because newFont != oldFont
- Defeats Caching Goal: Every lazy page request rebuilds font metadata, defeating the cache
Impact:
- Lazy page loading doesn't reuse cached font metadata
- CPU wasted rebuilding font info on every page request
- Cache only stores garbage (unusable keys)
- "Consistent font UID" feature completely broken
Evidence:
// No code actually uses the cached pageFontResources successfully
// Every extractSinglePage call rebuilds fonts from scratch
Recommendation:
// Use resource names as keys instead of PDFont objects
Map<Integer, Map<String, String>> pageFontResources = new HashMap<>();
// Key: font resource name (e.g., "F1"), Value: font UID
// Or use font UID directly
Map<Integer, Set<String>> pageFontUids = new HashMap<>();
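A sketch of the first option using PDFBox's PDResources, which already exposes the stable resource names; collectFontIdsForPage is a hypothetical replacement for the existing collectFontsForPage, and buildFontKey is the existing helper:
// Key by the page's font resource name ("F1", "F2", ...), which is stable
// across reloads of the same bytes, unlike PDFont object identity
private Map<String, String> collectFontIdsForPage(PDPage page, int pageNumber)
        throws IOException {
    Map<String, String> resourceNameToUid = new LinkedHashMap<>();
    PDResources resources = page.getResources();
    if (resources == null) {
        return resourceNameToUid;
    }
    for (COSName name : resources.getFontNames()) {
        resourceNameToUid.put(name.getName(), buildFontKey(pageNumber, name.getName()));
    }
    return resourceNameToUid;
}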
Issue #5: Unbounded Thread Creation - CRITICAL Resource Leak
Location: PdfJsonConversionService.java:5550-5562
Severity: CRITICAL (Resource Exhaustion)
Type: Thread Leak, Memory Leak
Verified: ✅ TRUE
Description:
private void scheduleDocumentCleanup(String jobId) {
new Thread(
() -> {
try {
Thread.sleep(TimeUnit.MINUTES.toMillis(30)); // Sleep 30 minutes!
clearCachedDocument(jobId);
log.debug("Auto-cleaned cached document for jobId: {}", jobId);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
})
.start(); // Unmanaged thread!
}
Also in: PdfLazyLoadingService.java:256-269 (duplicate implementation)
Problems:
- One Thread Per Upload: Each PDF upload spawns a new thread
- No Thread Pool: Unlimited thread creation (no cap)
- Long Sleep: Threads sleep for 30 minutes holding resources
- No Cancellation: Cannot stop cleanup threads if job completes early
- Non-Daemon: Threads prevent JVM shutdown (daemon=false by default)
- No Monitoring: No visibility into active cleanup threads
Impact Under Load:
100 concurrent uploads → 100 cleanup threads spawned
Each thread: ~1MB stack + overhead
Total: ~100MB+ wasted on sleeping threads
1000 uploads/day → 1000+ threads accumulate
OS thread limit (ulimit -u) exceeded → OutOfMemoryError: unable to create new native thread
Production Failure Scenario:
09:00 - Peak traffic: 500 PDFs uploaded
09:00 - 500 cleanup threads spawned (each sleeps 30 min)
09:15 - 400 more uploads → 400 more threads
09:30 - First batch wakes up, cleans, exits
But new threads keep getting created faster than old ones die
10:00 - Thread count > 2000
10:15 - JVM hits OS thread limit
10:16 - Server crashes: "OutOfMemoryError: unable to create new native thread"
Recommendation:
@Service
public class PdfJsonConversionService {
// Fixed-size thread pool for cleanup
private final ScheduledExecutorService cleanupScheduler =
Executors.newScheduledThreadPool(
2, // Only 2 threads needed
new ThreadFactoryBuilder()
.setNameFormat("pdf-cache-cleanup-%d")
.setDaemon(true)
.build()
);
private void scheduleDocumentCleanup(String jobId) {
cleanupScheduler.schedule(
() -> clearCachedDocument(jobId),
30,
TimeUnit.MINUTES
);
}
@PreDestroy
public void shutdown() {
cleanupScheduler.shutdown();
try {
if (!cleanupScheduler.awaitTermination(10, TimeUnit.SECONDS)) {
cleanupScheduler.shutdownNow();
}
} catch (InterruptedException e) {
cleanupScheduler.shutdownNow();
Thread.currentThread().interrupt();
}
}
}
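The ScheduledFuture returned by schedule() also addresses the "No Cancellation" bullet above: retaining it (the pendingCleanups field name is illustrative) lets the service cancel or reset cleanup when a job completes early or is re-uploaded:
private final Map<String, ScheduledFuture<?>> pendingCleanups = new ConcurrentHashMap<>();

private void scheduleDocumentCleanup(String jobId) {
    ScheduledFuture<?> future = cleanupScheduler.schedule(
            () -> clearCachedDocument(jobId), 30, TimeUnit.MINUTES);
    ScheduledFuture<?> previous = pendingCleanups.put(jobId, future);
    if (previous != null) {
        previous.cancel(false); // re-upload under the same jobId resets the timer
    }
}
// clearCachedDocument would additionally remove the jobId from pendingCleanups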
HIGH SEVERITY ISSUES
Issue #6: Unbounded Cache Growth - HIGH Memory Leak
Location: PdfJsonConversionService.java:147-148, 154
Severity: HIGH (Memory Exhaustion)
Type: Memory Leak, Missing Eviction Policy
Verified: ✅ TRUE
Description:
// Line 147: Type3 normalized fonts - cleared only in convertJsonToPdf (line 454)
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();
// Line 148: Type3 glyph coverage - NEVER CLEARED anywhere in codebase!
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();
// Line 154: Document cache - relies on buggy cleanup threads
private final Map<String, CachedPdfDocument> documentCache = new ConcurrentHashMap<>();
Growth Patterns:
1. type3NormalizedFontCache:
- Written at line 3766: type3NormalizedFontCache.put(fontModel.getUid(), font)
- Cleared only at line 454: type3NormalizedFontCache.clear() (JSON→PDF conversion)
- NOT cleared during PDF→JSON conversion (the most common operation)
- Each PDFont holds references to its source document's underlying I/O resources
- Grows unbounded during PDF→JSON operations
2. type3GlyphCoverageCache:
- Written at line 1122: type3GlyphCoverageCache.put(fontUid, coverageSet)
- NEVER CLEARED in the entire codebase (verified via grep)
- Accumulates a Set<Integer> for every Type3 font ever processed
- Each Set can contain thousands of integers (Unicode codepoints)
- Pure memory leak
3. documentCache:
- Stores full PDF bytes in memory
- Each entry can be 1MB-100MB+ (document bytes + metadata)
- Relies on cleanup threads (which have issues from Issue #5)
- If cleanup fails (exception, server restart), entries stay forever
- No max size check
Impact:
Long-running server processes 10,000 Type3 fonts:
- type3GlyphCoverageCache: 10,000 entries × ~1KB each = 10MB
- type3NormalizedFontCache: 1,000 cached fonts × ~100KB each = 100MB
- documentCache: 50 active jobs × 10MB each = 500MB
After 1 week: Caches grow to 1GB+
After 1 month: OutOfMemoryError, server restart required
Recommendation:
// Use Caffeine cache with eviction policies
private final Cache<String, PDFont> type3NormalizedFontCache =
Caffeine.newBuilder()
.maximumSize(1000) // Max 1000 fonts
.expireAfterAccess(1, TimeUnit.HOURS) // Expire after 1hr unused
.removalListener((key, value, cause) -> {
// Cleanup PDFont resources if needed
})
.build();
private final Cache<String, Set<Integer>> type3GlyphCoverageCache =
Caffeine.newBuilder()
.maximumSize(5000)
.expireAfterWrite(1, TimeUnit.HOURS)
.build();
private final Cache<String, CachedPdfDocument> documentCache =
Caffeine.newBuilder()
.maximumWeight(500_000_000) // 500MB max
.weigher((String key, CachedPdfDocument doc) ->
doc.getPdfBytes().length)
.expireAfterWrite(30, TimeUnit.MINUTES)
.removalListener((key, value, cause) -> {
log.info("Evicted document {} (cause: {})", key, cause);
})
.build();
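Migration can be incremental: Caffeine's Cache.asMap() exposes a thread-safe ConcurrentMap view, so (assuming the fields above) the call-site changes are small:
// Existing ConcurrentHashMap call sites translate almost one-to-one:
documentCache.put(jobId, cachedDoc);                       // unchanged
CachedPdfDocument doc = documentCache.getIfPresent(jobId); // was documentCache.get(jobId)
documentCache.invalidate(jobId);                           // was documentCache.remove(jobId)
// Or keep Map semantics via the thread-safe view:
ConcurrentMap<String, CachedPdfDocument> view = documentCache.asMap();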
Issue #7: Type3 Cache Race Condition - HIGH
Location: PdfJsonConversionService.java:3759-3773
Severity: HIGH (Duplicate Work)
Type: Check-Then-Act Race Condition
Verified: ✅ TRUE
Description:
private void loadNormalizedType3Font(
PDDocument document,
PdfJsonFont fontModel,
List<FontByteSource> candidates,
String originalFormat) throws IOException {
if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
return;
}
if (type3NormalizedFontCache.containsKey(fontModel.getUid())) { // CHECK
return;
}
for (FontByteSource source : candidates) {
PDFont font = loadFontFromSource(...); // EXPENSIVE: 10-50ms
if (font != null) {
type3NormalizedFontCache.put(fontModel.getUid(), font); // ACT
log.info("Cached normalized font {} for Type3 {}", ...);
break;
}
}
}
Race Condition:
Thread A: Check cache for "1:F1" → MISS (line 3759)
Thread B: Check cache for "1:F1" → MISS (line 3759) [both pass check!]
Thread A: Load font from bytes (10ms I/O + parsing)
Thread B: Load font from bytes (10ms I/O + parsing) ← DUPLICATE WORK
Thread A: Put font in cache (line 3766)
Thread B: Put font in cache (line 3766) [overwrites A's entry]
Why ConcurrentHashMap Doesn't Help:
- ConcurrentHashMap prevents corruption (map state stays consistent)
- ConcurrentHashMap does NOT prevent duplicate work (both threads compute)
- The check (containsKey) and the act (put) are separate, non-atomic operations
Impact:
- Wasted CPU cycles loading same font twice
- Temporary memory spike (two fonts in heap simultaneously)
- Font loading is expensive: Base64 decode + PDFBox parsing + font validation
- Under high concurrency, 10+ threads could all load the same font
Recommendation:
private void loadNormalizedType3Font(...) throws IOException {
if (fontModel.getUid() == null || candidates == null || candidates.isEmpty()) {
return;
}
// Atomic compute-if-absent
type3NormalizedFontCache.computeIfAbsent(fontModel.getUid(), uid -> {
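// Note: computeIfAbsent runs the mapping function at most once per key and
// blocks concurrent callers for the same UID, eliminating the duplicate load;
// returning null leaves the key unmapped so a later request can retry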
for (FontByteSource source : candidates) {
try {
PDFont font = loadFontFromSource(...);
if (font != null) {
log.info("Cached normalized font {} for Type3 {}", ...);
return font;
}
} catch (IOException e) {
log.warn("Failed to load font from {}: {}", source.originLabel(), e.getMessage());
}
}
return null;
});
}
Issue #8: PDDocument Resource Lifecycle - NEEDS INVESTIGATION
Location: PdfJsonConversionService.java:3766, 5158
Severity: UNKNOWN (Requires Investigation)
Type: Unclear Resource Ownership
Verified: ⚠️ SPECULATIVE (No concrete evidence of failure)
Description:
// Line 3766: Cache PDFont created from a PDDocument
try (PDDocument document = ...) {
PDFont font = loadFontFromSource(document, fontModel, source, ...);
type3NormalizedFontCache.put(fontModel.getUid(), font);
} // PDDocument is closed here!
// Later: cached PDFont is used with a DIFFERENT PDDocument
try (PDDocument newDocument = ...) {
PDFont cachedFont = type3NormalizedFontCache.get(fontUid);
// Is cachedFont safe to use after original document closed?
// Does it hold references to freed native resources?
}
Theoretical Concerns:
- Closed Streams: PDDocument.close() releases the document's underlying I/O buffers
- Resource Ties: PDFont may hold references to the source PDDocument's streams
- Stale Access: Using a PDFont after its document closes could read released resources
- Unclear Contract: PDFBox documentation doesn't explicitly address font lifecycle
Current Status:
- ⚠️ NO EVIDENCE OF ACTUAL FAILURES - System appears to work in practice
- ⚠️ NO CRASHES OBSERVED - No segmentation faults or memory corruption reported
- ⚠️ NO MEMORY LEAKS DETECTED - No profiler data showing leaks
- ⚠️ PURELY THEORETICAL CONCERN - Based on API design, not observed behavior
Why This May Actually Be Safe:
- PDFBox may create self-contained PDFont objects
- Font data may be copied rather than referenced
- PDFBox may be designed for this use case
- Current code has been running without apparent issues
Required Investigation:
- PDFBox Source Code Review: Check if PDFont copies or references document data
- Load Testing: Create PDFont, close document, use font in new document
- Memory Profiling: Monitor for native memory leaks over extended runs
- PDFBox Documentation/Forums: Search for guidance on font lifecycle
Recommendation:
- Priority: MEDIUM (needs investigation but not blocking)
- Add monitoring for potential issues
- Test font reuse after document closure explicitly
- If problems found, cache serialized bytes instead of PDFont objects
// Option 1: Cache font bytes instead of PDFont objects
private void cacheType3FontBytes(String fontUid, byte[] fontBytes) {
type3FontBytesCache.put(fontUid, fontBytes);
}
// Option 2: Verify font is safe to use
private PDFont getCachedFont(String fontUid) {
PDFont cached = type3NormalizedFontCache.get(fontUid);
if (cached != null && !isFontValid(cached)) {
log.warn("Cached font {} is invalid, removing", fontUid);
type3NormalizedFontCache.remove(fontUid);
return null;
}
return cached;
}
MEDIUM SEVERITY ISSUES
Issue #9: Full PDF Reload Per Page - MEDIUM Performance
Location: PdfJsonConversionService.java:5158-5159
Severity: MEDIUM (Performance)
Type: Inefficient I/O
Verified: ✅ TRUE
Description:
// extractSinglePage method
try (PDDocument document = pdfDocumentFactory.load(cached.getPdfBytes(), true)) {
// Full PDF loaded from bytes (10-100ms for large PDFs)
PDPage page = document.getPage(pageIndex);
// Extract just one page...
}
Problem: Every page request loads the entire PDF from bytes:
- 100-page PDF = Load 10MB, extract 1 page
- 10 page requests = 10× full PDF loads (100MB I/O)
- No incremental parsing or streaming
Impact:
100MB PDF, 50 pages requested sequentially:
- Total I/O: 100MB × 50 = 5GB
- Time: 50× parse time (5-10 seconds total)
- Memory: 100MB peak per request
Concurrent page requests for same PDF:
- 10 threads × 100MB = 1GB temporary memory spike
Why This Exists: Lazy loading design trades memory (don't cache full extraction) for CPU (reload on demand). But the tradeoff is poor because:
- PDFBox parsing is expensive
- Repeated decompression of streams
- Could cache extracted page data instead
Recommendation:
// Option 1: Cache extracted page data
private static class CachedPage {
List<PdfJsonTextElement> textElements;
List<PdfJsonImageElement> imageElements;
// ... other page data
}
Map<String, Map<Integer, CachedPage>> pageCache = ...;
// Option 2: Keep PDF open with RandomAccessFile
private static class CachedPdfDocument {
private final RandomAccessReadBufferedFile randomAccess;
private final PDDocument document; // Keep open!
}
// Option 3: Pre-split pages at upload time
// Store each page as separate lightweight JSON blob
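A caveat for Option 2: PDDocument is not thread-safe, so a kept-open document needs per-document serialization. A hedged sketch of what that could look like:
private static class CachedPdfDocument {
    private final PDDocument document; // kept open for the cache entry's lifetime
    private final ReentrantLock lock = new ReentrantLock();

    <T> T withDocument(Function<PDDocument, T> work) {
        lock.lock(); // PDDocument is not thread-safe; serialize page extraction
        try {
            return work.apply(document);
        } finally {
            lock.unlock();
        }
    }
}
The lock trades concurrency on one document for avoiding repeated full parses; concurrent requests for different documents still proceed in parallel.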
Issue #10: Large Base64 Operations - MEDIUM Performance
Location: PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630
Severity: MEDIUM (Performance Bottleneck)
Type: Synchronous Blocking Operation
Verified: ✅ TRUE
Description:
// Encode large font programs
String base64 = Base64.getEncoder().encodeToString(fontBytes); // 10MB → 13MB
// Decode large font programs
byte[] bytes = Base64.getDecoder().decode(pdfProgram); // 13MB → 10MB
Problem:
- Large fonts (embedded TrueType, Type3) can be 5-10MB
- Base64 encoding inflates size by ~33%
- All encoding/decoding is synchronous on request threads
- CPU-intensive operation (20-50ms for 10MB)
Impact:
100 concurrent requests processing 10MB fonts:
- Each request: 30ms CPU time for Base64
- All threads blocked on encoding simultaneously
- Thread pool saturation (if using fixed-size pool)
- Other requests starved waiting for threads
Large PDF with 50 fonts:
- 50 × 30ms = 1.5 seconds just for Base64 operations
- User perceives slowness
Recommendation:
// Option 1: Size limits
private static final int MAX_FONT_SIZE = 10 * 1024 * 1024; // 10MB
if (fontBytes.length > MAX_FONT_SIZE) {
throw new IllegalArgumentException("Font too large: " + fontBytes.length);
}
// Option 2: Streaming Base64 (for very large files)
try (OutputStream base64Out = Base64.getEncoder().wrap(outputStream)) {
    inputStream.transferTo(base64Out); // close() flushes the final padding bytes
}
// Option 3: Async processing
CompletableFuture<String> encodeFuture = CompletableFuture.supplyAsync(
() -> Base64.getEncoder().encodeToString(fontBytes),
fontEncodingExecutor
);
Issue #11: File I/O on Request Threads - MEDIUM
Location: PdfJsonConversionService.java:276, 405, 5066
Severity: MEDIUM (Performance)
Type: Blocking I/O
Verified: ⚠️ PARTIALLY TRUE
Description:
// Line 276: Write upload to disk
file.transferTo(originalFile.getFile());
// Line 405: Read full file into memory
byte[] cachedPdfBytes = Files.readAllBytes(workingPath);
// Line 5066: Get uploaded file bytes
byte[] pdfBytes = file.getBytes();
Clarification:
- These are in DIFFERENT methods (not double-reads within one operation)
- Each method reads the file once
- Still synchronous blocking I/O
Impact:
- Large uploads (100MB) block request thread for seconds
- No async or streaming support
- Thread pool saturation under high upload volume
Recommendation:
// Async file I/O (moves blocking upload handling off the request thread)
CompletableFuture<Path> uploadFuture = CompletableFuture.supplyAsync(
    () -> {
        try {
            Path tempPath = Files.createTempFile("pdf-upload", ".pdf");
            file.transferTo(tempPath.toFile());
            return tempPath;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // Supplier cannot throw checked exceptions
        }
    },
    fileIoExecutor
);
// Stream large files
try (InputStream in = file.getInputStream();
OutputStream out = Files.newOutputStream(targetPath)) {
in.transferTo(out);
}
LOW SEVERITY ISSUES
Issue #12: PdfLazyLoadingService Unused - LOW
Location: PdfLazyLoadingService.java (entire file)
Severity: LOW (Code Quality)
Type: Dead Code
Verified: ✅ TRUE
Description:
- Complete service implementation exists
- Has its own documentCache and cleanup logic
- Duplicates functionality in PdfJsonConversionService
- Not wired to any controller
- Not imported by any other class
Impact:
- Code maintenance burden
- Confusing for developers
- Potential for accidental use in future
- Cache divergence if both ever get used
Recommendation:
// Delete PdfLazyLoadingService.java entirely
// Or clearly mark as @Deprecated with explanation
Issue #13: PdfJsonFontService Volatile Fields - LOW
Location: PdfJsonFontService.java:46-47
Severity: LOW (Actually Correct)
Type: None (Good Practice)
Verified: ✅ TRUE (No issue, correctly implemented)
Description:
private volatile boolean pythonCffConverterAvailable;
private volatile boolean fontForgeCffConverterAvailable;
@PostConstruct
private void initialiseCffConverterAvailability() {
pythonCffConverterAvailable = isCommandAvailable(pythonCommand);
fontForgeCffConverterAvailable = isCommandAvailable(fontforgeCommand);
}
Why This Is Correct:
- volatile ensures visibility across threads
- Set once at startup
- Read many times (thread-safe)
- No synchronization needed
Recommendation: None - this is good practice.
VERIFIED FALSE CLAIMS
Claim: file.getBytes() Called Twice
Status: ❌ FALSE
Explanation: The claim stated that file.getBytes() is called twice (lines 446, 5065). Investigation shows:
- Line 446: convertJsonToPdf method
- Line 5065: extractDocumentMetadata method
- These are DIFFERENT methods for DIFFERENT operations
- Each method calls getBytes() only once
Conclusion: Not a double-read issue.
Claim: Image Base64 Encoding Per Call
Status: ❌ FALSE
Explanation: The claim stated images are Base64-encoded on every call. Investigation shows:
// PdfJsonImageService.java:430-450
private EncodedImage getOrEncodeImage(PDImage pdImage) {
COSBase key = xObject.getCOSObject();
EncodedImage cached = imageCache.get(key); // Cache check!
if (cached != null) {
return cached; // Cache hit
}
EncodedImage encoded = encodeImage(pdImage);
imageCache.put(key, encoded); // Cache miss, encode and store
return encoded;
}
Conclusion: Images ARE cached. Only stencil and inline images bypass cache.
ARCHITECTURE ISSUES
Issue #14: Singleton Service Architecture - MEDIUM
Location: All @Service and @Component classes
Severity: MEDIUM (Maintainability)
Type: Architectural Pattern
Description: All services use default singleton scope:
@Service // Defaults to singleton
public class PdfJsonConversionService {
// Shared instance variables across all requests
private final Map<String, PDFont> type3NormalizedFontCache = ...;
}
Implications:
✅ Good:
- Most dependencies are stateless and injected
- Caches use ConcurrentHashMap (thread-safe)
- No mutable instance variables beyond caches
⚠️ Risks:
- Singleton means shared state across all requests
- Requires careful synchronization
- Easy for future developers to introduce thread-unsafe code
- Difficult to test concurrent scenarios
Recommendation:
- Document thread-safety requirements prominently
- Add unit tests for concurrent access
- Consider request-scoped services for mutable state
- Code review checklist for new instance variables
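For the request-scoped option, a hedged sketch using Spring's request scope (class and field names are illustrative); injecting the scoped proxy into the singleton service gives request-local mutable state without changing the service's own scope:
@Service
@Scope(value = WebApplicationContext.SCOPE_REQUEST, proxyMode = ScopedProxyMode.TARGET_CLASS)
public class PdfJsonRequestState {
    // One instance per HTTP request: mutable state here cannot race across requests
    private final Map<String, PdfJsonFont> workingFonts = new LinkedHashMap<>();

    Map<String, PdfJsonFont> workingFonts() {
        return workingFonts;
    }
}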
SUMMARY BY SEVERITY
CRITICAL (Fix Immediately)
- ✅ User-supplied jobId (Issue #1) - Security vulnerability
- ✅ Mutable cache maps (Issue #2) - Data corruption
- ✅ Font UID collisions (Issue #3) - Cache corruption
- ✅ pageFontResources PDFont keys (Issue #4) - Broken feature
- ✅ Unbounded thread creation (Issue #5) - Resource exhaustion
HIGH (Fix Soon)
- ✅ Unbounded cache growth (Issue #6) - Memory leak
- ✅ Type3 cache race (Issue #7) - Duplicate work
- ⚠️ PDDocument lifecycle (Issue #8) - Needs investigation (speculative)
MEDIUM (Plan and Address)
- ✅ Full PDF reload per page (Issue #9) - Performance
- ✅ Large Base64 operations (Issue #10) - Performance
- ✅ File I/O blocking (Issue #11) - Performance
- ✅ Singleton architecture (Issue #14) - Maintainability
LOW (Monitor)
- ✅ PdfLazyLoadingService unused (Issue #12) - Dead code (rated LOW in its section)
- ✅ PdfJsonFontService volatile (Issue #13) - Correctly implemented
VERIFIED FALSE
- ❌ file.getBytes() called twice
- ❌ Image Base64 encoding per call
IMPLEMENTATION ROADMAP
Phase 1: Critical Security & Data Integrity (1-2 weeks)
1. Fix jobId Security (Issue #1)
// Priority: CRITICAL
// Time: 2 days
// Risk: Low (straightforward fix)
// Generate server-side UUIDs
// Add session/user binding
// Validate ownership on all cache operations
2. Fix Cache Mutation (Issues #2, #3, #4)
// Priority: CRITICAL
// Time: 3-5 days
// Risk: Medium (requires careful testing)
// Make CachedPdfDocument immutable
// Scope font caches per-job
// Replace PDFont keys with String resource names
// Add defensive copying
Phase 2: Resource Management (1 week)
3. Fix Thread Leaks (Issue #5)
// Priority: CRITICAL
// Time: 1 day
// Risk: Low (well-understood solution)
// Replace new Thread() with ScheduledExecutorService
// Add @PreDestroy cleanup
// Monitor thread counts
4. Add Cache Eviction (Issue #6)
// Priority: HIGH
// Time: 3 days
// Risk: Low (library-based solution)
// Integrate Caffeine cache
// Set size limits, TTL
// Add eviction logging
// Monitor cache metrics
Phase 3: Performance Optimization (2-3 weeks)
5. Optimize Lazy Loading (Issue #9)
// Priority: MEDIUM
// Time: 1 week
// Risk: Medium (requires benchmarking)
// Cache extracted page data
// Or: Keep PDDocument open with RandomAccessFile
// Or: Pre-split pages at upload
6. Async I/O (Issues #10, #11)
// Priority: MEDIUM
// Time: 3-5 days
// Risk: Medium (requires async architecture changes)
// Add dedicated I/O thread pool
// Async file operations
// Stream large files
Phase 4: Code Quality (1 week)
7. Remove Dead Code (Issue #12)
// Priority: LOW
// Time: 1 day
// Risk: None
// Delete PdfLazyLoadingService
// Clean up unused imports
8. Documentation & Testing
// Priority: MEDIUM
// Time: 3-5 days
// Add thread-safety documentation
// Concurrent integration tests
// Load testing scripts
TESTING STRATEGY
1. Concurrency Tests
@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class ConcurrencyTest {
@Test
void testConcurrentCacheAccess() throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(20);
CountDownLatch latch = new CountDownLatch(100);
List<Future<?>> futures = new ArrayList<>();
// 100 requests across 10 jobIds (10 requests per job)
for (int i = 0; i < 100; i++) {
String jobId = "job-" + (i % 10);
int pageNum = (i % 5) + 1;
futures.add(executor.submit(() -> {
try {
service.extractSinglePage(jobId, pageNum);
} catch (Exception e) {
log.error("Concurrent access failed", e);
throw e;
} finally {
latch.countDown();
}
}));
}
// Wait for completion
assertTrue(latch.await(60, TimeUnit.SECONDS));
// Check for exceptions
for (Future<?> future : futures) {
future.get(); // Throws if any task failed
}
}
@Test
void testCacheNotCorrupted() throws Exception {
// Upload document
String jobId = "test-job";
service.extractDocumentMetadata(testPdf, jobId);
// Concurrent page requests
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Future<byte[]>> futures = new ArrayList<>();
for (int i = 0; i < 50; i++) {
int page = (i % 10) + 1;
futures.add(executor.submit(() ->
service.extractSinglePage(jobId, page)));
}
// All should succeed without ConcurrentModificationException
for (Future<byte[]> future : futures) {
assertNotNull(future.get());
}
}
}
2. Memory Leak Tests
@Test
void testCacheDoesNotGrowUnbounded() throws Exception {
long initialHeap = getHeapUsage();
// Process 10,000 small PDFs with Type3 fonts
for (int i = 0; i < 10000; i++) {
service.convertPdfToJson(createTestPdfWithType3Fonts());
}
// Force GC
System.gc();
Thread.sleep(1000);
long finalHeap = getHeapUsage();
long growth = finalHeap - initialHeap;
// Cache should not grow beyond reasonable limit
assertThat(growth).isLessThan(100_000_000); // 100MB max
}
@Test
void testThreadsNotLeaking() {
int initialThreads = getActiveThreadCount();
// Upload 100 PDFs (spawns 100 cleanup threads)
for (int i = 0; i < 100; i++) {
service.extractDocumentMetadata(testPdf, "job-" + i);
}
int peakThreads = getActiveThreadCount();
// Should not create 100+ threads
assertThat(peakThreads - initialThreads).isLessThan(10);
}
private long getHeapUsage() {
Runtime runtime = Runtime.getRuntime();
return runtime.totalMemory() - runtime.freeMemory();
}
private int getActiveThreadCount() {
return Thread.getAllStackTraces().size();
}
3. Security Tests
@Test
void testJobIdIsolation() {
// User A uploads PDF
String jobIdA = service.extractDocumentMetadata(userAPdf, sessionA);
// User B tries to access User A's jobId
assertThrows(AccessDeniedException.class, () -> {
service.extractSinglePage(jobIdA, 1, sessionB);
});
}
@Test
void testJobIdUnpredictable() {
List<String> jobIds = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    jobIds.add(service.extractDocumentMetadata(testPdf, session));
}
// All jobIds should be unique UUIDs
assertThat(new HashSet<>(jobIds)).hasSize(1000);
// And not sequential: sorted order should differ from issue order
List<String> sorted = new ArrayList<>(jobIds);
Collections.sort(sorted);
assertThat(sorted).isNotEqualTo(jobIds);
}
4. Performance Tests
@Test
void testLargeFilePerformance() {
// 100MB PDF
byte[] largePdf = createLargePdf(100 * 1024 * 1024);
long start = System.currentTimeMillis();
String json = service.convertPdfToJson(largePdf);
long duration = System.currentTimeMillis() - start;
// Should complete in reasonable time
assertThat(duration).isLessThan(30_000); // 30 seconds
}
@Test
void testConcurrentThroughput() throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(50);
CountDownLatch latch = new CountDownLatch(500);
long start = System.currentTimeMillis();
for (int i = 0; i < 500; i++) {
executor.submit(() -> {
try {
service.convertPdfToJson(testPdf);
} finally {
latch.countDown();
}
});
}
latch.await();
long duration = System.currentTimeMillis() - start;
// 500 conversions should complete in reasonable time
double throughput = 500.0 / (duration / 1000.0);
assertThat(throughput).isGreaterThan(10); // At least 10 conversions/sec
}
MONITORING & METRICS
Recommended Metrics (Micrometer)
@Service
public class PdfJsonConversionService {
private final MeterRegistry meterRegistry;
// Cache size gauges
@PostConstruct
void registerMetrics() {
Gauge.builder("pdf.cache.document.size", documentCache, Map::size)
.description("Number of cached documents")
.register(meterRegistry);
Gauge.builder("pdf.cache.type3font.size", type3NormalizedFontCache, Map::size)
.description("Number of cached Type3 fonts")
.register(meterRegistry);
Gauge.builder("pdf.cache.coverage.size", type3GlyphCoverageCache, Map::size)
.description("Number of cached glyph coverage sets")
.register(meterRegistry);
Gauge.builder("pdf.threads.cleanup", this::getCleanupThreadCount)
.description("Active cleanup threads")
.register(meterRegistry);
}
// Operation timers
public String convertPdfToJson(byte[] pdfBytes) {
Timer.Sample sample = Timer.start(meterRegistry);
try {
String result = doConvertPdfToJson(pdfBytes);
sample.stop(meterRegistry.timer("pdf.convert.toJson"));
return result;
} catch (Exception e) {
meterRegistry.counter("pdf.convert.errors", "operation", "toJson").increment();
throw e;
}
}
// Cache hit/miss counters
private PDFont getCachedType3Font(String uid) {
PDFont cached = type3NormalizedFontCache.get(uid);
if (cached != null) {
meterRegistry.counter("pdf.cache.type3font.hits").increment();
} else {
meterRegistry.counter("pdf.cache.type3font.misses").increment();
}
return cached;
}
}
Alerts
alerts:
# Cache growth
- name: DocumentCacheTooLarge
condition: pdf_cache_document_size > 100
severity: warning
- name: Type3CacheTooLarge
condition: pdf_cache_type3font_size > 1000
severity: warning
# Thread leaks
- name: TooManyCleanupThreads
condition: pdf_threads_cleanup > 10
severity: critical
# Memory pressure
- name: HeapUsageHigh
condition: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
severity: warning
# Performance
- name: SlowConversions
condition: pdf_convert_toJson{quantile="0.95"} > 10s
severity: warning
# Error rate
- name: HighErrorRate
condition: rate(pdf_convert_errors[5m]) > 0.1
severity: critical
CONCLUSION
The PDF JSON editor has CRITICAL issues that must be fixed before production deployment:
Must-Fix Issues (Blocks Production):
- User-supplied jobId - Security vulnerability enabling cache poisoning and information disclosure
- Mutable cache maps - Causes ConcurrentModificationException under load
- Font UID collisions - Different documents overwrite each other's font caches
- pageFontResources broken - Lazy page loading completely broken
- Thread leaks - Unbounded thread creation causes OutOfMemoryError
Should-Fix Issues (Prevents Scale):
- Unbounded cache growth - Memory leaks require server restarts
- Type3 cache races - Wasted CPU doing duplicate work
- PDDocument lifecycle - Needs investigation (no evidence of actual problems yet)
Performance Improvements (Nice-to-Have):
- Full PDF reload per page
- Large Base64 operations
- Synchronous file I/O
Code Quality Issues:
- PdfLazyLoadingService dead code
- Documentation of thread-safety requirements
Estimated Effort:
- Critical fixes: 2-3 weeks
- High priority: 1 week
- Performance: 2-3 weeks
- Total: 5-7 weeks for complete remediation
Recommendation: Fix Critical issues immediately, then address High priority issues before beta testing.