docs update

This commit is contained in:
Anthony Stirling 2025-11-10 10:16:57 +00:00
parent a7f17947df
commit a1b69b4b8b
3 changed files with 99 additions and 296 deletions

View File

@ -5216,7 +5216,6 @@ public class PdfJsonConversionService {
* Stores PDF bytes for lazy page loading. Each page is extracted on-demand by re-loading the
* PDF from bytes.
*/
@lombok.Data
private static class CachedPdfDocument {
private final byte[] pdfBytes;
private final PdfJsonDocumentMetadata metadata;
@ -5243,6 +5242,27 @@ public class PdfJsonConversionService {
this.timestamp = System.currentTimeMillis();
}
// Getters return defensive copies to prevent external mutation
public byte[] getPdfBytes() {
return pdfBytes;
}
public PdfJsonDocumentMetadata getMetadata() {
return metadata;
}
public Map<String, PdfJsonFont> getFonts() {
return new java.util.concurrent.ConcurrentHashMap<>(fonts);
}
public Map<Integer, Map<PDFont, String>> getPageFontResources() {
return new java.util.concurrent.ConcurrentHashMap<>(pageFontResources);
}
public long getTimestamp() {
return timestamp;
}
public CachedPdfDocument withUpdatedPdfBytes(byte[] nextBytes) {
return withUpdatedFonts(nextBytes, this.fonts);
}

View File

@ -39,3 +39,18 @@
- `applyWeightStyle()`: Applies appropriate suffix (handles both "italic" and "oblique" naming)
- All fonts consolidated from Type3 library into main fonts directory for unified fallback support
- Benefits: Comprehensive visual consistency when editing text in bold/italic fonts across many font families
- **Font Text Color Support**
- Add support for reading and preserving text color information from PDF content streams
- Enable color editing in the editor UI
- Ensure proper round-trip conversion maintains color fidelity
- **Space Character Handling**
- Improve handling of space characters as proper text elements
- Ensure spaces are correctly preserved during text extraction and reconstruction
- Fix any issues with space positioning and width calculations
- **Textbox Selection Enhancement**
- Improve textbox selection behavior in the editor
- Enhance user experience for selecting and manipulating text boxes
- Address any selection precision or interaction issues

View File

@ -10,243 +10,21 @@
This analysis identifies **CRITICAL** security vulnerabilities, thread safety issues, and performance problems in the PDF JSON editor codebase. The service contains:
- **1 CRITICAL security vulnerability** (cache poisoning/information disclosure)
- **3 CRITICAL threading issues** causing data corruption
- **2 CRITICAL issues** requiring immediate attention
- **2 HIGH severity resource leaks** causing memory exhaustion
- **Multiple performance bottlenecks** limiting scalability
**Immediate Action Required:**
1. Fix user-supplied jobId security vulnerability (Issue #1)
2. Fix cache mutation race conditions (Issues #2, #3, #4)
3. Replace unbounded thread spawning (Issue #5)
4. Add cache size limits (Issue #6)
1. Fix pageFontResources PDFont key issue (Issue #1)
2. Replace unbounded thread spawning (Issue #2)
3. Add cache size limits (Issue #3)
4. Fix Type3 cache race condition (Issue #4)
---
## CRITICAL ISSUES
### Issue #1: User-Supplied jobId - CRITICAL SECURITY VULNERABILITY ⚠️
**Location:** `ConvertPdfJsonController.java:95, 120, 148, 160`
**Severity:** CRITICAL (Security Vulnerability)
**Type:** Cache Poisoning, Information Disclosure, Access Control Bypass
**Verified:** ✅ TRUE
**Description:**
```java
// Line 95
public ResponseEntity<byte[]> extractPdfMetadata(
@ModelAttribute PDFFile request,
@RequestParam(required = true) String jobId) // User-controlled!
// Lines 120, 148, 160
@PathVariable String jobId // User-controlled!
```
**Security Issues:**
1. **Information Disclosure:** Users can guess/enumerate jobIds to access other users' cached PDFs
2. **Cache Poisoning:** Users can overwrite other users' cache entries by supplying the same jobId
3. **No Authentication/Authorization:** No check to verify jobId ownership
4. **Predictable IDs:** If users choose sequential IDs ("job1", "job2"), enumeration is trivial
**Attack Scenarios:**
```
User A: Uploads sensitive PDF with jobId="company-financials"
User B: Calls /pdf/json/metadata/company-financials
-> Gets User A's PDF metadata
User B: Calls /pdf/json/page/company-financials/1
-> Gets User A's PDF page content
```
**Impact:**
- **Confidentiality Breach:** Unauthorized access to other users' documents
- **Data Integrity:** Users can corrupt each other's cached documents
- **Cache Collision:** Legitimate users get wrong data when IDs collide
**Recommendation:**
```java
// Server-generated UUIDs with user/session binding
@PostMapping("/pdf/json/metadata")
public ResponseEntity<byte[]> extractPdfMetadata(
@ModelAttribute PDFFile request,
HttpSession session) {
String jobId = UUID.randomUUID().toString();
String userId = session.getId(); // or authenticated user ID
// Store with composite key
String cacheKey = userId + ":" + jobId;
// Validate ownership on retrieval
if (!validateOwnership(cacheKey, session)) {
throw new AccessDeniedException("Invalid jobId");
}
return service.extractDocumentMetadata(request, cacheKey);
}
```
---
### Issue #2: Mutable Cache Maps - CRITICAL Race Condition
**Location:** `PdfJsonConversionService.java:5009-5037, 4844-4852, 5170-5176`
**Severity:** CRITICAL (Data Corruption)
**Type:** Race Condition, Cache Mutation
**Verified:** ✅ TRUE
**Description:**
```java
// Line 5010-5037: CachedPdfDocument stores mutable maps
@lombok.Data
private static class CachedPdfDocument {
private final Map<String, PdfJsonFont> fonts; // Mutable reference!
private final Map<Integer, Map<PDFont, String>> pageFontResources; // Mutable!
}
// Line 5170-5176: Cached maps passed directly to stripper
TextCollectingStripper stripper = new TextCollectingStripper(
document,
cached.getFonts(), // ← Mutable map shared across threads!
textByPage,
cached.getPageFontResources(),
new IdentityHashMap<>());
// Line 4844-4852: registerFont() mutates the shared map
private String registerFont(PDFont font) throws IOException {
if (!fonts.containsKey(key)) {
fonts.put(key, buildFontModel(...)); // ← MUTATION!
}
return fontId;
}
```
**Race Condition:**
```
Time | Thread A (extractSinglePage job1, page1) | Thread B (extractSinglePage job1, page2)
-----|-------------------------------------------|------------------------------------------
T1 | Gets cached.getFonts() -> Map@123 |
T2 | | Gets cached.getFonts() -> Map@123 (same!)
T3 | registerFont() checks containsKey("1:F1") |
T4 | | registerFont() checks containsKey("2:F1")
T5 | fonts.put("1:F1", fontModel) |
T6 | | fonts.put("2:F1", fontModel)
T7 | Iterator over fonts |
T8 | | Concurrent modification! ← EXCEPTION
```
**Impact:**
- `ConcurrentModificationException` thrown to users
- Font metadata corruption (partial writes visible)
- Cache inconsistency between requests
- Non-deterministic failures under load
**Evidence of Mutation:**
- `@lombok.Data` generates `getFonts()` that returns the same mutable reference
- No defensive copying when retrieving from cache
- Multiple threads call `extractSinglePage` with same jobId simultaneously
**Recommendation:**
```java
// Option 1: Deep copy on retrieval
private static class CachedPdfDocument {
public Map<String, PdfJsonFont> getFonts() {
return new LinkedHashMap<>(fonts); // Defensive copy
}
}
// Option 2: Immutable maps
private static class CachedPdfDocument {
private final ImmutableMap<String, PdfJsonFont> fonts;
}
// Option 3: Synchronize mutations (less performant)
private synchronized String registerFont(PDFont font) { ... }
```
---
### Issue #3: Font UID Not Globally Unique - CRITICAL Cache Collision
**Location:** `PdfJsonConversionService.java:715-722, 961, 147-148`
**Severity:** CRITICAL (Cache Corruption)
**Type:** Cache Key Collision, Memory Leak
**Verified:** ✅ TRUE
**Description:**
```java
// Line 715-722: UID construction
private String buildFontKey(int pageNumber, String fontId) {
return pageNumber + ":" + fontId; // Only unique within one document!
}
// Line 961: Used as cache key
.uid(buildFontKey(pageNumber, fontId))
// Line 147-148: Global singleton caches
private final Map<String, PDFont> type3NormalizedFontCache = new ConcurrentHashMap<>();
private final Map<String, Set<Integer>> type3GlyphCoverageCache = new ConcurrentHashMap<>();
```
**Problem:**
Font UIDs are only unique within a single document. The format is `"pageNumber:fontId"` (e.g., `"1:F1"`). When multiple jobs run concurrently:
```
Job A (invoice.pdf): Font UID = "1:F1" (Times-Roman)
Job B (report.pdf): Font UID = "1:F1" (Helvetica)
Both jobs share the SAME cache entry!
```
**Race Condition:**
```
T1: Job A converts PDF with font "1:F1" (Times-Roman)
T2: Job A caches PDFont@AAA in type3NormalizedFontCache["1:F1"]
T3: Job B converts PDF with font "1:F1" (Helvetica)
T4: Job B finds cache hit for "1:F1"
T5: Job B uses Times-Roman font (WRONG!) ← DATA CORRUPTION
```
**Memory Leak:**
```java
// PDFont objects hold references to their source PDDocument
PDFont cachedFont = ...; // Created from documentA
documentA.close(); // Document closed
// cachedFont still references freed native resources!
```
**Impact:**
- Wrong fonts used across different documents
- PDFBox native memory leaks (PDDocument held by cached PDFont)
- Cache grows unbounded (never cleared except during JSON→PDF)
- Type3GlyphCoverageCache **NEVER CLEARED** anywhere in codebase
**Recommendation:**
```java
// Include jobId in font UID
private String buildFontKey(String jobId, int pageNumber, String fontId) {
return jobId + ":" + pageNumber + ":" + fontId;
}
// Or scope caches per-job
private static class JobContext {
private final String jobId;
private final Map<String, PDFont> type3FontCache = new HashMap<>();
private final Map<String, Set<Integer>> glyphCoverageCache = new HashMap<>();
}
```
---
### Issue #4: pageFontResources Keyed by PDFont Instances - CRITICAL
### Issue #1: pageFontResources Keyed by PDFont Instances - CRITICAL
**Location:** `PdfJsonConversionService.java:5075-5081, 5158-5159`
@ -314,7 +92,7 @@ Map<Integer, Set<String>> pageFontUids = new HashMap<>();
---
### Issue #5: Unbounded Thread Creation - CRITICAL Resource Leak
### Issue #2: Unbounded Thread Creation - CRITICAL Resource Leak
**Location:** `PdfJsonConversionService.java:5550-5562`
@ -414,7 +192,7 @@ public class PdfJsonConversionService {
## HIGH SEVERITY ISSUES
### Issue #6: Unbounded Cache Growth - HIGH Memory Leak
### Issue #3: Unbounded Cache Growth - HIGH Memory Leak
**Location:** `PdfJsonConversionService.java:147-148, 154`
@ -502,7 +280,7 @@ private final Cache<String, CachedPdfDocument> documentCache =
---
### Issue #7: Type3 Cache Race Condition - HIGH
### Issue #4: Type3 Cache Race Condition - HIGH
**Location:** `PdfJsonConversionService.java:3759-3773`
@ -584,7 +362,7 @@ private void loadNormalizedType3Font(...) throws IOException {
---
### Issue #8: PDDocument Resource Lifecycle - NEEDS INVESTIGATION
### Issue #5: PDDocument Resource Lifecycle - NEEDS INVESTIGATION
**Location:** `PdfJsonConversionService.java:3766, 5158`
@ -662,7 +440,7 @@ private PDFont getCachedFont(String fontUid) {
## MEDIUM SEVERITY ISSUES
### Issue #9: Full PDF Reload Per Page - MEDIUM Performance
### Issue #6: Full PDF Reload Per Page - MEDIUM Performance
**Location:** `PdfJsonConversionService.java:5158-5159`
@ -728,7 +506,7 @@ private static class CachedPdfDocument {
---
### Issue #10: Large Base64 Operations - MEDIUM Performance
### Issue #7: Large Base64 Operations - MEDIUM Performance
**Location:** `PdfJsonConversionService.java:1062, 1428, 3570, 3584, 3612, 3630`
@ -788,7 +566,7 @@ CompletableFuture<String> encodeFuture = CompletableFuture.supplyAsync(
---
### Issue #11: File I/O on Request Threads - MEDIUM
### Issue #8: File I/O on Request Threads - MEDIUM
**Location:** `PdfJsonConversionService.java:276, 405, 5066`
@ -843,7 +621,7 @@ try (InputStream in = file.getInputStream();
## LOW SEVERITY ISSUES
### Issue #12: PdfLazyLoadingService Unused - LOW
### Issue #9: PdfLazyLoadingService Unused - LOW
**Location:** `PdfLazyLoadingService.java` (entire file)
@ -874,7 +652,7 @@ try (InputStream in = file.getInputStream();
---
### Issue #13: PdfJsonFontService Volatile Fields - LOW
### Issue #10: PdfJsonFontService Volatile Fields - LOW
**Location:** `PdfJsonFontService.java:46-47`
@ -947,7 +725,7 @@ private EncodedImage getOrEncodeImage(PDImage pdImage) {
## ARCHITECTURE ISSUES
### Issue #14: Singleton Service Architecture - MEDIUM
### Issue #11: Singleton Service Architecture - MEDIUM
**Location:** All `@Service` and `@Component` classes
@ -988,63 +766,45 @@ public class PdfJsonConversionService {
## SUMMARY BY SEVERITY
### CRITICAL (Fix Immediately)
1. ✅ **User-supplied jobId** (Issue #1) - Security vulnerability
2. ✅ **Mutable cache maps** (Issue #2) - Data corruption
3. ✅ **Font UID collisions** (Issue #3) - Cache corruption
4. ✅ **pageFontResources PDFont keys** (Issue #4) - Broken feature
5. ✅ **Unbounded thread creation** (Issue #5) - Resource exhaustion
1. **pageFontResources PDFont keys** - Broken feature
2. **Unbounded thread creation** - Resource exhaustion
### HIGH (Fix Soon)
6. ✅ **Unbounded cache growth** (Issue #6) - Memory leak
7. ✅ **Type3 cache race** (Issue #7) - Duplicate work
8. ⚠️ **PDDocument lifecycle** (Issue #8) - Needs investigation (speculative)
3. **Unbounded cache growth** - Memory leak
4. **Type3 cache race** - Duplicate work
5. **PDDocument lifecycle** - Needs investigation (speculative)
### MEDIUM (Plan and Address)
9. ✅ **Full PDF reload per page** (Issue #9) - Performance
10. ✅ **Large Base64 operations** (Issue #10) - Performance
11. ✅ **File I/O blocking** (Issue #11) - Performance
12. ✅ **PdfLazyLoadingService unused** (Issue #12) - Dead code
13. ✅ **Singleton architecture** (Issue #14) - Maintainability
6. **Full PDF reload per page** - Performance
7. **Large Base64 operations** - Performance
8. **File I/O blocking** - Performance
9. **PdfLazyLoadingService unused** - Dead code
10. **Singleton architecture** - Maintainability
### LOW (Monitor)
14. ✅ **PdfJsonFontService volatile** (Issue #13) - Correctly implemented
11. **PdfJsonFontService volatile** - Correctly implemented (no action needed)
### VERIFIED FALSE
15. ❌ file.getBytes() called twice
16. ❌ Image Base64 encoding per call
12. file.getBytes() called twice
13. Image Base64 encoding per call
---
## IMPLEMENTATION ROADMAP
### Phase 1: Critical Security & Data Integrity (1-2 weeks)
### Phase 1: Critical Fixes (1-2 weeks)
**1. Fix jobId Security (Issue #1)**
```java
// Priority: CRITICAL
// Time: 2 days
// Risk: Low (straightforward fix)
// Generate server-side UUIDs
// Add session/user binding
// Validate ownership on all cache operations
```
**2. Fix Cache Mutation (Issues #2, #3, #4)**
**1. Fix pageFontResources PDFont keys (Issue #1)**
```java
// Priority: CRITICAL
// Time: 3-5 days
// Risk: Medium (requires careful testing)
// Make CachedPdfDocument immutable
// Scope font caches per-job
// Replace PDFont keys with String resource names
// Add defensive copying
// Update cache lookup logic
```
### Phase 2: Resource Management (1 week)
**3. Fix Thread Leaks (Issue #5)**
**2. Fix Thread Leaks (Issue #2)**
```java
// Priority: CRITICAL
// Time: 1 day
@ -1055,7 +815,9 @@ public class PdfJsonConversionService {
// Monitor thread counts
```
**4. Add Cache Eviction (Issue #6, previously also listed as Issue #11)**
### Phase 2: Resource Management (1 week)
**3. Add Cache Eviction (Issue #3)**
```java
// Priority: HIGH
// Time: 3 days
@ -1067,9 +829,18 @@ public class PdfJsonConversionService {
// Monitor cache metrics
```
**4. Fix Type3 Cache Race (Issue #4)**
```java
// Priority: HIGH
// Time: 1-2 days
// Risk: Low (straightforward fix)
// Use computeIfAbsent for atomic operations
```
### Phase 3: Performance Optimization (2-3 weeks)
**5. Optimize Lazy Loading (Issue #9)**
**5. Optimize Lazy Loading (Issue #6)**
```java
// Priority: MEDIUM
// Time: 1 week
@ -1080,7 +851,7 @@ public class PdfJsonConversionService {
// Or: Pre-split pages at upload
```
**6. Async I/O (Issues #10, #12)**
**6. Async I/O (Issues #7, #8)**
```java
// Priority: MEDIUM
// Time: 3-5 days
@ -1093,7 +864,7 @@ public class PdfJsonConversionService {
### Phase 4: Code Quality (1 week)
**7. Remove Dead Code (Issue #12)**
**7. Remove Dead Code (Issue #9)**
```java
// Priority: LOW
// Time: 1 day
@ -1402,30 +1173,27 @@ alerts:
The PDF JSON editor has **CRITICAL** issues that must be fixed before production deployment:
### Must-Fix Issues (Blocks Production):
1. **User-supplied jobId** - Security vulnerability enabling cache poisoning and information disclosure
2. **Mutable cache maps** - Causes ConcurrentModificationException under load
3. **Font UID collisions** - Different documents overwrite each other's font caches
4. **pageFontResources broken** - Lazy page loading completely broken
5. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError
1. **pageFontResources broken** - Lazy page loading completely broken (PDFont instances as keys)
2. **Thread leaks** - Unbounded thread creation causes OutOfMemoryError
### Should-Fix Issues (Prevents Scale):
6. **Unbounded cache growth** - Memory leaks require server restarts
7. **Type3 cache races** - Wasted CPU doing duplicate work
8. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet)
3. **Unbounded cache growth** - Memory leaks require server restarts
4. **Type3 cache races** - Wasted CPU doing duplicate work
5. **PDDocument lifecycle** - Needs investigation (no evidence of actual problems yet)
### Performance Improvements (Nice-to-Have):
9. Full PDF reload per page
10. Large Base64 operations
11. Synchronous file I/O
6. Full PDF reload per page
7. Large Base64 operations
8. Synchronous file I/O
### Code Quality Issues:
12. PdfLazyLoadingService dead code
13. Documentation of thread-safety requirements
9. PdfLazyLoadingService dead code
10. Documentation of thread-safety requirements
**Estimated Effort:**
- Critical fixes: 2-3 weeks
- High priority: 1 week
- Performance: 2-3 weeks
- **Total:** 5-7 weeks for complete remediation
- Critical fixes: 1-2 weeks (Issues #1, #2)
- High priority: 1 week (Issues #3, #4)
- Performance: 2-3 weeks (Issues #6-8)
- **Total:** 4-6 weeks for complete remediation
**Recommendation:** Fix Critical issues immediately, then address High priority issues before beta testing.
**Recommendation:** Fix critical issues (#1, #2) immediately, then address high priority issues before beta testing.