Add default languages to OCR, fix compression for QPDF and embedded images (#3202)

# Description of Changes

This pull request includes several changes to the codebase, focusing on
enhancing OCR support, improving endpoint management, and adding new
functionality for PDF compression. The most important changes are
detailed below.

### Enhancements to OCR support:

* `Dockerfile` and `Dockerfile.fat`: Added support for multiple new OCR
languages including Chinese (Simplified), German, French, and
Portuguese. (Our top 5 languages including English)
[[1]](diffhunk://#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R69-R72)
[[2]](diffhunk://#diff-571631582b988e88c52c86960cc083b0b8fa63cf88f056f26e9e684195221c27L78-R81)

### Improvements to endpoint management:

*
[`src/main/java/stirling/software/SPDF/config/EndpointConfiguration.java`](diffhunk://#diff-750f31f6ecbd64b025567108a33775cad339e835a04360affff82a09410b697dR51-R66):
Added a new method `isGroupEnabled` to check if a group of endpoints is
enabled.
*
[`src/main/java/stirling/software/SPDF/config/EndpointConfiguration.java`](diffhunk://#diff-750f31f6ecbd64b025567108a33775cad339e835a04360affff82a09410b697dL179-L193):
Updated endpoint groups and removed redundant qpdf endpoints.
[[1]](diffhunk://#diff-750f31f6ecbd64b025567108a33775cad339e835a04360affff82a09410b697dL179-L193)
[[2]](diffhunk://#diff-750f31f6ecbd64b025567108a33775cad339e835a04360affff82a09410b697dL243-L244)
*
[`src/main/java/stirling/software/SPDF/config/EndpointInspector.java`](diffhunk://#diff-845de13e140bb1264014539714860f044405274ad2a9481f38befdd1c1333818R1-R291):
Introduced a new `EndpointInspector` class to discover and validate GET
endpoints dynamically.

### New functionality for PDF compression:

*
[`src/main/java/stirling/software/SPDF/controller/api/misc/CompressController.java`](diffhunk://#diff-c307589e9f958f2593c9567c5ad9d63cd03788aa4803b3017b1c13b0d0485805R10):
Enhanced the `CompressController` to handle nested images within form
XObjects, improving the accuracy of image compression in PDFs.
Remove Compresses Dependency on QPDF
[[1]](diffhunk://#diff-c307589e9f958f2593c9567c5ad9d63cd03788aa4803b3017b1c13b0d0485805R10)
[[2]](diffhunk://#diff-c307589e9f958f2593c9567c5ad9d63cd03788aa4803b3017b1c13b0d0485805R28-R44)
[[3]](diffhunk://#diff-c307589e9f958f2593c9567c5ad9d63cd03788aa4803b3017b1c13b0d0485805L49-R61)
[[4]](diffhunk://#diff-c307589e9f958f2593c9567c5ad9d63cd03788aa4803b3017b1c13b0d0485805R77-R99)
[[5]](diff
hunk://#diff-c307589e9f958f2593c9567c5ad9d63cd03788aa4803b3017b1c13b0d0485805L92-R191)
Closes #(issue_number)

---

## Checklist

### General

- [ ] I have read the [Contribution
Guidelines](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/CONTRIBUTING.md)
- [ ] I have read the [Stirling-PDF Developer
Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/DeveloperGuide.md)
(if applicable)
- [ ] I have read the [How to add new languages to
Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/HowToAddNewLanguage.md)
(if applicable)
- [ ] I have performed a self-review of my own code
- [ ] My changes generate no new warnings

### Documentation

- [ ] I have updated relevant docs on [Stirling-PDF's doc
repo](https://github.com/Stirling-Tools/Stirling-Tools.github.io/blob/main/docs/)
(if functionality has heavily changed)
- [ ] I have read the section [Add New Translation
Tags](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/HowToAddNewLanguage.md#add-new-translation-tags)
(for new translation tags only)

### UI Changes (if applicable)

- [ ] Screenshots or videos demonstrating the UI changes are attached
(e.g., as comments or direct attachments in the PR)

### Testing (if applicable)

- [ ] I have tested my changes locally. Refer to the [Testing
Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/DeveloperGuide.md#6-testing)
for more details.

---------

Co-authored-by: a <a>
This commit is contained in:
Anthony Stirling
2025-03-20 09:39:57 +00:00
committed by GitHub
parent 748ac494e6
commit d8cca66560
15 changed files with 840 additions and 321 deletions

View File

@@ -77,7 +77,7 @@ public class CustomPDFDocumentFactory {
}
long fileSize = file.length();
log.info("Loading PDF from file, size: {}MB", fileSize / (1024 * 1024));
log.debug("Loading PDF from file, size: {}MB", fileSize / (1024 * 1024));
return loadAdaptively(file, fileSize);
}
@@ -92,7 +92,7 @@ public class CustomPDFDocumentFactory {
}
long fileSize = Files.size(path);
log.info("Loading PDF from file, size: {}MB", fileSize / (1024 * 1024));
log.debug("Loading PDF from file, size: {}MB", fileSize / (1024 * 1024));
return loadAdaptively(path.toFile(), fileSize);
}
@@ -104,7 +104,7 @@ public class CustomPDFDocumentFactory {
}
long dataSize = input.length;
log.info("Loading PDF from byte array, size: {}MB", dataSize / (1024 * 1024));
log.debug("Loading PDF from byte array, size: {}MB", dataSize / (1024 * 1024));
return loadAdaptively(input, dataSize);
}
@@ -150,7 +150,7 @@ public class CustomPDFDocumentFactory {
long actualFreeMemory = maxMemory - usedMemory;
// Log memory status
log.info(
log.debug(
"Memory status - Free: {}MB ({}%), Used: {}MB, Max: {}MB",
actualFreeMemory / (1024 * 1024),
String.format("%.2f", freeMemoryPercent),
@@ -160,21 +160,21 @@ public class CustomPDFDocumentFactory {
// If free memory is critically low, always use file-based caching
if (freeMemoryPercent < MIN_FREE_MEMORY_PERCENTAGE
|| actualFreeMemory < MIN_FREE_MEMORY_BYTES) {
log.info(
log.debug(
"Low memory detected ({}%), forcing file-based cache",
String.format("%.2f", freeMemoryPercent));
return createScratchFileCacheFunction(MemoryUsageSetting.setupTempFileOnly());
} else if (contentSize < SMALL_FILE_THRESHOLD) {
log.info("Using memory-only cache for small document ({}KB)", contentSize / 1024);
log.debug("Using memory-only cache for small document ({}KB)", contentSize / 1024);
return IOUtils.createMemoryOnlyStreamCache();
} else if (contentSize < LARGE_FILE_THRESHOLD) {
// For medium files (10-50MB), use a mixed approach
log.info(
log.debug(
"Using mixed memory/file cache for medium document ({}MB)",
contentSize / (1024 * 1024));
return createScratchFileCacheFunction(MemoryUsageSetting.setupMixed(LARGE_FILE_USAGE));
} else {
log.info("Using file-based cache for large document");
log.debug("Using file-based cache for large document");
return createScratchFileCacheFunction(MemoryUsageSetting.setupTempFileOnly());
}
}
@@ -237,7 +237,7 @@ public class CustomPDFDocumentFactory {
byte[] bytes, long size, StreamCacheCreateFunction cache, String password)
throws IOException {
if (size >= SMALL_FILE_THRESHOLD) {
log.info("Writing large byte array to temp file for password-protected PDF");
log.debug("Writing large byte array to temp file for password-protected PDF");
Path tempFile = createTempFile("pdf-bytes-");
Files.write(tempFile, bytes);
@@ -261,7 +261,6 @@ public class CustomPDFDocumentFactory {
removePassword(doc);
}
private PDDocument loadFromFile(File file, long size, StreamCacheCreateFunction cache)
throws IOException {
return Loader.loadPDF(new DeletingRandomAccessFile(file), "", null, null, cache);
@@ -270,7 +269,7 @@ public class CustomPDFDocumentFactory {
private PDDocument loadFromBytes(byte[] bytes, long size, StreamCacheCreateFunction cache)
throws IOException {
if (size >= SMALL_FILE_THRESHOLD) {
log.info("Writing large byte array to temp file");
log.debug("Writing large byte array to temp file");
Path tempFile = createTempFile("pdf-bytes-");
Files.write(tempFile, bytes);
@@ -318,7 +317,7 @@ public class CustomPDFDocumentFactory {
// Temp file handling with enhanced logging
private Path createTempFile(String prefix) throws IOException {
Path file = Files.createTempFile(prefix + tempCounter.incrementAndGet() + "-", ".tmp");
log.info("Created temp file: {}", file);
log.debug("Created temp file: {}", file);
return file;
}

View File

@@ -4,6 +4,8 @@ import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
@@ -11,22 +13,32 @@ import org.springframework.stereotype.Service;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.search.Search;
import stirling.software.SPDF.config.EndpointInspector;
@Service
public class MetricsAggregatorService {
private static final Logger logger = LoggerFactory.getLogger(MetricsAggregatorService.class);
private final MeterRegistry meterRegistry;
private final PostHogService postHogService;
private final EndpointInspector endpointInspector;
private final Map<String, Double> lastSentMetrics = new ConcurrentHashMap<>();
@Autowired
public MetricsAggregatorService(MeterRegistry meterRegistry, PostHogService postHogService) {
public MetricsAggregatorService(
MeterRegistry meterRegistry,
PostHogService postHogService,
EndpointInspector endpointInspector) {
this.meterRegistry = meterRegistry;
this.postHogService = postHogService;
this.endpointInspector = endpointInspector;
}
@Scheduled(fixedRate = 7200000) // Run every 2 hours
public void aggregateAndSendMetrics() {
Map<String, Object> metrics = new HashMap<>();
final boolean validateGetEndpoints = endpointInspector.getValidGetEndpoints().size() != 0;
Search.in(meterRegistry)
.name("http.requests")
.counters()
@@ -34,35 +46,52 @@ public class MetricsAggregatorService {
counter -> {
String method = counter.getId().getTag("method");
String uri = counter.getId().getTag("uri");
// Skip if either method or uri is null
if (method == null || uri == null) {
return;
}
// Skip URIs that are 2 characters or shorter
if (uri.length() <= 2) {
return;
}
// Skip non-GET and non-POST requests
if (!"GET".equals(method) && !"POST".equals(method)) {
return;
}
// Skip URIs that are 2 characters or shorter
if (uri.length() <= 2) {
// For POST requests, only include if they start with /api/v1
if ("POST".equals(method) && !uri.contains("api/v1")) {
return;
}
if (uri.contains(".txt")) {
return;
}
// For GET requests, validate if we have a list of valid endpoints
if ("GET".equals(method)
&& validateGetEndpoints
&& !endpointInspector.isValidGetEndpoint(uri)) {
logger.debug("Skipping invalid GET endpoint: {}", uri);
return;
}
String key =
String.format(
"http_requests_%s_%s", method, uri.replace("/", "_"));
double currentCount = counter.count();
double lastCount = lastSentMetrics.getOrDefault(key, 0.0);
double difference = currentCount - lastCount;
if (difference > 0) {
logger.info("{}, {}", key, difference);
metrics.put(key, difference);
lastSentMetrics.put(key, currentCount);
}
});
// Send aggregated metrics to PostHog
if (!metrics.isEmpty()) {
postHogService.captureEvent("aggregated_metrics", metrics);
}
}