# OCR Language Packs and Setup
This document provides instructions on how to add additional language packs for the OCR tab in Stirling-PDF, both inside and outside of Docker.
## My OCR used to work and now doesn't!
The tessdata location has changed in newer Docker images. Please use `/usr/share/tessdata`. Older paths are kept for backward compatibility, but they are not guaranteed to keep working.
## How Does the OCR Work?
Stirling-PDF uses [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF), which in turn uses Tesseract for its text recognition. All credit goes to them for this awesome work!
## Language Packs
Tesseract OCR supports a variety of languages. You can find additional language packs in the Tesseract GitHub repositories:
- [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast): These language packs are smaller and faster to load but may provide lower recognition accuracy.
- [tessdata](https://github.com/tesseract-ocr/tessdata): These language packs are larger and provide better recognition accuracy, but may take longer to load.

Depending on your requirements, you can choose the appropriate language pack for your use case. By default, Stirling-PDF uses `tessdata_fast` for English, but this can be replaced.
### Installing Language Packs
1. Download the desired language pack(s) by selecting the `.traineddata` file(s) for the language(s) you need.
2. Place the `.traineddata` files in the Tesseract tessdata directory: `/usr/share/tessdata`

**DO NOT REMOVE EXISTING `eng.traineddata`, IT'S REQUIRED.**
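
The two steps above could be scripted roughly as follows. This is only a sketch: the helper names `traineddata_url` and `install_langpack` are hypothetical, and the URL layout (files at the root of each repository's `main` branch) is an assumption about the Tesseract repositories rather than something this guide specifies.

```bash
# Build the raw download URL for a language pack.
# Assumption: the .traineddata files sit at the root of the repo's `main` branch.
# $1 = repository name (tessdata_fast or tessdata), $2 = language code (e.g. deu)
traineddata_url() {
  printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$1" "$2"
}

# Download one language pack into a tessdata directory.
# $1 = language code, $2 = target directory (defaults to /usr/share/tessdata)
install_langpack() {
  dest="${2:-/usr/share/tessdata}"
  # -f: fail on HTTP errors, -L: follow redirects, -o: write to the given file
  curl -fL -o "$dest/$1.traineddata" "$(traineddata_url tessdata_fast "$1")"
}
```

For example, `install_langpack deu /location/of/trainingData` would fetch the German pack into a host folder. Writing to `/usr/share/tessdata` directly will usually need root permissions.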
### Docker Setup
If you are using Docker, you need to expose the Tesseract tessdata directory as a volume in order to use the additional language packs.
#### Docker Compose
Modify your `docker-compose.yml` file to include the following volume configuration:
```yaml
services:
  your_service_name:
    image: your_docker_image_name
    volumes:
      - /location/of/trainingData:/usr/share/tessdata
```
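A bind mount shows the host folder's contents in place of whatever the image ships at that path, so the host folder most likely needs its own copy of `eng.traineddata`. The sketch below checks a folder before you start the container. The function names are hypothetical and not part of Stirling-PDF.

```bash
# Succeeds only if the folder contains the required English pack.
# $1 = host folder that will be mounted at /usr/share/tessdata
tessdata_dir_ok() {
  [ -f "$1/eng.traineddata" ]
}

# Print the language codes found in the folder, one per line.
list_langpacks() {
  for f in "$1"/*.traineddata; do
    [ -e "$f" ] || continue        # skip the unexpanded glob when the folder is empty
    basename "$f" .traineddata
  done
}
```

A possible use is `tessdata_dir_ok /location/of/trainingData || echo "eng.traineddata is missing"` before running `docker compose up`.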
#### Docker Run
Add the following to your existing Docker run command:
```bash
-v /location/of/trainingData:/usr/share/tessdata
```
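Once the container is running with the volume mounted, one way to confirm that a language pack is visible is to ask Tesseract for its language list. This sketch assumes a container named `stirling-pdf` (a placeholder) and that the `tesseract` binary is on the container's `PATH`. The helper `has_lang` is hypothetical.

```bash
# Reads `tesseract --list-langs` output on stdin and succeeds if the
# given language code appears on a line of its own.
# $1 = language code (e.g. deu)
has_lang() {
  grep -qxF -- "$1"
}

# Hypothetical usage against a running container:
# docker exec stirling-pdf tesseract --list-langs | has_lang deu && echo "deu is available"
```

The header line that Tesseract prints before the list does not match a bare language code, so it is ignored by the whole-line comparison.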
### Non-Docker Setup
If you are not using Docker, you need to install the OCR components yourself, including the `ocrmypdf` app. See the [OCRmyPDF install guide](https://ocrmypdf.readthedocs.io/en/latest/installation.html) for details.

For Debian-based systems, install languages with these commands:
```bash
# Refresh the package index first
sudo apt update
# All languages
# sudo apt install -y 'tesseract-ocr-*'
# Find languages:
apt search tesseract-ocr-
# View installed languages:
dpkg-query -W 'tesseract-ocr-*' | sed 's/tesseract-ocr-//g'
```
For Fedora:
```bash
# All languages
# sudo dnf install -y 'tesseract-langpack-*'
# Find languages:
dnf search -C tesseract-langpack-
# View installed languages:
rpm -qa | grep tesseract-langpack | sed 's/tesseract-langpack-//g'
```
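To install a single language instead of all of them, the package name can be derived from the Tesseract language code. The helper below is a hypothetical sketch. It assumes Debian package names replace underscores with hyphens (for example `chi_sim` becomes `tesseract-ocr-chi-sim`) while Fedora keeps the code unchanged, so confirm the exact name with the search commands above before installing.

```bash
# Print the package name for a language code on a given distro family.
# $1 = debian | fedora, $2 = Tesseract language code (e.g. deu, chi_sim)
langpack_package() {
  case "$1" in
    debian) printf 'tesseract-ocr-%s\n' "$(printf '%s' "$2" | tr '_' '-')" ;;
    fedora) printf 'tesseract-langpack-%s\n' "$2" ;;
    *) echo "unsupported distro: $1" >&2; return 1 ;;
  esac
}

# Hypothetical usage:
# sudo apt install -y "$(langpack_package debian deu)"
# sudo dnf install -y "$(langpack_package fedora deu)"
```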