# OCR Language Packs and Setup

This document provides instructions on how to add additional language packs for the OCR tab in Stirling-PDF, both inside and outside of Docker.

## My OCR used to work and now doesn't!

The tessdata paths have changed on new Docker images. Please use `/usr/share/tessdata`. Other paths may still work for backward compatibility, but this is not guaranteed.

## How Does the OCR Work?

Stirling-PDF uses [qpdf](https://github.com/qpdf/qpdf), which in turn uses Tesseract for its text recognition. All credit goes to them for this awesome work!

## Language Packs

Tesseract OCR supports a variety of languages. You can find additional language packs in the Tesseract GitHub repositories:

- [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast): These language packs are smaller and faster to load but may provide lower recognition accuracy.
- [tessdata](https://github.com/tesseract-ocr/tessdata): These language packs are larger and provide better recognition accuracy, but may take longer to load.

Depending on your requirements, you can choose the appropriate language pack for your use case. By default, Stirling-PDF uses `tessdata_fast` for English, but this can be replaced.

### Installing Language Packs

1. Download the desired language pack(s) by selecting the `.traineddata` file(s) for the language(s) you need.
2. Place the `.traineddata` files in the Tesseract tessdata directory: `/usr/share/tessdata`

**DO NOT REMOVE EXISTING `eng.traineddata`, IT'S REQUIRED.**
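
The two steps above could be scripted roughly as follows. This is a minimal sketch, not part of Stirling-PDF: the helper names `langpack_url` and `install_langpack` are hypothetical, and the raw-download URL layout and the `main` branch name of the Tesseract repositories are assumptions that should be checked against the repositories themselves.

```bash
#!/usr/bin/env bash
# Sketch: fetch one language pack into a tessdata directory.
# Assumptions (not from this guide): the raw-download URL layout and the
# `main` branch name of the tesseract-ocr repositories.

# Print the download URL for a language code (e.g. deu) from the given
# repository (tessdata_fast or tessdata).
langpack_url() {
  local repo="$1" lang="$2"
  printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$repo" "$lang"
}

# Download one language pack into the target directory. Only the file
# named <lang>.traineddata is written; other files are left alone.
install_langpack() {
  local repo="$1" lang="$2" dest="$3"
  curl -fL --output "$dest/$lang.traineddata" "$(langpack_url "$repo" "$lang")"
}
```

For example, `install_langpack tessdata_fast deu /usr/share/tessdata` would fetch the German pack. Writing to that directory typically requires root privileges.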

### Docker Setup

If you are using Docker, you need to expose the Tesseract tessdata directory as a volume in order to use the additional language packs.

#### Docker Compose

Modify your `docker-compose.yml` file to include the following volume configuration:

```yaml
services:
  your_service_name:
    image: your_docker_image_name
    volumes:
      - /location/of/trainingData:/usr/share/tessdata
```

#### Docker Run

Add the following to your existing Docker run command:

```bash
-v /location/of/trainingData:/usr/share/tessdata
```
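
A bind mount replaces what the container sees at `/usr/share/tessdata`, so the host directory itself needs to contain `eng.traineddata` along with any additional packs. The sketch below checks this before the container is started. The helper names `tessdata_dir_ready` and `tessdata_volume_flag` are hypothetical and only illustrate the idea.

```bash
#!/usr/bin/env bash
# Sketch: check a host directory before mounting it over /usr/share/tessdata.

# Succeeds only if the directory exists and contains eng.traineddata.
tessdata_dir_ready() {
  local dir="$1"
  [ -d "$dir" ] && [ -f "$dir/eng.traineddata" ]
}

# Print the volume flag for `docker run`, or an error if the check fails.
tessdata_volume_flag() {
  local dir="$1"
  if tessdata_dir_ready "$dir"; then
    printf -- '-v %s:/usr/share/tessdata\n' "$dir"
  else
    echo "missing eng.traineddata in $dir" >&2
    return 1
  fi
}
```

Running `tessdata_volume_flag /location/of/trainingData` prints the flag to paste into the Docker run command, or reports that the English pack is missing.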

### Non-Docker Setup

If you are not using Docker, you need to install the OCR components, including the `qpdf` app. See the [qpdf install guide](https://qpdf.readthedocs.io/en/latest/installation.html).

For Debian-based systems, use these commands to install, find, and list language packs:

```bash
# Refresh the package index
sudo apt update

# All languages
# sudo apt install -y 'tesseract-ocr-*'

# Find languages:
apt search tesseract-ocr-

# View installed languages:
dpkg-query -W 'tesseract-ocr-*' | sed 's/tesseract-ocr-//g'
```

For Fedora:

```bash
# All languages
# sudo dnf install -y 'tesseract-langpack-*'

# Find languages:
dnf search -C tesseract-langpack-

# View installed languages:
rpm -qa | grep tesseract-langpack | sed 's/tesseract-langpack-//g'
```
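
To install a single language rather than every pack, the package name can be built from the prefixes used in the commands above. This is a small sketch with a hypothetical helper, `langpack_package`. It assumes simple three-letter codes such as `deu`. Codes containing an underscore may be packaged under a slightly different name, so confirm the exact name with the search commands first.

```bash
#!/usr/bin/env bash
# Sketch: build the package name for one language on Debian or Fedora.
# The prefixes come from the commands above; the helper itself is hypothetical.
langpack_package() {
  local distro="$1" lang="$2"
  case "$distro" in
    debian) printf 'tesseract-ocr-%s\n' "$lang" ;;
    fedora) printf 'tesseract-langpack-%s\n' "$lang" ;;
    *) echo "unsupported distro: $distro" >&2; return 1 ;;
  esac
}
```

For example, `sudo apt install -y "$(langpack_package debian deu)"` on Debian, or `sudo dnf install -y "$(langpack_package fedora deu)"` on Fedora.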

For Windows:

Ensure qpdf is installed with
``pip install qpdf``

Additional languages must be downloaded manually:

1. Download the desired `.traineddata` files from [tessdata](https://github.com/tesseract-ocr/tessdata) or [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast).
2. Place them in the `tessdata` folder within your Tesseract installation directory (e.g., `C:\Program Files\Tesseract-OCR\tessdata`).

Verify installation:
``tesseract --list-langs``

You must then edit your ``/configs/settings.yml`` and set `system.tessdataDir` to the directory containing the language files:

```yaml
system:
  # Path to the directory containing the tessdata files. Adjust it to
  # wherever the files are stored on your Windows system.
  tessdataDir: C:/Program Files/Tesseract-OCR/tessdata
```
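
The `tesseract --list-langs` check can also be scripted to confirm that one specific language is available. The sketch below uses a hypothetical helper, `has_lang`, and assumes a bash shell (on Windows, for example, Git Bash). It reads the listing from standard input and strips carriage returns so Windows line endings do not break the match.

```bash
#!/usr/bin/env bash
# Sketch: check whether a language code appears in `tesseract --list-langs` output.
# Reads the listing from stdin so it can be tried without Tesseract installed.
has_lang() {
  local lang="$1"
  # tr removes carriage returns; grep -x matches whole lines only and
  # -F treats the language code as a literal string.
  tr -d '\r' | grep -qxF -- "$lang"
}
```

A typical call would be `tesseract --list-langs | has_lang deu && echo "deu is available"`.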