From 9ac260ee9293c997d6514a48d7746e3b27ccfdad Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bal=C3=A1zs=20Sz=C3=BCcs?= <127139797+balazs-szucs@users.noreply.github.com>
Date: Tue, 3 Mar 2026 20:06:46 +0100
Subject: [PATCH] feat(aot): add aot-diagnostics.sh for AOT cache diagnostics and validation (#5848)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

# Description of Changes

This pull request makes significant improvements to the Docker build process for the embedded Stirling-PDF image, focusing on build efficiency, runtime optimization, and maintainability. Key changes include upgrading major tool versions, introducing optional stripping of Calibre's WebEngine to reduce image size, consolidating ImageMagick layers, and refining the Python environment build process. The runtime image is now leaner, with clearer separation between build and runtime dependencies, and improved caching for faster builds and pulls.

**Build and Dependency Management Improvements**

* Upgraded Calibre to version `9.4.0` and added support for the `TARGETPLATFORM` build argument for multi-platform builds.
* Added an optional `CALIBRE_STRIP_WEBENGINE` build argument to strip Chromium/WebEngine from Calibre, saving ~80 MB when PDF output via Calibre is not needed.
* Consolidated ImageMagick outputs into a single staging directory (`/magick-export`) to reduce Docker layers and improve caching efficiency.
* Refactored Python virtual environment build: now built in a dedicated stage with pre-built wheels and copied into the runtime image, eliminating the need for build tools and pip installs at runtime.

**Runtime Image Optimization**

* Reduced installed system packages to only what is needed at runtime; Python build tools and dev packages are no longer included.
* Cleaned up unnecessary runtime files, including removal of build-only Python artifacts and system headers, for a smaller and more secure image.

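
As a rough usage sketch of the optional build argument described above, the helper below only prints a `docker buildx build` command and does not run it. The function name `stirling_build_cmd` and the image tag `stirling-pdf:local` are hypothetical and chosen for illustration; the Dockerfile path and the `CALIBRE_STRIP_WEBENGINE` argument come from this patch, and `TARGETPLATFORM` is populated by BuildKit from `--platform`.

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch): prints a build command for the
# embedded image. Nothing is executed; copy the printed line to run it.
# Arguments:
#   $1  "true" strips Calibre's WebEngine, anything else keeps it (default: false)
#   $2  target platform passed to --platform (default: linux/amd64)
stirling_build_cmd() {
    strip="${1:-false}"
    platform="${2:-linux/amd64}"
    # Mirror the Dockerfile check: only the literal string "true" enables stripping.
    if [ "$strip" != "true" ]; then
        strip="false"
    fi
    printf 'docker buildx build --platform %s --build-arg CALIBRE_STRIP_WEBENGINE=%s -f docker/embedded/Dockerfile -t stirling-pdf:local .\n' \
        "$platform" "$strip"
}

# Example: print an ARM64 build command with WebEngine stripped.
stirling_build_cmd true linux/arm64
```

With the argument left at its default, WebEngine is kept and Calibre PDF output stays available; setting it to `true` trades that feature for the roughly 80 MB saving mentioned above.
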
**Layer and Copy Optimization**

* Switched to `COPY --link` for all major external tool layers and application files, enabling independent layer caching and parallel pulls for faster builds.

**Runtime Configuration and Health**

* Improved runtime directory structure and permissions, added persistent cache directories for Project Leyden AOT, and wrote the version tag to `/etc/stirling_version` for easier script access.
* Updated the healthcheck to wait longer for startup and increased timeout/retries for more robust readiness detection.

---

## Checklist

### General

- [ ] I have read the [Contribution Guidelines](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/CONTRIBUTING.md)
- [ ] I have read the [Stirling-PDF Developer Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/DeveloperGuide.md) (if applicable)
- [ ] I have read the [How to add new languages to Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/HowToAddNewLanguage.md) (if applicable)
- [ ] I have performed a self-review of my own code
- [ ] My changes generate no new warnings

### Documentation

- [ ] I have updated relevant docs on [Stirling-PDF's doc repo](https://github.com/Stirling-Tools/Stirling-Tools.github.io/blob/main/docs/) (if functionality has heavily changed)
- [ ] I have read the section [Add New Translation Tags](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/HowToAddNewLanguage.md#add-new-translation-tags) (for new translation tags only)

### Translations (if applicable)

- [ ] I ran [`scripts/counter_translation.py`](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/docs/counter_translation.md)

### UI Changes (if applicable)

- [ ] Screenshots or videos demonstrating the UI changes are attached (e.g., as comments or direct attachments in the PR)

### Testing (if applicable)

- [ ] I have tested my changes locally.
Refer to the [Testing Guide](https://github.com/Stirling-Tools/Stirling-PDF/blob/main/devGuide/DeveloperGuide.md#6-testing) for more details. --------- Signed-off-by: Balázs Szücs --- docker/embedded/Dockerfile | 169 +++++---- docker/embedded/Dockerfile.fat | 7 +- docker/embedded/Dockerfile.ultra-lite | 5 +- scripts/aot-diagnostics.sh | 380 +++++++++++++++++++ scripts/init-without-ocr.sh | 520 ++++++++++++++++++-------- 5 files changed, 856 insertions(+), 225 deletions(-) create mode 100755 scripts/aot-diagnostics.sh diff --git a/docker/embedded/Dockerfile b/docker/embedded/Dockerfile index de2aff562..49cc58bbb 100644 --- a/docker/embedded/Dockerfile +++ b/docker/embedded/Dockerfile @@ -1,8 +1,9 @@ # Stirling-PDF - Full version (embedded frontend) FROM ubuntu:noble AS calibre-build - -ARG CALIBRE_VERSION=9.3.1 +ARG TARGETPLATFORM +ARG CALIBRE_VERSION=9.4.0 +ARG CALIBRE_STRIP_WEBENGINE=false RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ set -eux; \ @@ -27,7 +28,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ tar xJf /tmp/calibre.txz -C /opt/calibre; \ rm /tmp/calibre.txz; \ \ - # We only need Qt6 WebEngine (Chromium) for ebook→PDF output. + # We only need Qt6 WebEngine (Chromium) for ebook->PDF output. # PDF INPUT now uses the pdftohtml engine (poppler), not Qt. rm -f /opt/calibre/lib/libQt6Designer* \ /opt/calibre/lib/libQt6Multimedia* \ @@ -229,7 +230,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ find /opt/calibre -name '*.pyc' -delete 2>/dev/null || true; \ \ # ── Verify conversion still works ── - # NOTE: txt→epub used intentionally NOT txt→pdf. + # NOTE: txt->epub used intentionally NOT txt->pdf. # Calibre 7+ uses WebEngine (Chromium) for PDF output, which requires kernel # capabilities unavailable in Docker RUN steps and segfaults under QEMU. # epub output exercises the same Python/plugin stack without touching WebEngine. 
@@ -242,6 +243,21 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ pdftohtml -v >/dev/null 2>&1 && echo "pdftohtml OK" || { echo "ERROR: pdftohtml not found"; exit 1; }; \ echo "=== Calibre stripped successfully ===" +# Optional: strip Chromium/WebEngine (~80 MB savings) when PDF output via Calibre is not needed. +# Build with --build-arg CALIBRE_STRIP_WEBENGINE=true to enable. +RUN if [ "${CALIBRE_STRIP_WEBENGINE}" = "true" ]; then \ + echo "Stripping Calibre WebEngine (Chromium), PDF output via Calibre will be disabled"; \ + rm -rf /opt/calibre/lib/qt6/libexec/QtWebEngineProcess \ + /opt/calibre/lib/qt6/resources \ + /opt/calibre/lib/libQt6WebEngine*.so.* \ + /opt/calibre/lib/libQt6Quick*.so.* \ + /opt/calibre/lib/libQt6Qml*.so.* \ + /opt/calibre/translations/qtwebengine_locales 2>/dev/null || true; \ + echo "WebEngine stripped, Calibre PDF output disabled"; \ + else \ + echo "CALIBRE_STRIP_WEBENGINE=false, keeping WebEngine for PDF output"; \ + fi + # Build the Java application and frontend. FROM gradle:9.3.1-jdk25 AS app-build @@ -294,6 +310,7 @@ RUN java -Djarmode=tools -jar app.jar extract --layers --destination /layers # Build Ghostscript 10.06.0 from source in an isolated stage (avoids library conflicts). FROM ubuntu:noble AS gs-build +ARG TARGETPLATFORM ARG GS_VERSION=10.06.0 RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ --mount=type=cache,target=/tmp/gs-build,id=gs-build-${TARGETPLATFORM:-local} \ @@ -316,6 +333,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ # Build PDF Tools (QPDF and ImageMagick 7). FROM ubuntu:noble AS pdf-tools-build +ARG TARGETPLATFORM ARG QPDF_VERSION=12.3.2 ARG IM_VERSION=7.1.2-13 RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ @@ -346,6 +364,44 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ cd .. 
&& \ ldconfig /usr/local/lib +# Stage ImageMagick outputs into a single directory so runtime can import them with one COPY +# (reduces 4 separate COPY layers to 1 independent --link layer). +RUN mkdir -p /magick-export/usr/bin \ + /magick-export/usr/local/lib \ + /magick-export/usr/local/etc && \ + cp /usr/local/bin/magick /magick-export/usr/bin/ && \ + cp -a /usr/local/lib/libMagick*.so* /magick-export/usr/local/lib/ && \ + cp -a /usr/local/lib/ImageMagick-7* /magick-export/usr/local/lib/ && \ + cp -a /usr/local/etc/ImageMagick-7 /magick-export/usr/local/etc/ + + +# Build Python venv in an isolated stage so runtime image never needs build tools. +# Packages with native extensions (opencv, cryptography) use pre-built wheels (--prefer-binary). +# python3-uno is intentionally NOT installed here, it is a system package in the runtime stage +# and accessed via --system-site-packages at runtime. +FROM ubuntu:noble AS python-venv-build +ARG TARGETPLATFORM +ARG UNOSERVER_VERSION=3.6 + +RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ + apt-get update && apt-get install -y --no-install-recommends \ + python3 python3-venv ca-certificates binutils && \ + rm -rf /var/lib/apt/lists/* + +RUN --mount=type=cache,target=/root/.cache/pip,sharing=locked \ + python3 -m venv /opt/venv --system-site-packages && \ + /opt/venv/bin/pip install --no-cache-dir --prefer-binary \ + weasyprint pdf2image opencv-python-headless ocrmypdf \ + cryptography \ + "unoserver==${UNOSERVER_VERSION}" && \ + find /opt/venv -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true && \ + find /opt/venv \( -name '*.pyc' -o -name '*.pyi' \) -delete 2>/dev/null || true && \ + rm -rf /opt/venv/lib/python*/site-packages/pip \ + /opt/venv/lib/python*/site-packages/pip-*.dist-info \ + /opt/venv/lib/python*/site-packages/setuptools \ + /opt/venv/lib/python*/site-packages/setuptools-*.dist-info && \ + find /opt/venv -name '*.so' -exec strip --strip-unneeded {} + 2>/dev/null || true + # Final 
runtime image. FROM eclipse-temurin:25-jre-noble AS runtime @@ -377,10 +433,11 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ fonts-crosextra-caladea fonts-crosextra-carlito \ fonts-noto-core fonts-noto-mono fonts-noto-extra \ fonts-noto-cjk poppler-data \ - # We install these via apt to avoid downloading "fat wheels" from pip - # python3-full replaced with minimal set - python3 python3-dev python3-venv python3-uno \ - # Python dependencies via pip to avoid conflicts, so we don't install them here + # python3-uno required for UNO bridge (accessed by venv via --system-site-packages) + # python3-venv is NOT needed: the copied /opt/venv works without it at runtime + # python3-dev is NOT needed, venv is pre-built in python-venv-build stage + python3 python3-uno \ + # Python packages are in /opt/venv (copied from python-venv-build stage below) # OCR tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra \ tesseract-ocr-por tesseract-ocr-chi-sim \ @@ -401,36 +458,21 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ ; \ \ \ - # Note: We do NOT install numpy/pillow/cv2 here; it uses the system versions - python3 -m venv /opt/venv --system-site-packages; \ - /opt/venv/bin/pip install --no-cache-dir \ - weasyprint pdf2image opencv-python-headless ocrmypdf \ - cryptography \ - "unoserver==${UNOSERVER_VERSION}"; \ - \ - ln -sf /opt/venv/bin/unoconvert /usr/local/bin/unoconvert; \ - ln -sf /opt/venv/bin/unoserver /usr/local/bin/unoserver; \ - \ # Verify and fix LibreOffice libreoffice --version; \ soffice --version 2>/dev/null || true; \ # Rebuild UNO bridge type database /usr/lib/libreoffice/program/soffice.bin --headless --convert-to pdf /dev/null 2>/dev/null || true; \ - # Force font cache rebuild and verify filters are available + # Force font cache rebuild fc-cache -f -v 2>&1 | awk 'NR <= 20'; \ - /opt/venv/bin/python -c "import cv2; print('OpenCV', cv2.__version__)"; \ - /opt/venv/bin/python -c "import ocrmypdf; 
print('ocrmypdf OK')"; \ \ # Cleanup stage. \ - # Remove build-only packages no longer needed at runtime. - apt-get remove --purge -y software-properties-common python3-dev || true; \ + # Remove PPA helper, no longer needed after apt-get update + apt-get remove --purge -y software-properties-common || true; \ apt-get autoremove --purge -y || true; \ rm -rf /var/lib/apt/lists/*; \ \ - # Remove C/C++ headers (no longer needed after pip install) - rm -rf /usr/include/*; \ - \ # Docs / man / info / icons / themes / GUI assets (headless server) rm -rf /usr/share/doc/* /usr/share/man/* /usr/share/info/* \ /usr/share/lintian/* /usr/share/linda/* \ @@ -499,15 +541,6 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ /usr/lib/libreoffice/program/libdbu* \ /usr/lib/libreoffice/program/libreport* 2>/dev/null || true; \ \ - find /opt/venv -type d -name __pycache__ \ - -exec rm -rf {} + 2>/dev/null || true; \ - find /opt/venv \ - \( -name '*.pyc' -o -name '*.pyi' \) -delete 2>/dev/null || true; \ - rm -rf /opt/venv/lib/python*/site-packages/pip \ - /opt/venv/lib/python*/site-packages/pip-*.dist-info \ - /opt/venv/lib/python*/site-packages/setuptools \ - /opt/venv/lib/python*/site-packages/setuptools-*.dist-info; \ - \ rm -rf /usr/lib/python3.12/test \ /usr/lib/python3.12/idlelib \ /usr/lib/python3.12/tkinter \ @@ -524,8 +557,6 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ /usr/lib/python3/dist-packages/_cffi_backend*.so \ /usr/lib/python3/dist-packages/_cffi_backend*.cpython*.so \ 2>/dev/null || true; \ - /opt/venv/bin/python -c "import cffi; print('cffi OK:', cffi.__version__)" \ - || { echo 'ERROR: cffi broken after system package cleanup'; exit 1; }; \ \ # Strip debug symbols from ALL shared libraries find /usr/lib -name '*.so*' -type f \ @@ -597,7 +628,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ # to be rebuilt without --enable-libflite (not worth the complexity). 
\ # ── dpkg metadata cleanup (~14MB) ── - # Not needed at runtime — container won't run apt-get. + # Not needed at runtime, container won't run apt-get. rm -rf /var/lib/dpkg/info/*.list \ /var/lib/dpkg/info/*.md5sums \ /var/lib/dpkg/info/*.conffiles \ @@ -613,17 +644,23 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ # Misc caches rm -rf /var/cache/fontconfig/* /tmp/* -# Calibre and QPDF tools. -COPY --from=calibre-build /opt/calibre /opt/calibre -COPY --from=pdf-tools-build /usr/local/bin/qpdf /usr/bin/qpdf -COPY --from=pdf-tools-build /usr/local/bin/magick /usr/bin/magick -COPY --from=pdf-tools-build /usr/local/lib/libMagick*.so* /usr/local/lib/ -# Copy loadable coder/filter modules (required when built with --with-modules) -COPY --from=pdf-tools-build /usr/local/lib/ImageMagick-7* /usr/local/lib/ -COPY --from=pdf-tools-build /usr/local/etc/ImageMagick-7 /usr/local/etc/ImageMagick-7 -COPY --from=gs-build /usr/local/bin/gs /usr/local/bin/gs -COPY --from=gs-build /usr/local/share/ghostscript /usr/local/share/ghostscript -RUN ldconfig /usr/local/lib +# External tool layers, all use --link for independent layer caching and parallel pulls. 
+COPY --link --from=calibre-build /opt/calibre /opt/calibre +COPY --link --from=pdf-tools-build /usr/local/bin/qpdf /usr/bin/qpdf +# ImageMagick: 4 layers collapsed to 1 via the magick-export staging dir in pdf-tools-build +COPY --link --from=pdf-tools-build /magick-export/ / +COPY --link --from=gs-build /usr/local/bin/gs /usr/local/bin/gs +COPY --link --from=gs-build /usr/local/share/ghostscript /usr/local/share/ghostscript +# Python venv pre-built in python-venv-build (no pip install at runtime, no build tools needed) +COPY --link --from=python-venv-build /opt/venv /opt/venv +RUN ldconfig /usr/local/lib && \ + PYTHONDONTWRITEBYTECODE=1 \ + /opt/venv/bin/python -c "import cffi; print('cffi OK:', cffi.__version__)" && \ + PYTHONDONTWRITEBYTECODE=1 \ + /opt/venv/bin/python -c "import cv2; print('OpenCV', cv2.__version__)" && \ + PYTHONDONTWRITEBYTECODE=1 \ + /opt/venv/bin/python -c "import ocrmypdf; print('ocrmypdf OK')" && \ + find /opt/venv -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true # --- # Non-root user @@ -646,16 +683,16 @@ RUN set -eux; \ # Application files. 
WORKDIR /app -COPY --from=jar-extract --chown=1000:1000 /layers/dependencies/ /app/ -COPY --from=jar-extract --chown=1000:1000 /layers/spring-boot-loader/ /app/ -COPY --from=jar-extract --chown=1000:1000 /layers/snapshot-dependencies/ /app/ -COPY --from=jar-extract --chown=1000:1000 /layers/application/ /app/ +COPY --link --from=jar-extract --chown=1000:1000 /layers/dependencies/ /app/ +COPY --link --from=jar-extract --chown=1000:1000 /layers/spring-boot-loader/ /app/ +COPY --link --from=jar-extract --chown=1000:1000 /layers/snapshot-dependencies/ /app/ +COPY --link --from=jar-extract --chown=1000:1000 /layers/application/ /app/ -COPY --from=app-build --chown=1000:1000 \ +COPY --link --from=app-build --chown=1000:1000 \ /app/build/libs/restart-helper.jar /restart-helper.jar -COPY --chown=1000:1000 scripts/ /scripts/ +COPY --link --chown=1000:1000 scripts/ /scripts/ -# Fonts go to system dir — root ownership is correct (world-readable) +# Fonts go to system dir, root ownership is correct (world-readable) COPY app/core/src/main/resources/static/fonts/*.ttf /usr/share/fonts/truetype/ # Permissions and configuration. @@ -667,7 +704,7 @@ RUN set -eux; \ ln -sf /opt/venv/bin/weasyprint /usr/local/bin/weasyprint; \ ln -sf /opt/venv/bin/unoping /usr/local/bin/unoping; \ chmod +x /scripts/*; \ - mkdir -p /configs /logs /customFiles \ + mkdir -p /configs /configs/cache /configs/heap_dumps /logs /customFiles \ /pipeline/watchedFolders /pipeline/finishedFolders \ /tmp/stirling-pdf/heap_dumps; \ # Create symlinks to allow app to find these in /app/ @@ -684,15 +721,21 @@ RUN set -eux; \ chmod 750 /tmp/stirling-pdf/heap_dumps; \ fc-cache -f # NOTE: Project Leyden AOT cache is generated in the background on first boot - # by init-without-ocr.sh. The cache is picked up on subsequent boots for - # 15-25% faster startup. See: JEP 483 + 514 + 515 (JDK 25). + # by init-without-ocr.sh and stored in /configs/cache/stirling.aot (persistent volume). 
+ # The cache is picked up on subsequent boots for 15-25% faster startup. + # See: JEP 483 + 514 + 515 (JDK 25). # Environment variables. ARG VERSION_TAG +# Write version to a file so it is readable by scripts without env-var inheritance. +# init-without-ocr.sh reads /etc/stirling_version for the AOT cache fingerprint. +RUN echo "${VERSION_TAG:-dev}" > /etc/stirling_version + ENV VERSION_TAG=$VERSION_TAG \ + STIRLING_AOT_ENABLE="false" \ STIRLING_JVM_PROFILE="balanced" \ - _JVM_OPTS_BALANCED="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/stirling-pdf/heap_dumps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -XX:G1PeriodicGCInterval=60000 -XX:+UseStringDeduplication -XX:+UseCompactObjectHeaders -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ - _JVM_OPTS_PERFORMANCE="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/stirling-pdf/heap_dumps -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ + _JVM_OPTS_BALANCED="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/configs/heap_dumps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -XX:G1PeriodicGCInterval=60000 -XX:+UseStringDeduplication -XX:+UseCompactObjectHeaders -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ + _JVM_OPTS_PERFORMANCE="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/configs/heap_dumps -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ JAVA_CUSTOM_OPTS="" \ HOME=/home/stirlingpdfuser \ PUID=${PUID} 
\ @@ -724,8 +767,8 @@ LABEL org.opencontainers.image.title="Stirling-PDF" \ EXPOSE 8080/tcp STOPSIGNAL SIGTERM -HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \ - CMD curl -fs --show-error http://localhost:8080/api/v1/info/status || exit 1 +HEALTHCHECK --interval=30s --timeout=15s --start-period=120s --retries=5 \ + CMD curl -fs --max-time 10 http://localhost:8080/api/v1/info/status || exit 1 ENTRYPOINT ["tini", "--", "/scripts/init.sh"] CMD [] diff --git a/docker/embedded/Dockerfile.fat b/docker/embedded/Dockerfile.fat index c5cf64663..19e8cea3b 100644 --- a/docker/embedded/Dockerfile.fat +++ b/docker/embedded/Dockerfile.fat @@ -3,7 +3,7 @@ FROM ubuntu:noble AS calibre-build -ARG CALIBRE_VERSION=9.3.1 +ARG CALIBRE_VERSION=9.4.0 RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \ set -eux; \ @@ -562,9 +562,10 @@ RUN set -eux; \ # Environment variables. ARG VERSION_TAG ENV VERSION_TAG=$VERSION_TAG \ + STIRLING_AOT_ENABLE="false" \ STIRLING_JVM_PROFILE="balanced" \ - _JVM_OPTS_BALANCED="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/stirling-pdf/heap_dumps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -XX:G1PeriodicGCInterval=60000 -XX:+UseStringDeduplication -XX:+UseCompactObjectHeaders -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ - _JVM_OPTS_PERFORMANCE="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/stirling-pdf/heap_dumps -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ + _JVM_OPTS_BALANCED="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/configs/heap_dumps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -XX:G1PeriodicGCInterval=60000 -XX:+UseStringDeduplication 
-XX:+UseCompactObjectHeaders -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ + _JVM_OPTS_PERFORMANCE="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/configs/heap_dumps -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ JAVA_CUSTOM_OPTS="" \ HOME=/home/stirlingpdfuser \ PUID=${PUID} \ diff --git a/docker/embedded/Dockerfile.ultra-lite b/docker/embedded/Dockerfile.ultra-lite index d79747631..8783d9048 100644 --- a/docker/embedded/Dockerfile.ultra-lite +++ b/docker/embedded/Dockerfile.ultra-lite @@ -69,9 +69,10 @@ LABEL org.opencontainers.image.title="Stirling-PDF Ultra-Lite" \ # NOTE: Memory flags (InitialRAMPercentage, MaxRAMPercentage, MaxMetaspaceSize) # are computed dynamically by init-without-ocr.sh based on container memory limits. 
ENV VERSION_TAG=$VERSION_TAG \ + STIRLING_AOT_ENABLE="false" \ STIRLING_JVM_PROFILE="balanced" \ - _JVM_OPTS_BALANCED="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/stirling-pdf/heap_dumps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -XX:G1PeriodicGCInterval=60000 -XX:+UseStringDeduplication -XX:+UseCompactObjectHeaders -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ - _JVM_OPTS_PERFORMANCE="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/stirling-pdf/heap_dumps -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ + _JVM_OPTS_BALANCED="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/configs/heap_dumps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -XX:G1PeriodicGCInterval=60000 -XX:+UseStringDeduplication -XX:+UseCompactObjectHeaders -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ + _JVM_OPTS_PERFORMANCE="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/configs/heap_dumps -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Dspring.threads.virtual.enabled=true -Djava.awt.headless=true" \ JAVA_CUSTOM_OPTS="" \ HOME=/home/stirlingpdfuser \ PUID=1000 \ diff --git a/scripts/aot-diagnostics.sh b/scripts/aot-diagnostics.sh new file mode 100755 index 000000000..7baf74478 --- /dev/null +++ b/scripts/aot-diagnostics.sh @@ -0,0 +1,380 @@ +#!/bin/bash +# aot-diagnostics.sh - Project Leyden AOT cache diagnostic tool for Stirling-PDF +# +# Diagnoses AOT cache generation failures, especially on ARM64 (aarch64). 
+# Reports JVM feature support, memory limits, cache state, and fingerprint validity. +# +# Usage: +# aot-diagnostics.sh [--test] [--cache PATH] +# +# --test Run a quick AOT RECORD smoke test (~10-30s). Shows exactly +# what error the JVM produces, useful for ARM debugging. +# --cache PATH Override the AOT cache path (default: /configs/cache/stirling.aot) +# +# Symlink aliases set up by init-without-ocr.sh: aot-diag, aot-diagnostics + +set -euo pipefail + +AOT_CACHE_DEFAULT="/configs/cache/stirling.aot" +RUN_SMOKE_TEST=false +AOT_CACHE_PATH="" + +for arg in "$@"; do + case "$arg" in + --test) RUN_SMOKE_TEST=true ;; + --cache=*) AOT_CACHE_PATH="${arg#--cache=}" ;; + --cache) shift; AOT_CACHE_PATH="${1:-}" ;; + -h|--help) + sed -n '/^#/,/^[^#]/{ /^#/{ s/^# \{0,1\}//; p } }' "$0" | head -20 + exit 0 + ;; + esac +done + +AOT_CACHE="${AOT_CACHE_PATH:-$AOT_CACHE_DEFAULT}" +AOT_FP="${AOT_CACHE}.fingerprint" + +# ── Terminal colours ────────────────────────────────────────────────────────── +if [ -t 1 ]; then + C_RED='\033[0;31m' C_GRN='\033[0;32m' C_YLW='\033[0;33m' + C_CYN='\033[0;36m' C_BLD='\033[1m' C_RST='\033[0m' +else + C_RED='' C_GRN='' C_YLW='' C_CYN='' C_BLD='' C_RST='' +fi + +PASS=0; WARN=0; FAIL=0 + +pass() { printf "${C_GRN}[PASS]${C_RST} %s\n" "$*"; PASS=$((PASS+1)); } +warn() { printf "${C_YLW}[WARN]${C_RST} %s\n" "$*"; WARN=$((WARN+1)); } +fail() { printf "${C_RED}[FAIL]${C_RST} %s\n" "$*"; FAIL=$((FAIL+1)); } +info() { printf "${C_CYN}[INFO]${C_RST} %s\n" "$*"; } +hdr() { printf "\n${C_BLD}=== %s ===${C_RST}\n" "$*"; } +command_exists() { command -v "$1" >/dev/null 2>&1; } + +# ── Section 1: Environment ──────────────────────────────────────────────────── +hdr "Environment" +info "Date: $(date -u +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date)" +info "Hostname: $(hostname 2>/dev/null || echo unknown)" +info "Architecture: $(uname -m)" +info "Kernel: $(uname -r)" +if [ -f /etc/stirling_version ]; then + info "Version: $(tr -d '\r\n' < /etc/stirling_version)" 
+elif [ -n "${VERSION_TAG:-}" ]; then + info "Version: ${VERSION_TAG}" +else + warn "VERSION_TAG not set and /etc/stirling_version not found" +fi + +if [ -f /etc/os-release ]; then + info "OS: $(. /etc/os-release; echo "${PRETTY_NAME:-${NAME:-unknown}}")" +fi + +# Warn about external JVM option vars — these break AOT training if set +for _jvm_var in JAVA_TOOL_OPTIONS JDK_JAVA_OPTIONS _JAVA_OPTIONS; do + _jvm_val="$(eval echo "\${${_jvm_var}:-}")" + if [ -n "$_jvm_val" ]; then + warn "${_jvm_var}='${_jvm_val}'" + warn " External JVM options are cleared during AOT training (fixed), but may" + warn " affect the running app. Ensure they are compatible with -Xmx limits." + fi +done +unset _jvm_var _jvm_val + +# ── Section 2: JVM Detection ────────────────────────────────────────────────── +hdr "JVM Detection" +if ! command_exists java; then + fail "java not found in PATH. PATH=${PATH}" + exit 1 +fi + +JDK_VER="$(JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= java -version 2>&1 | head -1)" +info "JDK: ${JDK_VER}" +info "java binary: $(command -v java)" + +ARCH="$(uname -m)" + +# --- AOTMode support (Project Leyden) --- +AOT_SUPPORTED=false +if java -XX:AOTMode=off -version >/dev/null 2>&1; then + pass "AOTMode supported (-XX:AOTMode=off accepted)" + AOT_SUPPORTED=true +else + fail "AOTMode NOT supported on this JVM build ($(uname -m))" + fail " This JDK does not support Project Leyden (JEP 483/514/515)." + fail " AOT cache generation will be skipped." + if [[ "$ARCH" == "aarch64" ]]; then + warn " ARM64: some vendor JDK 25 builds omit Leyden. Try eclipse-temurin:25-jre." 
+ fi +fi + +# --- CompactObjectHeaders support (Project Lilliput) --- +COMPACT_HEADERS_FLAG="" +if java -XX:+UseCompactObjectHeaders -version >/dev/null 2>&1; then + pass "UseCompactObjectHeaders supported (Project Lilliput active)" + COMPACT_HEADERS_FLAG="-XX:+UseCompactObjectHeaders" +else + warn "UseCompactObjectHeaders NOT supported on $(uname -m)" + warn " AOT training will run without this flag. Runtime must also omit it." + if [[ "$ARCH" == "aarch64" ]]; then + warn " This is the most common cause of ARM AOT failures: the flag was" + warn " hardcoded in training but unsupported at runtime (or vice-versa)." + fi +fi + +# --- CompressedOops --- +COMPRESSED_OOPS_FLAG="-XX:+UseCompressedOops" +if java -XX:+UseCompressedOops -version >/dev/null 2>&1; then + pass "UseCompressedOops accepted by JVM" +else + warn "UseCompressedOops flag not accepted — will use -XX:-UseCompressedOops" + COMPRESSED_OOPS_FLAG="-XX:-UseCompressedOops" +fi + +# ── Section 3: Memory Limits ────────────────────────────────────────────────── +hdr "Memory Limits" + +MEM_MB=0 +if [ -f /sys/fs/cgroup/memory.max ]; then + RAW="$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo '')" + if [ "$RAW" = "max" ]; then + info "cgroup v2 memory.max: unlimited" + elif [ -n "$RAW" ] && [ "$RAW" -gt 0 ] 2>/dev/null; then + MEM_MB=$(( RAW / 1048576 )) + info "cgroup v2 memory.max: ${MEM_MB}MB" + fi +elif [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then + RAW="$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo '')" + if [ "${#RAW}" -ge 19 ]; then + info "cgroup v1 limit: unlimited (max uint64)" + elif [ -n "$RAW" ] && [ "$RAW" -gt 0 ] 2>/dev/null; then + MEM_MB=$(( RAW / 1048576 )) + info "cgroup v1 limit: ${MEM_MB}MB" + fi +else + info "No cgroup memory limit detected" +fi + +if [ "$MEM_MB" -eq 0 ] && [ -f /proc/meminfo ]; then + MEM_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo 2>/dev/null || echo 0) + info "System MemTotal: ${MEM_MB}MB" +fi + +MIN_MEM=768 +if [ 
"$ARCH" = "aarch64" ]; then + MIN_MEM=1024 +fi + +if [ "$MEM_MB" -eq 0 ]; then + warn "Could not determine container memory. AOT generation may be skipped." +elif [ "$MEM_MB" -le "$MIN_MEM" ]; then + warn "Available memory (${MEM_MB}MB) is at or below AOT generation minimum (${MIN_MEM}MB on ${ARCH})." + warn " AOT background generation will be skipped for this architecture." + warn " Increase container memory above ${MIN_MEM}MB to enable AOT cache generation." +else + pass "Memory OK: ${MEM_MB}MB available, minimum ${MIN_MEM}MB for ${ARCH}" +fi + +if command_exists free; then + FREE_MB="$(free -m 2>/dev/null | awk '/^Mem:/ {print $7}')" + info "Available (free+cache): ${FREE_MB:-?}MB" +fi + +# ── Section 4: AOT Cache State ──────────────────────────────────────────────── +hdr "AOT Cache State" +info "Cache path: ${AOT_CACHE}" +info "Fingerprint path: ${AOT_FP}" + +if [ -f "${AOT_CACHE}" ]; then + CACHE_SIZE="$(du -h "${AOT_CACHE}" 2>/dev/null | cut -f1 || echo '?')" + CACHE_MTIME="$(stat -c '%y' "${AOT_CACHE}" 2>/dev/null | cut -d. -f1 || echo '?')" + info "Cache exists: ${CACHE_SIZE} (modified ${CACHE_MTIME})" + if [ -s "${AOT_CACHE}" ]; then + pass "Cache file is non-empty" + else + fail "Cache file is empty — will be regenerated on next boot" + rm -f "${AOT_CACHE}" "${AOT_FP}" 2>/dev/null || true + fi +else + warn "No cache file at ${AOT_CACHE}" + info " Cache will be generated in background on next boot." + if [ ! -d "$(dirname "${AOT_CACHE}")" ]; then + warn " Parent directory $(dirname "${AOT_CACHE}") does not exist." + warn " Ensure /configs is volume-mounted and writable." 
+ fi +fi + +# --- Fingerprint validation --- +if [ -f "${AOT_FP}" ]; then + STORED_FP="$(tr -d '\r\n' < "${AOT_FP}" 2>/dev/null || echo '')" + info "Stored fingerprint: ${STORED_FP}" + + # Recompute fingerprint using the same logic as init-without-ocr.sh + FP="" + FP+="jdk:$(JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= java -version 2>&1 | head -1);" + FP+="arch:${ARCH};" + FP+="compact:${COMPACT_HEADERS_FLAG:-none};" + FP+="oops:${COMPRESSED_OOPS_FLAG:-none};" + if [ -f /app/app.jar ]; then + FP+="app:$(stat -c '%s-%Y' /app/app.jar 2>/dev/null || echo unknown);" + elif [ -f /app.jar ]; then + FP+="app:$(stat -c '%s-%Y' /app.jar 2>/dev/null || echo unknown);" + elif [ -d /app/lib ]; then + FP+="app:$(ls -la /app/lib/ 2>/dev/null | md5sum 2>/dev/null | cut -c1-16 || echo unknown);" + fi + FP+="ver:${VERSION_TAG:-unknown};" + if command_exists md5sum; then + EXPECTED_FP="$(printf '%s' "$FP" | md5sum | cut -c1-16)" + elif command_exists sha256sum; then + EXPECTED_FP="$(printf '%s' "$FP" | sha256sum | cut -c1-16)" + else + EXPECTED_FP="$(printf '%s' "$FP" | cksum | cut -d' ' -f1)" + fi + info "Expected fingerprint: ${EXPECTED_FP}" + + if [ "$STORED_FP" = "$EXPECTED_FP" ]; then + pass "Fingerprint valid — cache matches current JDK/arch/app" + else + fail "Fingerprint mismatch — cache is stale" + info " The cache was built with a different JDK, arch, flags, or app version." + info " It will be automatically removed and regenerated on next boot." + # Print a diff of fingerprint components for easier debugging + printf " Stored FP string: (run with --test to regenerate)\n" + printf " Expected FP string: %s\n" "$FP" + fi +else + if [ -f "${AOT_CACHE}" ]; then + warn "Cache exists but no fingerprint file found" + warn " Cache will be treated as stale and regenerated on next boot." 
+ else + info "No fingerprint file (expected — cache not yet generated)" + fi +fi + +# ── Section 5: JAR Layout Detection ────────────────────────────────────────── +hdr "JAR Layout" +if [ -f /app/app.jar ] && [ -d /app/lib ]; then + pass "Spring Boot 4 layered layout: /app/app.jar + /app/lib/" + info " Classpath: -cp /app/app.jar:/app/lib/* stirling.software.SPDF.SPDFApplication" + JAR_LAYOUT="layered" +elif [ -f /app.jar ]; then + pass "Single JAR layout: /app.jar" + info " Invocation: -jar /app.jar" + JAR_LAYOUT="single" +elif [ -d /app/BOOT-INF ]; then + pass "Exploded Spring Boot 3 layout: /app/BOOT-INF" + info " Classpath: -cp /app org.springframework.boot.loader.launch.JarLauncher" + JAR_LAYOUT="exploded" +else + fail "No recognisable JAR layout found. Looked for:" + fail " /app/app.jar + /app/lib/ (Spring Boot 4 layered)" + fail " /app.jar (single fat JAR)" + fail " /app/BOOT-INF/ (Spring Boot 3 exploded)" + JAR_LAYOUT="unknown" +fi + +# ── Section 6: Disk Space ───────────────────────────────────────────────────── +hdr "Disk Space" +CACHE_DIR="$(dirname "${AOT_CACHE}")" +if [ -d "$CACHE_DIR" ]; then + DF="$(df -h "$CACHE_DIR" 2>/dev/null | tail -1 || echo '')" + info "Volume ($CACHE_DIR): $DF" + AVAIL_PCT="$(df "$CACHE_DIR" 2>/dev/null | awk 'NR==2{print $5}' | tr -d '%')" + if [ -n "$AVAIL_PCT" ] && [ "$AVAIL_PCT" -ge 95 ]; then + fail "Disk almost full (${AVAIL_PCT}% used). AOT cache creation will fail." + elif [ -n "$AVAIL_PCT" ] && [ "$AVAIL_PCT" -ge 85 ]; then + warn "Disk usage high (${AVAIL_PCT}% used). AOT cache is typically 50-150MB." + else + pass "Sufficient disk space available" + fi +else + warn "Cache directory ${CACHE_DIR} does not exist." + warn " /configs must be volume-mounted. AOT cache will not persist across restarts." 
+fi + +# ── Section 7: Optional Smoke Test ─────────────────────────────────────────── +if [ "$RUN_SMOKE_TEST" = true ]; then + hdr "AOT RECORD Smoke Test" + if [ "$AOT_SUPPORTED" = false ]; then + warn "Skipping smoke test — AOTMode not supported on this JVM" + elif [ "$JAR_LAYOUT" = "unknown" ]; then + warn "Skipping smoke test — could not determine JAR layout" + else + info "Running minimal AOT RECORD phase (this may take 10-30s on ARM)..." + SMOKE_CONF="/tmp/aot-diag-smoke.aotconf" + SMOKE_LOG="/tmp/aot-diag-smoke.log" + rm -f "$SMOKE_CONF" "$SMOKE_LOG" + + SMOKE_CMD=(java -Xmx256m ${COMPACT_HEADERS_FLAG:-} ${COMPRESSED_OOPS_FLAG} + -Xlog:aot=info + -XX:AOTMode=record + -XX:AOTConfiguration="$SMOKE_CONF" + -Dspring.main.banner-mode=off + -Dspring.context.exit=onRefresh + -Dstirling.datasource.url="jdbc:h2:mem:aotsmoke;DB_CLOSE_DELAY=-1;MODE=PostgreSQL") + + case "$JAR_LAYOUT" in + layered) SMOKE_CMD+=(-cp "/app/app.jar:/app/lib/*" stirling.software.SPDF.SPDFApplication) ;; + single) SMOKE_CMD+=(-jar /app.jar) ;; + exploded) SMOKE_CMD+=(-cp /app org.springframework.boot.loader.launch.JarLauncher) ;; + esac + + info "Command: ${SMOKE_CMD[*]}" + SMOKE_EXIT=0 + if command_exists timeout; then + JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= \ + timeout 120s "${SMOKE_CMD[@]}" >"$SMOKE_LOG" 2>&1 || SMOKE_EXIT=$? + else + JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= \ + "${SMOKE_CMD[@]}" >"$SMOKE_LOG" 2>&1 || SMOKE_EXIT=$? + fi + + case "$SMOKE_EXIT" in + 0|1) + if [ -f "$SMOKE_CONF" ] && [ -s "$SMOKE_CONF" ]; then + CONF_SIZE="$(du -h "$SMOKE_CONF" | cut -f1)" + pass "RECORD phase succeeded (exit=${SMOKE_EXIT}, conf=${CONF_SIZE})" + info " AOT cache generation should work on this system." 
+ else + fail "RECORD phase exit=${SMOKE_EXIT} but no .aotconf produced" + info " Last 30 lines of AOT output:" + tail -30 "$SMOKE_LOG" 2>/dev/null | while IFS= read -r line; do + printf " %s\n" "$line" + done + fi + ;; + 124) + fail "RECORD phase timed out after 120s" + warn " On ARM under QEMU or slow storage this can happen." + warn " Try running with more memory or on native ARM hardware." + ;; + 137) + fail "RECORD phase OOM-killed (exit 137)" + warn " Increase container memory. Minimum for ARM AOT training: 1GB." + ;; + *) + fail "RECORD phase failed (exit=${SMOKE_EXIT})" + info " Last 30 lines of AOT output:" + tail -30 "$SMOKE_LOG" 2>/dev/null | while IFS= read -r line; do + printf " %s\n" "$line" + done + ;; + esac + rm -f "$SMOKE_CONF" "$SMOKE_LOG" + fi +fi + +# ── Summary ─────────────────────────────────────────────────────────────────── +printf "\n${C_BLD}=== Summary: PASS=%d WARN=%d FAIL=%d ===${C_RST}\n" \ + "$PASS" "$WARN" "$FAIL" + +if [ "$FAIL" -gt 0 ]; then + printf "${C_RED}AOT cache has issues. See FAIL items above.${C_RST}\n" + printf "To disable AOT: omit STIRLING_AOT_ENABLE (default is off) or set STIRLING_AOT_ENABLE=false\n" + exit 1 +elif [ "$WARN" -gt 0 ]; then + printf "${C_YLW}AOT cache may not function optimally. 
See WARN items above.${C_RST}\n" + exit 0 +else + printf "${C_GRN}All AOT checks passed.${C_RST}\n" + exit 0 +fi diff --git a/scripts/init-without-ocr.sh b/scripts/init-without-ocr.sh index e2215fffd..26e98c598 100755 --- a/scripts/init-without-ocr.sh +++ b/scripts/init-without-ocr.sh @@ -23,6 +23,11 @@ if [ -x /scripts/stirling-diagnostics.sh ]; then ln -sf /scripts/stirling-diagnostics.sh /usr/local/bin/debug ln -sf /scripts/stirling-diagnostics.sh /usr/local/bin/diagnostic fi +if [ -x /scripts/aot-diagnostics.sh ] && [ "${STIRLING_AOT_ENABLE:-false}" = "true" ]; then + mkdir -p /usr/local/bin + ln -sf /scripts/aot-diagnostics.sh /usr/local/bin/aot-diag + ln -sf /scripts/aot-diagnostics.sh /usr/local/bin/aot-diagnostics +fi print_versions() { set +o pipefail @@ -46,17 +51,45 @@ print_versions() { } cleanup() { + # Prevent re-entrance from double signals + trap '' SIGTERM EXIT + log "Shutdown signal received. Cleaning up..." - # Kill background AOT generation if still running - [ -n "${AOT_GEN_PID:-}" ] && kill -TERM "$AOT_GEN_PID" 2>/dev/null || true - # Kill background processes (unoservers, watchdog, Xvfb) - pkill -P $$ || true - # Kill Java if it was backgrounded (though it handles its own shutdown) - [ -n "${JAVA_PID:-}" ] && kill -TERM "$JAVA_PID" 2>/dev/null || true + + # Kill background AOT generation first (least important, clean up tmp files) + if [ -n "${AOT_GEN_PID:-}" ] && kill -0 "$AOT_GEN_PID" 2>/dev/null; then + kill -TERM "$AOT_GEN_PID" 2>/dev/null || true + wait "$AOT_GEN_PID" 2>/dev/null || true + fi + + # Signal unoserver instances to shut down + for pid in "${UNOSERVER_PIDS[@]:-}"; do + [ -n "$pid" ] && kill -TERM "$pid" 2>/dev/null || true + done + + # Signal Java to shut down gracefully, Spring Boot handles SIGTERM cleanly + if [ -n "${JAVA_PID:-}" ] && kill -0 "$JAVA_PID" 2>/dev/null; then + kill -TERM "$JAVA_PID" 2>/dev/null || true + # Wait up to 30s for graceful shutdown before forcing + local _i=0 + while [ "$_i" -lt 30 ] && kill -0 
"$JAVA_PID" 2>/dev/null; do + sleep 1 + _i=$((_i + 1)) + done + if kill -0 "$JAVA_PID" 2>/dev/null; then + log "Java did not exit within 30s, sending SIGKILL" + kill -KILL "$JAVA_PID" 2>/dev/null || true + fi + fi + + # Kill any remaining children (watchdog, Xvfb, etc.) + pkill -P $$ 2>/dev/null || true + log "Cleanup complete." } -trap cleanup SIGTERM EXIT +trap cleanup SIGTERM +trap cleanup EXIT print_versions @@ -321,6 +354,10 @@ if [ -z "${VERSION_TAG:-}" ] && [ -f /etc/stirling_version ]; then export VERSION_TAG fi +# ---------- AOT ---------- +# OFF by default. Set STIRLING_AOT_ENABLE=true to opt in. +AOT_ENABLED="${STIRLING_AOT_ENABLE:-false}" + # ---------- Dynamic Memory Detection ---------- # Detects the container memory limit (in MB) from cgroups v2/v1 or /proc/meminfo. detect_container_memory_mb() { @@ -408,9 +445,9 @@ compute_dynamic_memory() { # ---------- Project Leyden AOT Cache (JEP 483 + 514 + 515) ---------- # Replaces legacy AppCDS with JDK 25's AOT cache. Uses the three-step workflow: -# 1. RECORD — runs Spring context init, captures class loading + method profiles -# 2. CREATE — builds the AOT cache file (does NOT start the app) -# 3. RUNTIME — java -XX:AOTCache=... starts with pre-linked classes + compiled methods +# 1. RECORD , runs Spring context init, captures class loading + method profiles +# 2. CREATE , builds the AOT cache file (does NOT start the app) +# 3. RUNTIME, java -XX:AOTCache=... starts with pre-linked classes + compiled methods # Constraints: # - Cache must be generated on the same JDK build + OS + arch as production (satisfied # because we generate inside the same container image at runtime) @@ -426,64 +463,193 @@ generate_aot_cache() { mkdir -p "$aot_dir" 2>/dev/null || true local aot_conf="/tmp/stirling.aotconf" + local arch + arch=$(uname -m) - log "AOT: Phase 1/2 — Recording class loading + method profiles..." + # ── ARM-aware heap sizing ── + # ARM devices (Raspberry Pi, Ampere) often have tighter memory. 
+ # Scale training heap down to avoid OOM-killing the background generation. + local record_xmx="512m" + local create_xmx="256m" + if [ "${CONTAINER_MEM_MB:-0}" -gt 0 ] && [ "${CONTAINER_MEM_MB}" -le 1024 ]; then + record_xmx="256m" + create_xmx="128m" + fi - # RECORD — starts Spring context, observes class loading + collects method profiles (JEP 515). - # -Dspring.context.exit=onRefresh stops after Spring context loads (good training coverage). - # Uses -Xmx512m: enough for Spring context init without starving the running application. - # -Xlog:aot=error suppresses harmless "Skipping"/"Preload Warning" messages for proxies, - # signed JARs (BouncyCastle), JFR events, CGLIB classes, etc. The JVM handles all of - # these internally they are informational, not errors. - # Non-zero exit is expected — onRefresh triggers controlled shutdown. - # Uses in-memory H2 database to avoid file-lock conflicts with the running application. - # Note: DatabaseConfig reads System.getProperty("stirling.datasource.url") to override - # the default file-based H2 URL. We use MODE=PostgreSQL to match the production config. - # Redirect both stdout and stderr to suppress duplicate startup logs (banner + Spring init). - # IMPORTANT: COMPRESSED_OOPS_FLAG must match the runtime setting to avoid AOT cache - # invalidation on restart ("saved state of UseCompressedOops ... is different" error). - java -Xmx512m -XX:+UseCompactObjectHeaders ${COMPRESSED_OOPS_FLAG} \ - -Xlog:aot=error \ - -XX:AOTMode=record \ - -XX:AOTConfiguration="$aot_conf" \ - -Dspring.main.banner-mode=off \ - -Dspring.context.exit=onRefresh \ - -Dstirling.datasource.url="jdbc:h2:mem:aottraining;DB_CLOSE_DELAY=-1;MODE=PostgreSQL" \ - "$@" >/tmp/aot-record.log 2>&1 || true + # ── ARM-aware timeouts ── + # ARM under QEMU or on slow SD/eMMC can take much longer than x86_64. 
+ local record_timeout=300 + local create_timeout=180 + if [ "$arch" = "aarch64" ]; then + record_timeout=600 + create_timeout=300 + fi + + log "AOT: arch=${arch} mem=${CONTAINER_MEM_MB:-?}MB heap=${record_xmx} timeouts=${record_timeout}s/${create_timeout}s" + log "AOT: COMPACT_HEADERS='${COMPACT_HEADERS_FLAG:-}' COMPRESSED_OOPS='${COMPRESSED_OOPS_FLAG}'" + log "AOT: Phase 1/2, Recording class loading + method profiles..." + + # RECORD, starts Spring context, observes class loading + collects method profiles (JEP 515). + # Non-zero exit is expected: -Dspring.context.exit=onRefresh triggers controlled shutdown. + # Uses in-memory H2 to avoid file-lock conflicts with the running app. + # COMPACT_HEADERS_FLAG/COMPRESSED_OOPS_FLAG must exactly match the runtime invocation. + # Clear all JVM option env vars so external settings (e.g. _JAVA_OPTIONS=-Xms14G) cannot + # conflict with the explicit -Xmx we pass here. Training uses its own minimal flag set. + local record_exit=0 + if command_exists timeout; then + JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= \ + timeout "${record_timeout}s" \ + java "-Xmx${record_xmx}" ${COMPACT_HEADERS_FLAG:-} ${COMPRESSED_OOPS_FLAG} \ + -Xlog:aot=error \ + -XX:AOTMode=record \ + -XX:AOTConfiguration="$aot_conf" \ + -Dspring.main.banner-mode=off \ + -Dspring.context.exit=onRefresh \ + -Dstirling.datasource.url="jdbc:h2:mem:aottraining;DB_CLOSE_DELAY=-1;MODE=PostgreSQL" \ + "$@" >/tmp/aot-record.log 2>&1 || record_exit=$? + else + JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= \ + java "-Xmx${record_xmx}" ${COMPACT_HEADERS_FLAG:-} ${COMPRESSED_OOPS_FLAG} \ + -Xlog:aot=error \ + -XX:AOTMode=record \ + -XX:AOTConfiguration="$aot_conf" \ + -Dspring.main.banner-mode=off \ + -Dspring.context.exit=onRefresh \ + -Dstirling.datasource.url="jdbc:h2:mem:aottraining;DB_CLOSE_DELAY=-1;MODE=PostgreSQL" \ + "$@" >/tmp/aot-record.log 2>&1 || record_exit=$? 
+ fi + + if [ "$record_exit" -eq 124 ]; then + log "AOT: RECORD phase timed out after ${record_timeout}s, skipping" + rm -f "$aot_conf" /tmp/aot-record.log + return 1 + fi + if [ "$record_exit" -eq 137 ]; then + log "AOT: RECORD phase OOM-killed (exit 137), container memory too low for training" + log "AOT: Set STIRLING_AOT_ENABLE=false or increase container memory above 1GB" + rm -f "$aot_conf" /tmp/aot-record.log + return 1 + fi if [ ! -f "$aot_conf" ]; then - log "AOT: Training produced no configuration file." - tail -5 /tmp/aot-record.log 2>/dev/null | while IFS= read -r line; do log " $line"; done + log "AOT: Training produced no configuration file (exit=${record_exit}), last 30 lines:" + tail -30 /tmp/aot-record.log 2>/dev/null | while IFS= read -r line; do log " $line"; done rm -f /tmp/aot-record.log return 1 fi + log "AOT: Phase 1 complete, conf $(du -h "$aot_conf" 2>/dev/null | cut -f1)" + log "AOT: Phase 2/2, Creating AOT cache from recorded profile..." - log "AOT: Phase 2/2 — Creating AOT cache from recorded profile..." - - # CREATE — does NOT start the application. Processes the recorded configuration - # to build the AOT cache with pre-linked classes and optimized native code. - # Uses less memory than the training run. - # -Xlog:aot=error: same as record phase — suppress harmless skip/preload warnings. - # Redirect both stdout and stderr to avoid polluting container logs. - # IMPORTANT: COMPRESSED_OOPS_FLAG must match both RECORD and RUNTIME. - if java -Xmx256m -XX:+UseCompactObjectHeaders ${COMPRESSED_OOPS_FLAG} \ - -Xlog:aot=error \ - -XX:AOTMode=create \ - -XX:AOTConfiguration="$aot_conf" \ - -XX:AOTCache="$aot_path" \ - "$@" >/tmp/aot-create.log 2>&1; then - - local cache_size - cache_size=$(du -h "$aot_path" 2>/dev/null | cut -f1) - log "AOT: Cache created successfully: $aot_path ($cache_size)" - rm -f "$aot_conf" /tmp/aot-record.log /tmp/aot-create.log - return 0 + # CREATE, does NOT start the application; builds pre-linked class + method data. 
+ local create_exit=0 + if command_exists timeout; then + JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= \ + timeout "${create_timeout}s" \ + java "-Xmx${create_xmx}" ${COMPACT_HEADERS_FLAG:-} ${COMPRESSED_OOPS_FLAG} \ + -Xlog:aot=error \ + -XX:AOTMode=create \ + -XX:AOTConfiguration="$aot_conf" \ + -XX:AOTCache="$aot_path" \ + "$@" >/tmp/aot-create.log 2>&1 || create_exit=$? else - log "AOT: Cache creation failed." - tail -5 /tmp/aot-create.log 2>/dev/null | while IFS= read -r line; do log " $line"; done + JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= \ + java "-Xmx${create_xmx}" ${COMPACT_HEADERS_FLAG:-} ${COMPRESSED_OOPS_FLAG} \ + -Xlog:aot=error \ + -XX:AOTMode=create \ + -XX:AOTConfiguration="$aot_conf" \ + -XX:AOTCache="$aot_path" \ + "$@" >/tmp/aot-create.log 2>&1 || create_exit=$? + fi + + if [ "$create_exit" -eq 124 ]; then + log "AOT: CREATE phase timed out after ${create_timeout}s" rm -f "$aot_conf" "$aot_path" /tmp/aot-record.log /tmp/aot-create.log return 1 fi + if [ "$create_exit" -eq 137 ]; then + log "AOT: CREATE phase OOM-killed (exit 137)" + rm -f "$aot_conf" "$aot_path" /tmp/aot-record.log /tmp/aot-create.log + return 1 + fi + + if [ "$create_exit" -eq 0 ] && [ -f "$aot_path" ] && [ -s "$aot_path" ]; then + local cache_size + cache_size=$(du -h "$aot_path" 2>/dev/null | cut -f1) + log "AOT: Cache created successfully: $aot_path ($cache_size)" + chmod 644 "$aot_path" 2>/dev/null || true + save_aot_fingerprint "$aot_path" + rm -f "$aot_conf" /tmp/aot-record.log /tmp/aot-create.log + return 0 + else + log "AOT: Cache creation failed (exit=${create_exit}), last 30 lines:" + tail -30 /tmp/aot-create.log 2>/dev/null | while IFS= read -r line; do log " $line"; done + rm -f "$aot_conf" "$aot_path" /tmp/aot-record.log /tmp/aot-create.log + return 1 + fi +} + +# ---------- AOT Cache Fingerprinting ---------- +# Detects stale caches automatically when the app JAR, JDK version, arch, or JVM flags change. 
+# Stores a short hash alongside the cache file; mismatch → cache is deleted and regenerated. +compute_aot_fingerprint() { + local fp="" + # Clear JAVA_TOOL_OPTIONS / JDK_JAVA_OPTIONS so the JVM does not prepend + # "Picked up JAVA_TOOL_OPTIONS: ..." to stderr before the version line. + # Those vars are exported by the time the background subshell runs + # save_aot_fingerprint, but are NOT yet set when validate_aot_cache runs on + # the next boot -- causing head -1 to return different strings each time. + fp+="jdk:$(JAVA_TOOL_OPTIONS= JDK_JAVA_OPTIONS= _JAVA_OPTIONS= java -version 2>&1 | head -1);" + fp+="arch:$(uname -m);" + fp+="compact:${COMPACT_HEADERS_FLAG:-none};" + fp+="oops:${COMPRESSED_OOPS_FLAG:-none};" + # App identity: size+mtime is fast (avoids hashing 200MB JARs) + if [ -f /app/app.jar ]; then + fp+="app:$(stat -c '%s-%Y' /app/app.jar 2>/dev/null || echo unknown);" + elif [ -f /app.jar ]; then + fp+="app:$(stat -c '%s-%Y' /app.jar 2>/dev/null || echo unknown);" + elif [ -d /app/lib ]; then + fp+="app:$(ls -la /app/lib/ 2>/dev/null | md5sum 2>/dev/null | cut -c1-16 || echo unknown);" + fi + fp+="ver:${VERSION_TAG:-unknown};" + if command_exists md5sum; then + printf '%s' "$fp" | md5sum | cut -c1-16 + elif command_exists sha256sum; then + printf '%s' "$fp" | sha256sum | cut -c1-16 + else + printf '%s' "$fp" | cksum | cut -d' ' -f1 + fi +} + +validate_aot_cache() { + local cache_path="$1" + local fp_file="${cache_path}.fingerprint" + + [ -f "$cache_path" ] || return 1 + if [ ! -s "$cache_path" ]; then + log "AOT: Cache file is empty, removing." + rm -f "$cache_path" "$fp_file" + return 1 + fi + + local expected_fp stored_fp="" + expected_fp=$(compute_aot_fingerprint) + [ -f "$fp_file" ] && stored_fp=$(cat "$fp_file" 2>/dev/null || true) + + if [ "$stored_fp" != "$expected_fp" ]; then + log "AOT: Fingerprint mismatch (stored=${stored_fp:-} expected=${expected_fp})." + log "AOT: JAR, JDK, arch, or flags changed, removing stale cache." 
+ rm -f "$cache_path" "$fp_file" + return 1 + fi + log "AOT: Cache fingerprint valid (${expected_fp})" + return 0 +} + +save_aot_fingerprint() { + local cache_path="$1" + local fp_file="${cache_path}.fingerprint" + compute_aot_fingerprint > "$fp_file" 2>/dev/null || true + chmod 644 "$fp_file" 2>/dev/null || true } # ---------- Memory Detection ---------- @@ -493,24 +659,18 @@ compute_dynamic_memory "$CONTAINER_MEM_MB" "$JVM_PROFILE" MEMORY_FLAGS="-XX:InitialRAMPercentage=${DYNAMIC_INITIAL_RAM_PCT} -XX:MaxRAMPercentage=${DYNAMIC_MAX_RAM_PCT} -XX:MaxMetaspaceSize=${DYNAMIC_MAX_METASPACE}m" # ---------- Compressed Oops Detection ---------- -# AOT/CDS cache is sensitive to UseCompressedOops. The setting must be identical -# between the training run (generate_aot_cache) and all subsequent runtime boots. -# With small -Xmx during training the JVM defaults to +UseCompressedOops, but at -# runtime a large MaxRAMPercentage (e.g. 50% of 64GB ≈ 32GB) may cause the JVM to -# disable it, invalidating the cache. We compute the expected max heap and lock the -# flag so every invocation agrees. -if [ "$CONTAINER_MEM_MB" -gt 0 ] 2>/dev/null; then - MAX_HEAP_MB=$((CONTAINER_MEM_MB * DYNAMIC_MAX_RAM_PCT / 100)) - # JVM disables compressed oops when max heap >= ~32 GB (exact threshold varies - # by alignment / JVM build). Use a conservative 31744 MB (~31 GB) cutoff. - if [ "$MAX_HEAP_MB" -ge 31744 ]; then - COMPRESSED_OOPS_FLAG="-XX:-UseCompressedOops" +# Only needed for AOT cache consistency (training and runtime must agree on this flag). 
+if [ "$AOT_ENABLED" = "true" ]; then + if [ "$CONTAINER_MEM_MB" -gt 0 ] 2>/dev/null; then + MAX_HEAP_MB=$((CONTAINER_MEM_MB * DYNAMIC_MAX_RAM_PCT / 100)) + if [ "$MAX_HEAP_MB" -ge 31744 ]; then + COMPRESSED_OOPS_FLAG="-XX:-UseCompressedOops" + else + COMPRESSED_OOPS_FLAG="-XX:+UseCompressedOops" + fi else COMPRESSED_OOPS_FLAG="-XX:+UseCompressedOops" fi -else - # Cannot detect memory — default matches small-heap behaviour - COMPRESSED_OOPS_FLAG="-XX:+UseCompressedOops" fi # ---------- JVM Profile Selection ---------- @@ -563,58 +723,63 @@ else fi fi -# Check if Project Lilliput is supported (standard in Java 25+) +# Check if Project Lilliput is supported (standard in Java 25+, but experimental on some ARM builds) +# COMPACT_HEADERS_FLAG is used by generate_aot_cache() to ensure training/runtime consistency. if java -XX:+UseCompactObjectHeaders -version >/dev/null 2>&1; then + COMPACT_HEADERS_FLAG="-XX:+UseCompactObjectHeaders" # Only append if not already present in JAVA_BASE_OPTS case "${JAVA_BASE_OPTS}" in *UseCompactObjectHeaders*) ;; *) - log "JVM supports Compact Object Headers. Enabling Project Lilliput..." + log "JVM supports Compact Object Headers ($(uname -m)). Enabling Project Lilliput..." JAVA_BASE_OPTS="${JAVA_BASE_OPTS} -XX:+UseCompactObjectHeaders" ;; esac else - log "JVM does not support Compact Object Headers. Skipping Project Lilliput flags." + COMPACT_HEADERS_FLAG="" + log "JVM does not support Compact Object Headers on $(uname -m). Skipping Project Lilliput flags." +fi + +# ---------- AOT Support Check ---------- +AOT_SUPPORTED=false +if [ "$AOT_ENABLED" = "true" ]; then + AOT_SUPPORTED=true + if ! 
java -XX:AOTMode=off -version >/dev/null 2>&1; then + log "AOT: JVM on $(uname -m) does not support -XX:AOTMode, AOT cache disabled" + AOT_SUPPORTED=false + fi fi # ---------- Clean deprecated/invalid JVM flags ---------- # Remove UseCompressedClassPointers (deprecated in Java 25+ with Lilliput) JAVA_BASE_OPTS=$(echo "$JAVA_BASE_OPTS" | sed -E 's/-XX:[+-]UseCompressedClassPointers//g') -# Remove any existing UseCompressedOops (we manage it explicitly for AOT consistency) -JAVA_BASE_OPTS=$(echo "$JAVA_BASE_OPTS" | sed -E 's/-XX:[+-]UseCompressedOops//g') -# Append the computed compressed oops flag (must match AOT training) -JAVA_BASE_OPTS="${JAVA_BASE_OPTS} ${COMPRESSED_OOPS_FLAG}" +# Manage UseCompressedOops explicitly only when AOT is enabled (training/runtime must agree) +if [ "$AOT_ENABLED" = "true" ]; then + JAVA_BASE_OPTS=$(echo "$JAVA_BASE_OPTS" | sed -E 's/-XX:[+-]UseCompressedOops//g') + JAVA_BASE_OPTS="${JAVA_BASE_OPTS} ${COMPRESSED_OOPS_FLAG}" +fi # ---------- AOT Cache Management (Project Leyden) ---------- -# Strip any legacy CDS/AOT references from base opts (we manage AOT dynamically below) -JAVA_BASE_OPTS=$(echo "$JAVA_BASE_OPTS" | sed -E \ - 's/-XX:SharedArchiveFile=[^ ]*//g; - s/-Xshare:(auto|on|off)//g; - s/-XX:AOTCache=[^ ]*//g') - -AOT_CACHE="/app/stirling.aot" +AOT_CACHE="/configs/cache/stirling.aot" AOT_GENERATE_BACKGROUND=false -# Support both new (STIRLING_AOT_DISABLE) and legacy (STIRLING_CDS_DISABLE) env vars -AOT_DISABLED="${STIRLING_AOT_DISABLE:-${STIRLING_CDS_DISABLE:-false}}" +if [ "$AOT_ENABLED" = "true" ]; then + # Strip any legacy CDS/AOT references from base opts (managed dynamically here) + JAVA_BASE_OPTS=$(echo "$JAVA_BASE_OPTS" | sed -E \ + 's/-XX:SharedArchiveFile=[^ ]*//g; + s/-Xshare:(auto|on|off)//g; + s/-XX:AOTCache=[^ ]*//g') -if [ -f "$AOT_CACHE" ]; then - # Cache exists from a previous boot — use it. 
- # If the file is corrupt or from a different JDK build, the JVM issues a warning - # and continues without the cache (graceful degradation, no crash). - log "AOT cache found: $AOT_CACHE" - JAVA_BASE_OPTS="${JAVA_BASE_OPTS} -XX:AOTCache=${AOT_CACHE}" - - # Clean up legacy .jsa if still present - rm -f /app/stirling.jsa 2>/dev/null || true -elif [ "$AOT_DISABLED" = "true" ]; then - log "AOT cache disabled via STIRLING_AOT_DISABLE=true" -else - # No cache exists — schedule background generation after app starts. - # The app starts immediately (no training delay). The AOT cache will be - # ready for the NEXT boot, giving 15-25% faster startup from then on. - log "No AOT cache found. Will generate in background after app starts." - AOT_GENERATE_BACKGROUND=true + if [ "$AOT_SUPPORTED" = false ]; then + log "AOT: Not supported on this JVM/platform, skipping" + elif validate_aot_cache "$AOT_CACHE"; then + log "AOT cache valid: $AOT_CACHE" + JAVA_BASE_OPTS="${JAVA_BASE_OPTS} -XX:AOTCache=${AOT_CACHE}" + rm -f /app/stirling.jsa /app/stirling.aot /app/stirling.aot.fingerprint 2>/dev/null || true + else + log "No valid AOT cache found. Will generate in background after app starts." + AOT_GENERATE_BACKGROUND=true + fi fi # Collapse duplicate whitespace @@ -688,7 +853,7 @@ fi # ---------- Permissions ---------- # Ensure required directories exist and set correct permissions. log "Setting permissions..." 
-mkdir -p /tmp/stirling-pdf /tmp/stirling-pdf/heap_dumps /logs /configs /configs/heap_dumps /customFiles /pipeline || true +mkdir -p /tmp/stirling-pdf /tmp/stirling-pdf/heap_dumps /logs /configs /configs/heap_dumps /configs/cache /customFiles /pipeline || true CHOWN_PATHS=("$HOME" "/logs" "/scripts" "/configs" "/customFiles" "/pipeline" "/tmp/stirling-pdf" "/app.jar") [ -d /usr/share/fonts/truetype ] && CHOWN_PATHS+=("/usr/share/fonts/truetype") CHOWN_OK=true @@ -705,6 +870,7 @@ if command_exists Xvfb; then log "Starting Xvfb on :99" Xvfb :99 -screen 0 1024x768x24 -ac +extension GLX +render -noreset > /dev/null 2>&1 & export DISPLAY=:99 + # Brief pause so Xvfb accepts connections before unoserver tries to attach sleep 1 else log "Xvfb not installed; skipping virtual display setup" @@ -712,44 +878,22 @@ fi # ---------- unoserver ---------- # Start LibreOffice UNO server for document conversions. +# Java and unoserver start in parallel, do NOT block here waiting for readiness. +# Readiness is verified after Java is launched; the watchdog handles any restarts. UNOSERVER_BIN="$(command -v unoserver || true)" UNOCONVERT_BIN="$(command -v unoconvert || true)" UNOPING_BIN="$(command -v unoping || true)" if [ -n "$UNOSERVER_BIN" ] && [ -n "$UNOCONVERT_BIN" ]; then LIBREOFFICE_PROFILE="${HOME:-/home/${RUNTIME_USER}}/.libreoffice_uno_${RUID}" run_as_runtime_user mkdir -p "$LIBREOFFICE_PROFILE" - start_unoserver_pool - log "unoserver pool started (Profile: $LIBREOFFICE_PROFILE)" - - - # Wait until UNO server is ready. - log "Waiting for unoserver..." - for _ in {1..20}; do - # Pass 'silent' to check_unoserver_ready to suppress unoping failure logs during wait - if check_unoserver_ready "silent"; then - log "unoserver is ready!" - break - fi - sleep 1 - done - - start_unoserver_watchdog - - if ! check_unoserver_ready; then - log "ERROR: unoserver failed!" 
- for pid in "${UNOSERVER_PIDS[@]}"; do - kill "$pid" 2>/dev/null || true - wait "$pid" 2>/dev/null || true - done - exit 1 - fi + log "unoserver pool started (Profile: $LIBREOFFICE_PROFILE), Java starting in parallel" else log "unoserver/unoconvert not installed; skipping UNO setup" fi # ---------- Java ---------- -# Start Stirling PDF Java application. +# Start Stirling PDF Java application immediately (parallel with unoserver startup). log "Starting Stirling PDF" JAVA_CMD=( java @@ -780,46 +924,108 @@ fi JAVA_PID=$! +# ---------- Unoserver Readiness + Watchdog ---------- +# Now that Java is running, check unoserver readiness and start the watchdog. +# Runs in the main shell (not a subshell) so UNOSERVER_PIDS/PORTS arrays are accessible. +# Java handles unoserver being temporarily unavailable, no fatal exit on timeout. +if [ "${#UNOSERVER_PORTS[@]}" -gt 0 ]; then + log "Waiting for unoserver (Java already starting in parallel)..." + UNOSERVER_READY=false + for _ in {1..30}; do + if check_unoserver_ready "silent"; then + log "unoserver is ready!" + UNOSERVER_READY=true + break + fi + sleep 1 + done + + start_unoserver_watchdog + + if [ "$UNOSERVER_READY" = false ] && ! check_unoserver_ready; then + log "WARNING: unoserver not ready after 30s. Watchdog will manage restarts. Document conversion may be temporarily unavailable." + fi +fi + # ---------- Background AOT Cache Generation ---------- -# On first boot (no existing cache), generate the AOT cache in the background -# so the app starts immediately. The cache is picked up on the next boot. -# Only runs on containers with >768MB memory to avoid starving the main process. +# On first boot (no valid cache), generate the AOT cache in the background so the app +# starts immediately. The cache is ready for the NEXT boot (15-25% faster startup). 
AOT_GEN_PID="" if [ "$AOT_GENERATE_BACKGROUND" = true ]; then - if [ "$CONTAINER_MEM_MB" -gt 768 ] || [ "$CONTAINER_MEM_MB" -eq 0 ]; then - ( - # Wait for the app to finish starting before competing for resources. - # This avoids CPU/memory contention during Spring Boot initialization. - sleep 45 + # ARM devices need more memory for training due to JIT differences + _aot_min_mem=768 + if [ "$(uname -m)" = "aarch64" ]; then + _aot_min_mem=1024 + fi + + if [ "$CONTAINER_MEM_MB" -gt "$_aot_min_mem" ] || [ "$CONTAINER_MEM_MB" -eq 0 ]; then + ( + # Wait for Spring Boot to finish initializing before competing for CPU/memory. + # ARM devices (Raspberry Pi 4, Ampere) need extra time, 90s vs 45s on x86_64. + _startup_wait=45 + if [ "$(uname -m)" = "aarch64" ]; then + _startup_wait=90 + log "AOT: ARM, waiting ${_startup_wait}s for app stabilization before training" + fi + sleep "$_startup_wait" - # Verify the main app is still running before investing in cache generation if ! kill -0 "$JAVA_PID" 2>/dev/null; then log "AOT: Main process exited; skipping cache generation." exit 0 fi - log "AOT: Starting background cache generation for next boot..." - if [ -f /app/app.jar ] && [ -d /app/lib ]; then - generate_aot_cache "$AOT_CACHE" -cp "/app/app.jar:/app/lib/*" stirling.software.SPDF.SPDFApplication - elif [ -f /app.jar ]; then - generate_aot_cache "$AOT_CACHE" -jar /app.jar - elif [ -d /app/BOOT-INF ]; then - # Spring Boot exploded layer layout (produced by 'java -Djarmode=tools extract --layers'). - # The actual JAVA_CMD uses JarLauncher with default classpath = CWD (/app). - # Mirror that exactly: -cp /app resolves the same classes. - generate_aot_cache "$AOT_CACHE" -cp /app org.springframework.boot.loader.launch.JarLauncher - else - log "AOT: Cannot determine JAR layout; skipping cache generation." - fi + _attempt=1 + _max_attempts=2 + while [ "$_attempt" -le "$_max_attempts" ]; do + log "AOT: Background cache generation attempt ${_attempt}/${_max_attempts}..." 
+ _gen_rc=0 + if [ -f /app/app.jar ] && [ -d /app/lib ]; then + generate_aot_cache "$AOT_CACHE" \ + -cp "/app/app.jar:/app/lib/*" stirling.software.SPDF.SPDFApplication || _gen_rc=$? + elif [ -f /app.jar ]; then + generate_aot_cache "$AOT_CACHE" -jar /app.jar || _gen_rc=$? + elif [ -d /app/BOOT-INF ]; then + # Spring Boot exploded layer layout, mirror the exact JAVA_CMD classpath + generate_aot_cache "$AOT_CACHE" \ + -cp /app org.springframework.boot.loader.launch.JarLauncher || _gen_rc=$? + else + log "AOT: Cannot determine JAR layout; skipping cache generation." + exit 0 + fi + + if [ "$_gen_rc" -eq 0 ] && [ -f "$AOT_CACHE" ]; then + log "AOT: Cache ready for next boot!" + exit 0 + fi + + log "AOT: Attempt ${_attempt} failed (rc=${_gen_rc})" + _attempt=$((_attempt + 1)) + if [ "$_attempt" -le "$_max_attempts" ]; then + if ! kill -0 "$JAVA_PID" 2>/dev/null; then + log "AOT: Main process exited during retry; aborting." + exit 0 + fi + log "AOT: Retrying in 30s..." + sleep 30 + fi + done + log "AOT: All attempts failed. App runs normally without cache." + log "AOT: To disable, set STIRLING_AOT_ENABLE=false (or omit it, default is off)" ) & AOT_GEN_PID=$! - log "AOT: Background cache generation scheduled (PID $AOT_GEN_PID)" + log "AOT: Background generation scheduled (PID $AOT_GEN_PID, arch=$(uname -m))" else - log "AOT: Container memory (${CONTAINER_MEM_MB}MB) too low for background generation (need >768MB). Cache will not be created." + log "AOT: Container memory (${CONTAINER_MEM_MB}MB) below minimum (${_aot_min_mem}MB on $(uname -m)), skipping cache generation" fi fi -wait "$JAVA_PID" +wait "$JAVA_PID" && true exit_code=$? -# Propagate Java's actual exit code so container orchestrators can detect crashes +case "$exit_code" in + 0) log "Stirling PDF exited normally." ;; + 137) log "Stirling PDF was OOM-killed (exit 137). Check container memory limits." ;; + 143) log "Stirling PDF terminated by SIGTERM (normal orchestrator shutdown)."
;; + *) log "Stirling PDF exited with code ${exit_code}." ;; +esac +# Propagate exit code so orchestrators can detect crashes vs clean shutdowns exit "${exit_code}"
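
The exit-status handling at the end of `scripts/init-without-ocr.sh` only distinguishes 0, 137, and 143 if `$?` still holds the status of `wait` at the moment it is read. The following is a minimal sketch of that constraint, not code from the patch: `capture_lossy`, `capture_exact`, and `LAST_RC` are hypothetical names used purely for illustration, and a subshell that exits with 143 stands in for the Java process.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: these helpers are hypothetical and are not part of the patch.
set -e

LAST_RC=0

# Lossy: `|| true` turns the list's status into 0, so the child's status is gone
# by the time $? is read.
capture_lossy() {
  wait "$1" || true
  LAST_RC=$?
}

# Exact: the assignment on the right-hand side of || sees the status of `wait`,
# and the list as a whole still succeeds, so `set -e` does not abort the script.
capture_exact() {
  LAST_RC=0
  wait "$1" || LAST_RC=$?
}

# Stand-in for a child terminated by SIGTERM (128 + 15 = 143).
(exit 143) &
capture_lossy "$!"
lossy_rc=$LAST_RC

(exit 143) &
capture_exact "$!"
exact_rc=$LAST_RC

printf 'lossy=%s exact=%s\n' "$lossy_rc" "$exact_rc"
```

With the lossy form, a `case` on the captured value can only ever reach its `0)` branch. The exact form keeps 137 and 143 distinguishable while remaining safe under `set -e`.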