tito-PDF Documentation

OCR

tito-pdf can run OCR for PDFs using ocrmypdf (Python package + CLI), which in turn uses tesseract.

OCR is only relevant for PDF inputs.

When you need OCR

OCR helps when the PDF has no text layer, or only a poor one.

If the PDF already has a good text layer, OCR can be unnecessary work and sometimes makes output noisier.

Dependencies

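For OCR to run, ocrmypdf must be available (either its console entrypoint on PATH or the installed Python package), along with tesseract, which ocrmypdf uses.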

Recommended installs on macOS:

brew install tesseract
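Before relying on OCR, it can help to confirm that both tools are reachable. The following is a minimal sketch, not tito-pdf's own code: the helper name `missing_ocr_tools` is hypothetical, and it assumes ocrmypdf counts as available when either its console script is on PATH or the package is importable.

```python
import importlib.util
import shutil


def missing_ocr_tools(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return the names of OCR tools that cannot be found (hypothetical helper)."""
    missing = []
    # tesseract is an external binary, so it has to be on PATH.
    if which("tesseract") is None:
        missing.append("tesseract")
    # ocrmypdf is usable through its console script or as an importable package.
    if which("ocrmypdf") is None and find_spec("ocrmypdf") is None:
        missing.append("ocrmypdf")
    return missing
```

Calling `missing_ocr_tools()` with no arguments checks the current environment; an empty list means both tools were found. The `which` and `find_spec` parameters exist only so the lookup can be stubbed out in tests.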

How ocrmypdf is invoked (implementation detail)

tito-pdf prefers the ocrmypdf console entrypoint when it is available on PATH.

If the entrypoint is not on PATH but the Python package is installed, it falls back to:

Flags used:

If OCR fails, tito-pdf prints a warning and continues with the non-OCR PDF.
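The lookup order and the warn-and-continue behavior described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not tito-pdf's actual implementation: both function names are hypothetical, the package fallback is assumed to be `python -m ocrmypdf`, and no OCR flags are passed because they are not listed here.

```python
import importlib.util
import shutil
import subprocess
import sys


def resolve_ocrmypdf_command(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return the argv prefix for ocrmypdf, or None if it is unavailable."""
    # Prefer the console entrypoint when it is on PATH.
    entrypoint = which("ocrmypdf")
    if entrypoint is not None:
        return [entrypoint]
    # Assumed fallback: run the installed package as a module.
    if find_spec("ocrmypdf") is not None:
        return [sys.executable, "-m", "ocrmypdf"]
    return None


def ocr_or_passthrough(input_pdf, output_pdf, command=None, run=subprocess.run):
    """Try OCR; on any failure, warn and return the original (non-OCR) PDF path."""
    if command is None:
        command = resolve_ocrmypdf_command()
    if command is None:
        print("warning: ocrmypdf not found; continuing without OCR", file=sys.stderr)
        return input_pdf
    try:
        # check=True turns a non-zero exit status into CalledProcessError.
        run([*command, input_pdf, output_pdf], check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"warning: OCR failed ({exc}); continuing without OCR", file=sys.stderr)
        return input_pdf
    return output_pdf
```

Returning a path in both cases lets the caller carry on with whichever PDF is usable, which mirrors the "print a warning and continue" behavior rather than aborting the run.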

OCR behavior by mode

--mode robust (default)

Good for:

--mode fast

Good for:

--mode best

Good for:

Explicit OCR flags

--no-ocr

Disable OCR completely.

--force-ocr

Force OCR even if the PDF already has a text layer.

Notes:

Failure behavior

If OCR fails for any reason, tito-pdf prints a warning and continues with the non-OCR PDF.

This is intentional:

Performance notes

OCR can be slow. When iterating:

tito-pdf input.pdf --mode best --md-out out/input.md --max-pages 10
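When comparing modes on a page-limited run, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands; it uses the `--mode`, `--md-out`, and `--max-pages` flags shown above, while the helper name `preview_commands` and the `preview-<mode>.md` output names are hypothetical.

```python
import shlex


def preview_commands(input_pdf, out_dir="out", max_pages=10,
                     modes=("fast", "robust", "best")):
    """Build one page-limited tito-pdf command line per mode."""
    commands = []
    for mode in modes:
        commands.append([
            "tito-pdf", input_pdf,
            "--mode", mode,
            # Separate output file per mode so the results can be compared.
            "--md-out", f"{out_dir}/preview-{mode}.md",
            "--max-pages", str(max_pages),
        ])
    return commands


for argv in preview_commands("input.pdf"):
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv lists can be handed to `subprocess.run` directly.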

See: Pipeline.