tito-PDF Documentation

Design rationale

This page answers a practical question:

Why does tito-pdf use multiple tools/libraries instead of “one Python helper”?

Short answer: because PDF/DOCX conversion is not one problem.

…are separate failure modes, and no single library is best-in-class for all of them.

This repo focuses on:

What this code is

tito-pdf is a single-document converter.

Inputs:

Outputs (optional, user-controlled):

Key contracts implemented in main():

Why the code is a single script (repo layout)

The entrypoint is tito-pdf (one Python file) on purpose:

Internally, the script still has multiple helpers (functions) to keep stages separated.

Why multiple external tools / libraries exist

1) qpdf (system tool)

Used in prepare_pdf().

Problem it solves:

What tito-pdf does:

Why not do this “pure Python”?

2) ocrmypdf + tesseract (OCR toolchain)

Used in ocr_pdf().

Problem it solves:

Why ocrmypdf?

What tito-pdf does:

Why not OCR in pure Python?

3) PyMuPDF (fitz) (Python library)

Used in extract_lines_layout(), and also serves as the primary table finder.

Problems it solves:

Why not pdfplumber for everything?

4) Multiple table extractors (PyMuPDF + optional Camelot + pdfplumber)

Used in extract_tables().

Problem it solves:

What tito-pdf does:

Why the “stop early” behavior?

Why Camelot is optional:

Why strict vs lenient?

This is reflected in the code:

5) python-docx (DOCX parsing)

DOCX is not PDF.

Problem it solves:

What tito-pdf does:

Why multiple internal helpers exist

This script is a pipeline.
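A rough sketch of what a staged pipeline looks like in code follows. The stage names come from this page, but everything else is an assumption made for illustration: the RunState type, the signatures, the stub bodies, the stage order, and the do_ocr / do_tables switches are hypothetical and do not describe the real script.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunState:
    """Hypothetical carrier for data passed between stages."""
    pdf_path: str
    steps: List[str] = field(default_factory=list)


# Stub stages: each one only records that it ran.
def prepare_pdf(state: RunState) -> RunState:
    state.steps.append("prepare_pdf")
    return state


def ocr_pdf(state: RunState) -> RunState:
    state.steps.append("ocr_pdf")
    return state


def extract_lines_layout(state: RunState) -> RunState:
    state.steps.append("extract_lines_layout")
    return state


def extract_tables(state: RunState) -> RunState:
    state.steps.append("extract_tables")
    return state


def run(pdf_path: str, do_ocr: bool = True, do_tables: bool = True) -> RunState:
    """Compose the stages; optional stages are skipped with a plain `if`."""
    state = prepare_pdf(RunState(pdf_path))
    if do_ocr:
        state = ocr_pdf(state)
    state = extract_lines_layout(state)
    if do_tables:
        state = extract_tables(state)
    return state
```

Because each stage is its own function, a caller (or a test) can run or skip one stage without touching the others.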

Separation into functions (e.g. prepare_pdf(), ocr_pdf(), extract_lines_layout(), extract_tables()) exists so that:

How parameters map to implementation

--mode

--mode sets sensible defaults:

Explicit flags win:
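The precedence rule (mode supplies defaults, explicit flags override them) can be sketched with argparse. Only --mode is taken from this page; the mode names "fast" and "thorough", the --ocr / --tables switches, and the MODE_DEFAULTS table are hypothetical placeholders, not the real options. The sketch assumes Python 3.9+ for argparse.BooleanOptionalAction.

```python
import argparse
from typing import Dict, List

# Hypothetical mode table: each mode maps to a set of default settings.
MODE_DEFAULTS: Dict[str, Dict[str, bool]] = {
    "fast": {"ocr": False, "tables": False},
    "thorough": {"ocr": True, "tables": True},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--mode", choices=sorted(MODE_DEFAULTS), default="fast")
    # default=None means "the user did not say", which is distinct from False.
    parser.add_argument("--ocr", action=argparse.BooleanOptionalAction, default=None)
    parser.add_argument("--tables", action=argparse.BooleanOptionalAction, default=None)
    return parser


def resolve(argv: List[str]) -> Dict[str, bool]:
    """Start from the mode's defaults, then apply any explicit flags."""
    args = build_parser().parse_args(argv)
    settings = dict(MODE_DEFAULTS[args.mode])
    for key in settings:
        explicit = getattr(args, key)
        if explicit is not None:
            settings[key] = explicit
    return settings
```

For example, resolve(["--mode", "thorough", "--no-ocr"]) keeps the mode's table setting but turns OCR off. The tri-state default (None) is what lets the code tell "not given" apart from an explicit "off".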

Output mode (explicit vs convenience)

If you set any explicit output path, the tool enters explicit output mode.

Design intent:

Why --tables-audit-out requires --tables-out

The audit is a companion to the tables Markdown file; the contract enforces that the two are requested together.

Why you cannot request only --assets-json

Assets JSON is a companion “receipt”; the implementation requires at least one content output to be requested.
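The two companion-output rules above can be checked right after argument parsing. This is a minimal sketch under stated assumptions, not the checks in main(): --tables-out, --tables-audit-out, and --assets-json are named on this page, while --md-out, the contract_errors helper, and the decision that --tables-out counts as a content output are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("input")
    parser.add_argument("--md-out")            # hypothetical content output
    parser.add_argument("--tables-out")
    parser.add_argument("--tables-audit-out")
    parser.add_argument("--assets-json")
    return parser


def contract_errors(args: argparse.Namespace) -> List[str]:
    """Return one message per violated output contract (empty if valid)."""
    errors: List[str] = []
    # The audit file only makes sense alongside the tables file.
    if args.tables_audit_out and not args.tables_out:
        errors.append("--tables-audit-out requires --tables-out")
    # Assumption: these two flags are the "content outputs".
    content_requested = any([args.md_out, args.tables_out])
    if args.assets_json and not content_requested:
        errors.append("--assets-json requires at least one content output")
    return errors
```

A caller would typically pass each message to parser.error(), which prints usage and exits with status 2, so an invalid combination fails before any conversion work starts.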

What to read in the source

Jump points in tito-pdf:

See also: