tito-PDF Documentation

Pipeline

This page documents the internal stages of tito-pdf as implemented in the tito-pdf script.

Goals:

High-level flow

tito-pdf converts one input document per invocation.

Conceptually:

PDF pipeline

1) Prepare PDF (prepare_pdf)

Purpose: create a working copy that is easier for downstream parsers.

Implementation:
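As an illustration only, a preparation step of this kind could be sketched as below. The function name `prepare_pdf_sketch`, the output name `prepared.pdf`, the choice of the `--decrypt` flag, and the copy fallback are all assumptions made for the example; they are not taken from the tito-pdf script.

```python
import shutil
import subprocess
from pathlib import Path


def prepare_pdf_sketch(src: Path, workdir: Path) -> Path:
    """Return a working copy of src inside workdir (hypothetical helper).

    Assumption: if qpdf is on PATH, let it rewrite the file; otherwise, or
    if qpdf fails, fall back to a plain byte-for-byte copy.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    dst = workdir / "prepared.pdf"

    qpdf = shutil.which("qpdf")
    if qpdf is not None:
        result = subprocess.run(
            [qpdf, "--decrypt", str(src), str(dst)],
            capture_output=True,
        )
        # qpdf exits with 0 on success and 3 on success with warnings.
        if result.returncode in (0, 3) and dst.exists():
            return dst

    # Fallback (assumed behavior): keep the pipeline going with an exact copy.
    shutil.copyfile(src, dst)
    return dst
```

The original input is never modified in this sketch; later stages would read only the working copy.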

Why qpdf exists in the pipeline:

2) OCR (ocr_pdf)

Purpose: improve extraction quality for scanned PDFs or PDFs with a bad text layer.

Tools: ocrmypdf (Python package and CLI), which relies on tesseract (a system tool).

Behavior:

Invocation details:

Failure mode:

See: OCR.

3) Layout-aware text extraction (extract_lines_layout)

Purpose: extract text together with position and font metadata, so the Markdown output can reflect document structure instead of being flat text.

Implementation details:

Tool: PyMuPDF (fitz).

What we extract:
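A minimal sketch of turning PyMuPDF's dictionary output into positioned lines follows. The `Line` dataclass and the function name `lines_from_page_dict` are hypothetical and the exact fields tito-pdf keeps are not known here. The input is assumed to be the structure returned by `page.get_text("dict")`, in which blocks contain lines and lines contain spans.

```python
from dataclasses import dataclass


@dataclass
class Line:
    page: int
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    size: float  # largest span font size on the line
    bold: bool   # True if every non-empty span is flagged bold


def lines_from_page_dict(page_dict: dict, page_number: int) -> list[Line]:
    """Flatten one page's get_text("dict") result into Line records."""
    lines: list[Line] = []
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # 0 = text block, 1 = image block
            continue
        for raw in block.get("lines", []):
            spans = raw.get("spans", [])
            text = "".join(span.get("text", "") for span in spans).strip()
            if not text:
                continue  # skip whitespace-only lines
            visible = [s for s in spans if s.get("text", "").strip()]
            x0, y0, x1, y1 = raw["bbox"]
            lines.append(
                Line(
                    page=page_number,
                    text=text,
                    x0=x0, y0=y0, x1=x1, y1=y1,
                    size=max(s["size"] for s in visible),
                    # Bit 4 (value 16) of a span's flags marks bold text.
                    bold=all(s.get("flags", 0) & 16 for s in visible),
                )
            )
    return lines


# Usage with PyMuPDF installed (not executed here):
#   import fitz
#   doc = fitz.open("input.pdf")
#   lines = lines_from_page_dict(doc[0].get_text("dict"), 0)
```

Keeping the function free of any direct PyMuPDF import makes it easy to exercise with hand-built dictionaries.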

4) Header/footer dropping (drop_repeated_headers_footers)

Purpose: remove repeated page furniture that would pollute the Markdown.

Heuristic:
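One plausible form of such a heuristic is sketched below. The margin band, the digit normalization, and the page-share threshold are assumptions chosen for the example and may differ from what `drop_repeated_headers_footers` actually does; the function name and the `(y_top, text)` line shape are hypothetical as well.

```python
import re
from collections import defaultdict


def drop_repeated_sketch(pages, page_height, band=0.1, min_share=0.6):
    """Drop lines that repeat in the top/bottom margin across many pages.

    pages: list of pages, each a list of (y_top, text) tuples.
    band: fraction of the page height treated as header/footer zone.
    min_share: fraction of pages a line must appear on to count as repeated.
    """
    def key(text):
        # Replace digit runs so "Page 3" and "Page 4" compare equal.
        return re.sub(r"\d+", "#", text.strip().lower())

    def in_band(y):
        return y < page_height * band or y > page_height * (1 - band)

    # Record on which pages each normalized margin line occurs.
    seen = defaultdict(set)
    for index, page in enumerate(pages):
        for y, text in page:
            if in_band(y):
                seen[key(text)].add(index)

    threshold = max(2, int(len(pages) * min_share))
    repeated = {k for k, indexes in seen.items() if len(indexes) >= threshold}

    return [
        [(y, t) for y, t in page if not (in_band(y) and key(t) in repeated)]
        for page in pages
    ]
```

Lines in the middle of the page are never dropped in this sketch, even if their text matches a header, which keeps body content that happens to repeat the document title.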

5) Markdown reconstruction (lines_to_markdown)

Purpose: convert positioned lines into best-effort Markdown.

Key heuristics:
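To make the idea concrete, here is a small sketch of font-size-based reconstruction. The thresholds (1.6 and 1.25 times the body size), the bullet characters, and the function name `lines_to_markdown_sketch` are assumptions for illustration, not values taken from `lines_to_markdown`.

```python
from collections import Counter


def lines_to_markdown_sketch(lines):
    """Convert (text, font_size) pairs in reading order to Markdown.

    The most common font size is treated as body text; clearly larger
    lines become headings, and lines starting with a bullet glyph
    become list items.
    """
    if not lines:
        return ""

    body_size = Counter(round(size, 1) for _, size in lines).most_common(1)[0][0]

    out = []
    for text, size in lines:
        stripped = text.strip()
        ratio = size / body_size
        if ratio >= 1.6:
            out.append(f"# {stripped}")
        elif ratio >= 1.25:
            out.append(f"## {stripped}")
        elif stripped.startswith(("•", "◦", "▪")):
            out.append("- " + stripped[1:].strip())
        else:
            out.append(stripped)
    return "\n".join(out)
```

Relative sizes are used rather than absolute point sizes so that the same rule works for documents set in different base fonts.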

6) Plaintext export (lines_to_text)

Purpose: provide a raw, non-Markdown text stream for downstream tools.

Compared to Markdown:

7) Table extraction (extract_tables)

Purpose: extract tables deterministically without an LLM.

Strategy order:

  1. PyMuPDF table finder (primary)
  2. Camelot (optional; only if installed)
  3. pdfplumber (fallback)
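The ordering above can be modeled as a simple fallback chain. In this sketch the first strategy that returns any tables wins, which is an assumption; the real `extract_tables` may combine or filter results differently. The name `extract_tables_sketch` and the `Strategy` callable shape are hypothetical.

```python
from typing import Callable, Optional

Table = list[list[str]]
Strategy = Callable[[str], list[Table]]


def extract_tables_sketch(
    pdf_path: str,
    strategies: list[tuple[str, Strategy]],
) -> tuple[Optional[str], list[Table]]:
    """Try each (name, strategy) in order; return the first non-empty result."""
    for name, strategy in strategies:
        try:
            tables = strategy(pdf_path)
        except Exception:
            # Covers a missing optional dependency (ImportError) as well as
            # a parser failing on this particular file.
            continue
        if tables:
            return name, tables
    return None, []
```

A caller would pass the strategies in priority order, for example wrappers around the PyMuPDF table finder, Camelot, and pdfplumber, each of which imports its library inside the wrapper so that a missing optional package only disables that one strategy.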

Strict vs lenient:

Output:

See: Tables.

DOCX pipeline

Tool: python-docx.

1) Markdown extraction (extract_docx_markdown)
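As a rough sketch of style-driven conversion: python-docx exposes each paragraph's style name (for example "Heading 1" or "List Bullet") through `paragraph.style.name`, and that name can be mapped to a Markdown prefix. The mapping and the function name `docx_paragraph_to_markdown` are assumptions for illustration and may not match `extract_docx_markdown`.

```python
import re


def docx_paragraph_to_markdown(style_name: str, text: str) -> str:
    """Map one paragraph (style name + text) to a Markdown line."""
    text = text.strip()
    if not text:
        return ""

    match = re.fullmatch(r"Heading (\d)", style_name)
    if match:
        # Markdown supports heading levels 1 through 6 only.
        level = max(1, min(int(match.group(1)), 6))
        return "#" * level + " " + text
    if style_name == "Title":
        return "# " + text
    if style_name.startswith("List Bullet"):
        return "- " + text
    if style_name.startswith("List Number"):
        return "1. " + text
    return text


# Usage with python-docx installed (not executed here):
#   from docx import Document
#   doc = Document("input.docx")
#   md = "\n\n".join(
#       line
#       for p in doc.paragraphs
#       if (line := docx_paragraph_to_markdown(p.style.name, p.text))
#   )
```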

2) Raw text extraction (extract_docx_text)

3) Table extraction (extract_docx_tables)

Output writing (atomic)

All outputs are written after extraction succeeds.

Implementation detail:
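A common way to make file writes atomic is to write to a temporary file in the destination directory and then rename it over the target. The sketch below shows that pattern; it is not taken from the tito-pdf script, and the helper name `write_atomic` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, data: str) -> None:
    """Write data to path so readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing target in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this pattern an interrupted run leaves either the previous output or no output at the target path, never a truncated file.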

Benefits: