Pipeline
This page documents the internal stages of tito-pdf as implemented in the tito-pdf script.
Goals:
- Describe the real pipeline stages implemented in the
tito-pdfscript. - Explain why each stage exists and what it outputs.
- Keep the output contract stable (explicit outputs are authoritative; no extra output folders).
High-level flow
tito-pdf converts one input document per invocation.
Conceptually:
- Validate input path and infer kind: PDF or DOCX.
- Resolve output mode:
- Explicit output mode (write only to the requested paths), or
- Convenience mode (derive
<stem>.md/<stem>.tables.md).
- Run pipeline stages.
- Write outputs atomically.
- Optionally write assets/metrics JSON.
PDF pipeline
1) Prepare PDF (prepare_pdf)
Purpose: create a working copy that is easier for downstream parsers.
Implementation:
- Requires
qpdfon PATH. - Runs
qpdf --decrypt input.pdf prepared.pdf. - If
qpdfis missing or the command fails: the run stops with an error.
Why qpdf exists in the pipeline:
- Some PDFs are encrypted or have structural quirks.
- Normalization improves the chance that PyMuPDF and table extractors can read the file reliably.
2) OCR (ocr_pdf)
Purpose: improve extraction quality for scanned PDFs or PDFs with a bad text layer.
Tool: ocrmypdf (Python package + CLI) + tesseract (system tool).
Behavior:
- If
ocrmypdfis available,tito-pdfruns it. - Default OCR behavior is conservative:
- It uses
ocrmypdf --skip-textso that text-based PDFs are not rewritten unnecessarily.
- It uses
- Forced OCR uses
ocrmypdf --force-ocr.
Invocation details:
- It prefers the
ocrmypdfCLI entrypoint. - If the entrypoint is not on PATH but the Python package is installed, it falls back to
python -m ocrmypdf. - It passes
--quietand--output-type pdfto keep output stable.
Failure mode:
- If OCR fails,
tito-pdfprints a warning and continues using the non-OCR PDF.
See: OCR.
3) Layout-aware text extraction (extract_lines_layout)
Purpose: get text with position and font metadata so we can make better Markdown than “just text”.
Implementation details:
- The extraction walks pages and calls
page.get_text("dict"). - Lines are later sorted in reading order by
(page, y0, x0). --max-pages Nlimits how many pages are processed in this stage.
Tool: PyMuPDF (fitz).
What we extract:
- Per line, we record:
- page number
- normalized text
- font size (median of span sizes)
- bounding box (
x0,y0,x1,y1) - page width/height
- bold heuristic (font name contains “bold”)
4) Header/footer dropping (drop_repeated_headers_footers)
Purpose: remove repeated page furniture that would pollute the Markdown.
Heuristic:
- If the document has < 3 pages, we only drop page numbers.
- Otherwise we count short strings that appear near the top/bottom of pages.
- Anything that appears on ~60% of pages is treated as a header/footer and removed.
5) Markdown reconstruction (lines_to_markdown)
Purpose: convert positioned lines into best-effort Markdown.
Key heuristics:
- Infer a “body font size” from the mode of typical line sizes.
- Headings:
- Larger font sizes become
#,##,###. - OCR/uniform-font fallbacks: bold lines, centered uppercase, or numbered headings.
- Larger font sizes become
- Lists:
- bullets
- * •and numbered lists like1)/1.are normalized.
- bullets
- Paragraph joining:
- lines are joined until a large vertical gap implies a new paragraph.
- basic hyphenation repair joins
foo-+bar=>foobar.
6) Plaintext export (lines_to_text)
Purpose: provide a raw, non-Markdown text stream for downstream tools.
Compared to Markdown:
- No headings, no list normalization.
- Still uses paragraph joining and hyphenation repair.
7) Table extraction (extract_tables)
Purpose: extract tables deterministically without an LLM.
Strategy order:
- PyMuPDF table finder (primary)
- Camelot (optional; only if installed)
- pdfplumber (fallback)
Strict vs lenient:
- Strict uses PyMuPDF
find_tables(lines/lines)and filters aggressively. - Lenient enables text-based strategies (higher recall, more false positives).
Output:
- Tables Markdown (
--tables-outor<stem>.tables.md) - Tables audit JSON (
--tables-audit-out)
See: Tables.
DOCX pipeline
Tool: python-docx.
1) Markdown extraction (extract_docx_markdown)
- Paragraphs are emitted as Markdown paragraphs.
- DOCX heading styles (
Heading 1..Heading 6) become Markdown headings.
2) Raw text extraction (extract_docx_text)
- Concatenates paragraphs as plaintext.
3) Table extraction (extract_docx_tables)
- Extracts DOCX tables and converts them to Markdown tables.
- Produces a small audit list (tool, table index, rows, cols).
Output writing (atomic)
All outputs are written after extraction succeeds.
Implementation detail:
- Each file is written to a
*.tmpfile and then renamed into place.
Benefits:
- Avoids leaving a misleading empty output if extraction fails mid-run.