tito-PDF Documentation

Implementation details

This page documents code-level behavior in tito-pdf (the Python script), including thresholds and heuristics.

Where logic lives

Key functions in the tito-pdf script:

Output mode resolution (exact behavior)

main() computes:

Explicit output mode

If explicit_output_mode is true:

Convenience mode

If no explicit paths are set:

There is no convenience-mode plaintext output file; --raw-text-out is explicit-only.

Mode + override resolution (exact mapping)

main() treats --mode as a high-level knob that sets defaults.

Inputs:

Resolution logic (as implemented):

PDF preparation (prepare_pdf)

prepare_pdf(input_pdf, output_pdf) produces a normalized working copy for downstream parsing.

qpdf step

OCR stage (ocr_pdf)

ocr_pdf(input_pdf, output_pdf, force):

Flags used:

Failure behavior:

PDF line extraction (extract_lines_layout)

Line extraction uses PyMuPDF, which provides the layout metadata (such as font size and position) that later stages rely on.

Implementation notes:

--max-pages N limits how many pages are processed.

Header/footer dropping (drop_repeated_headers_footers)

This is a heuristic filter intended to remove repeated page furniture.

Rules:

Page numbers:

Markdown reconstruction (lines_to_markdown)

Lines are sorted in reading order:

Body font size inference (infer_body_font_size)

The “body size” is the mode of rounded font sizes from “normal” lines:

If nothing qualifies, body defaults to 12.0.
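A minimal sketch of this inference, assuming each line is available as a (text, font_size) pair. The 20-character cutoff that stands in for "normal" lines, the one-decimal rounding, and the function name are illustrative assumptions, not the script's actual rules; only the mode-of-rounded-sizes idea and the 12.0 fallback come from the description above.

```python
from collections import Counter

DEFAULT_BODY_SIZE = 12.0  # documented fallback when no line qualifies


def infer_body_font_size_sketch(lines, min_chars=20):
    """Return the most common rounded font size among 'normal' lines.

    `lines` is an iterable of (text, font_size) pairs. The `min_chars`
    filter is a hypothetical stand-in for the real qualifying rules.
    """
    sizes = [
        round(size, 1)  # rounding precision is an assumption
        for text, size in lines
        if len(text.strip()) >= min_chars
    ]
    if not sizes:
        return DEFAULT_BODY_SIZE
    # most_common(1) returns [(value, count)] for the modal value
    return Counter(sizes).most_common(1)[0][0]
```

Rounding before counting keeps sizes that differ only by extraction noise (for example 9.98 and 10.02) in the same bucket.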

Heading detection (_is_heading)

A line is treated as a heading if:

Primary rule:

Fallbacks (for OCR/uniform-font PDFs):

Centered is defined as:

Heading level (_heading_level)

Size-based levels:

Fallback numbering rule:

List detection (_is_list_item)

A list item is detected if the line matches:

Normalization:

Paragraph joining

Consecutive lines are joined into one paragraph until a vertical gap indicates the start of a new paragraph.
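The gap-based grouping can be sketched as follows. The (text, top, bottom) line representation, the `gap_factor` threshold of 0.6 line heights, and the function name are hypothetical choices made for illustration; the script's real threshold is not reproduced here.

```python
def join_paragraphs_sketch(lines, gap_factor=0.6):
    """Group lines into paragraphs using vertical gaps.

    `lines` is a list of (text, top, bottom) tuples in reading order on
    one page, with y growing downward. `gap_factor` is a hypothetical
    threshold: a gap larger than gap_factor * line height starts a new
    paragraph.
    """
    paragraphs = []
    current = []
    prev_bottom = None
    for text, top, bottom in lines:
        height = bottom - top
        if prev_bottom is not None and (top - prev_bottom) > gap_factor * height:
            # Gap is large relative to the line height: close the paragraph.
            paragraphs.append(" ".join(current))
            current = []
        current.append(text.strip())
        prev_bottom = bottom
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Scaling the threshold by line height, rather than using a fixed number of points, is one way to keep the rule stable across documents with different font sizes.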

Hyphenation repair:

Plaintext reconstruction (lines_to_text)

Plaintext export keeps paragraph joining and hyphenation repair but does not produce headings or lists.

Differences from Markdown:

Tables extraction (extract_tables)

This stage produces:

Strategy order (PDF)

1) PyMuPDF table finder

2) Camelot (optional)

3) pdfplumber fallback

The function returns early when a strategy yields at least one accepted table.
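The early-return behavior can be illustrated with a small fallback chain. This is a sketch under stated assumptions: the `strategies` list of (name, callable) pairs, the function signature, and the choice to skip a strategy that raises are all hypothetical, and the real extract_tables may differ on each point.

```python
def extract_tables_sketch(pdf_path, strategies, should_accept):
    """Try each strategy in order; return once one yields accepted tables.

    `strategies` is an ordered list of (name, callable) pairs, each
    callable taking the PDF path and returning candidate tables. Both
    the callables and this signature are hypothetical.
    """
    for name, run in strategies:
        try:
            candidates = run(pdf_path)
        except Exception:
            # Assumed behavior: a failing or missing backend is skipped.
            continue
        accepted = [table for table in candidates if should_accept(table)]
        if accepted:
            return name, accepted  # early return: later strategies never run
    return None, []
```

With this shape, a cheaper or more reliable strategy placed first prevents the later ones from running at all whenever it produces an accepted table.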

Table normalization + dedup

Deduplication is done by a content signature:

Tables with duplicate signatures are not emitted twice.
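One way to picture signature-based deduplication is sketched below. The normalization (collapsing whitespace and lowercasing each cell) and the use of SHA-256 are assumptions chosen for illustration; the actual signature used by the script is not reproduced here, and both function names are hypothetical.

```python
import hashlib


def table_signature_sketch(rows):
    """Hash whitespace-normalized, lowercased cell text (assumed normalization)."""
    normalized = "\n".join(
        "\t".join(
            " ".join(("" if cell is None else str(cell)).split()).lower()
            for cell in row
        )
        for row in rows
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_tables_sketch(tables):
    """Keep the first table seen for each signature, preserving order."""
    seen = set()
    unique = []
    for rows in tables:
        signature = table_signature_sketch(rows)
        if signature in seen:
            continue  # same content already emitted
        seen.add(signature)
        unique.append(rows)
    return unique
```

Hashing normalized content rather than comparing raw cells means the same table found twice with slightly different spacing still collapses to one entry.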

Markdown conversion rules:

Acceptance filters (exact thresholds)

The core filters live in should_accept(...).

Hard size limits:

Two-row sparse grids:

Sparsity:

Bounding box guards (when bbox is available):

PyMuPDF multi-column guard:

Text-strategy guard:

Near-full-page hard stop:

Audit JSON fields

The tables audit JSON includes:

DOCX extraction details

DOCX uses python-docx.

Headings

DOCX headings are detected by paragraph style name matching:

The level is clamped to 1..6 and mapped to Markdown headings.
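A minimal sketch of the style-name mapping, assuming the conventional python-docx style names of the form "Heading 1", "Heading 2", and so on. The regular expression and the function name are illustrative assumptions; only the clamp to 1..6 and the mapping to Markdown headings are taken from the description above.

```python
import re

# Assumed pattern: built-in Word styles are named "Heading 1", "Heading 2", ...
_HEADING_STYLE = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)


def docx_heading_to_markdown_sketch(style_name, text):
    """Return a Markdown heading for heading styles, else None."""
    match = _HEADING_STYLE.match((style_name or "").strip())
    if not match:
        return None
    level = max(1, min(6, int(match.group(1))))  # clamp to 1..6
    return "#" * level + " " + text.strip()
```

The clamp matters because Word allows heading levels up to 9, while Markdown headings stop at six `#` characters.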

Tables

Tables are converted to Markdown using the first row as the header and a --- separator row.
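The conversion can be sketched as below, taking rows as lists of cell strings. Padding short rows to the widest row and escaping pipe characters are assumptions added so the output stays a valid table; the function name is hypothetical.

```python
def rows_to_markdown_sketch(rows):
    """Render rows as a Markdown table, using the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and pad short rows so every line has `width` cells
        # (both are assumptions, not documented behavior).
        cells = [str(cell).replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(row) for row in body)])
```

For example, `rows_to_markdown_sketch([["Name", "Qty"], ["Apple", "1"]])` yields a header line, a `| --- | --- |` separator, and one body row.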

Assets JSON (toolchain capture)

When --assets-json is requested, tito-pdf captures:

System tool versions are read by running each tool with --version and capturing the first line of its output (with a short timeout).
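A minimal sketch of such a version probe using the standard library. The 5-second timeout, the fallback to stderr, the choice to return None on failure, and the function name are assumptions for illustration, not the script's confirmed behavior.

```python
import subprocess


def tool_version_sketch(executable, timeout=5.0):
    """Run `<executable> --version` and return the first output line, or None."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,  # hypothetical value for the "short timeout"
        )
    except (OSError, subprocess.TimeoutExpired):
        # Tool missing, not executable, or too slow: report no version.
        return None
    # Some tools print their version to stderr, so fall back to it (assumed).
    output = result.stdout.strip() or result.stderr.strip()
    return output.splitlines()[0] if output else None
```

Keeping only the first line avoids storing license banners or build details that many tools print after the version string.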

See: Assets JSON.