Show HN: Validated Table Extractor – Verify PDF Tables Using Docling + Vision LLMs

  • Posted 3 hours ago by 2dogsanerd
https://github.com/2dogsandanerd/validated-table-extractor
Hey HN,

I built this because I got tired of "silent failures" in traditional PDF table extraction tools.

In my day job working with financial and legal documents, tools like Camelot or Tabula often return data that looks plausible but has shifted columns or missing decimal points. In regulated environments, you can't afford to guess.

I built a pipeline that treats extraction as a hypothesis to be verified:

1. *Extraction:* Uses IBM’s Docling to parse the layout and get the structure (Markdown).

2. *Visual Verification:* Captures a screenshot of the specific table region from the PDF.

3. *Validation:* Feeds both the Markdown and the Screenshot into a local Vision LLM (Llama 3.2 via Ollama).

4. *Scoring:* The LLM compares the rendered table image (the "pixel truth") against the extracted text and outputs a confidence score plus an audit trail.
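
To make steps 3 and 4 concrete, here is a minimal sketch of the validation step. It is not the repo's actual code: the prompt text, the `ValidationResult` type, the JSON reply format, and the `ask_model` callable are all assumptions made for illustration. `ask_model` stands in for whatever sends a prompt plus a base64-encoded image to the local vision model and returns its text reply.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical prompt; the prompt used in the repo is not shown in the post.
PROMPT_HEADER = (
    "Compare the table in the attached image with the Markdown below.\n"
    'Reply with JSON only: {"confidence": <0.0-1.0>, "issues": [<strings>]}\n\n'
)


@dataclass
class ValidationResult:
    confidence: float                      # 0.0 (no match) to 1.0 (full match)
    issues: list[str] = field(default_factory=list)
    raw_reply: str = ""                    # kept verbatim for the audit trail


def validate_table(
    markdown: str,
    image_png: bytes,
    ask_model: Callable[[str, str], str],  # (prompt, image_b64) -> reply text
) -> ValidationResult:
    """Ask a vision model whether the extracted Markdown matches the image."""
    prompt = PROMPT_HEADER + markdown
    image_b64 = base64.b64encode(image_png).decode("ascii")
    reply = ask_model(prompt, image_b64)

    try:
        data = json.loads(reply)
        confidence = float(data["confidence"])
        issues = [str(item) for item in data.get("issues", [])]
    except (ValueError, KeyError, TypeError):
        # An unparseable reply is treated as unverified, never as a pass.
        return ValidationResult(0.0, ["unparseable model reply"], reply)

    # Clamp in case the model returns a score outside [0, 1].
    confidence = min(1.0, max(0.0, confidence))
    return ValidationResult(confidence, issues, reply)
```

In a real run, `ask_model` would wrap a call to the local model (for example through Ollama), with the Markdown coming from Docling and the PNG from the cropped table region. Failing closed on a malformed reply is a deliberate choice here: in a regulated setting, a table that could not be checked should not be scored as if it had been.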

The trade-off is speed for confidence: the pipeline takes ~5s per table. It's designed to run 100% locally for privacy-critical documents.

Repo is here: https://github.com/2dogsandanerd/validated-table-extractor

Would love to hear how you handle data integrity in RAG pipelines!
