I open-sourced a dataset of 5 synthetic bank and credit card statement PDFs designed for testing extraction/parsing accuracy. Each PDF uses a fictional bank with realistic formatting from a different country.

I've been building a bank statement converter (Bankstatemently) and kept discovering edge cases across different banks. At some point I started cataloging them as "quirks", and I'm currently at 36 documented challenges and counting (think: dates without years across year boundaries, credit card charges shown as positive instead of negative, dates hiding inside description text, etc.).

Real bank data is private, so there's no shared dataset to test parsers against. Once I had these quirks, I realized I could use them to reconstruct statements that deliberately include these challenges, so more people can test against them.

There's also a free evaluation API: submit your parsed JSON and get field-level accuracy scores back. Ground truth is held server-side, but that's not necessarily bullet-proof against overfitting.

I would appreciate feedback on which edge cases are missing. I'm planning to make the next 10 statements a bit harder (scanned PDFs, multi-currency across multiple tables, Buddhist era dates).

https://github.com/bankstatemently/bank-statement-parsing-benchmark

You can browse all of the quirks here with real-world examples: https://bankstatemently.com/benchmark/challenges
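
To make the first quirk above concrete (dates without years across year boundaries), here is a minimal Python sketch of one way a parser might resolve it. It is not Bankstatemently's implementation or part of the benchmark: the function name `infer_years`, the `"%d %b"` date format, and the assumptions that transactions are listed in chronological order and that the statement end date is known are all hypothetical choices for illustration.

```python
from datetime import date, datetime


def infer_years(raw_dates, statement_end, fmt="%d %b"):
    """Attach years to year-less dates such as "30 Dec".

    Assumptions (hypothetical, for illustration only):
    - raw_dates are in chronological order, oldest first;
    - no transaction is dated after statement_end;
    - month names match the current locale (English here).
    """
    year = statement_end.year
    upper = statement_end  # latest date the next (earlier) row may have
    resolved = []

    # Walk backwards from the newest row; whenever a date would land
    # after the row that follows it, we have crossed a year boundary.
    for raw in reversed(raw_dates):
        # Parse with an explicit year so that "29 Feb" works in leap years.
        candidate = datetime.strptime(f"{raw} {year}", f"{fmt} %Y").date()
        if candidate > upper:
            year -= 1
            candidate = datetime.strptime(f"{raw} {year}", f"{fmt} %Y").date()
        resolved.append(candidate)
        upper = candidate

    resolved.reverse()
    return resolved


# Example: a statement ending 10 Jan 2025 that starts in late December.
rows = ["28 Dec", "30 Dec", "02 Jan", "05 Jan"]
dates = infer_years(rows, statement_end=date(2025, 1, 10))
```

In this example the two December rows resolve to 2024 and the two January rows to 2025. Walking backwards from the statement end date is a design choice: the end date is usually printed with a full year, so it gives a reliable anchor, whereas guessing from the first row would not.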

Show HN: Open-source synthetic bank statements for testing parsers
