Topaz et al. reported their findings on citation hallucination in May in The Lancet. They scanned 2.5 million PubMed Central articles and estimated that 1 in 277 contained a fabricated citation. Some of their examples were this exact pattern: real identifier, fabricated title.
I originally built Scholar Sidekick as a formatter for my own use as a clinician-educator preparing talks, articles, etc. After reading the Topaz paper, I added a verifier to catch the most common pattern they found: a real identifier attached to the wrong paper.
My tool resolves the identifier, and then compares the title in your reference with the returned metadata (i.e. does this DOI, PMID, or arXiv ID actually point to the right paper?). It does not attempt to judge whether the cited paper actually supports the claim you make in your text. That still needs judgment, preferably human judgment.
I ran 350 previously unseen citations through the API once each in a test. It correctly identified all 37 fabricated references, but wrongly flagged 5 of 285 real references: 1.8% (95% CI 0.8–4.0%). (Plain similarity comparison, without the optional LLM screening - I would expect the LLM to rescue some of those borderline cases. A handful of citations returned no result on upstream timeouts and weren't scorable either way.) The test suite, results and failures are public, so you do not have to take my word for it. You can check them yourself.
The web version is free and anonymous. The REST API and MCP server use a RapidAPI key, with a free rate-limited tier and paid tiers above that. The MCP server is on npm, Smithery and Glama, and the Obsidian plugin is in the community store. Chrome/Firefox/Edge browser extensions in their stores as well.
I'm very open to feedback and look forward to hearing from anyone who tries it - what works? What fails? Thanks in advance.