Show HN: Crovia Spider v1 – Forensic crawler exposing compliance gaps in LAION-5B

  • Posted 6 hours ago by crovia
  • 2 points
https://github.com/croviatrust/crovia-core-engine
Today we're releasing Crovia Spider v1: an open-core forensic tool that digs into existing public AI datasets (2024–2026) for license hints, provenance signals, and compliance holes – no new crawls, no private data touched. Just verifiable clarity on what's already out there.

Ran it on LAION-5B (the backbone of Stable Diffusion, etc.):

Unverified CC-BY 4.0 / 3.0 licenses

Tens of thousands of "unknown" entries

Mixed variants with zero audit trace

First-ever Compliance Score: 14/100 (every model trained on it inherits the risk)

Real receipts (e.g., cid:url_sha256:c7cc5b0acf8330e51ffd1ed02f108e6a9649e13ed3547a14255dad6bdf7f01c5 → cc-by-4.0 unverified).

Why? The EU AI Act hits in 2026: models need reproducible evidence, transparent licensing, and Annex IV bundles. Spider outputs audit packs that plug straight into Crovia Trust (offline Merkle proofs in under 30s). All Apache 2.0, CLI-ready.

Reproduce it: run crovia-spider from-laion --output receipts.ndjson on your dataset. Brutal feedback welcome. Would integrations with HF/FAISS be useful?
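Once you have a receipts.ndjson, a quick way to sanity-check it is to tally license values per line. The sketch below is in Python and assumes each line is a JSON object with a "license" field; that field name and "content_id" are hypothetical here, so check docs/CROVIA_SPIDER_RECEIPT_v1.md for the real schema.

```python
import json
from collections import Counter


def tally_licenses(lines, field="license"):
    """Count license values across NDJSON lines (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get(field, "unknown")] += 1
    return counts


# Sample built from the three receipts quoted in this post.
# The field names are assumptions, not the published receipt schema.
sample = [
    json.dumps({
        "content_id": "cid:url_sha256:c7cc5b0acf8330e51ffd1ed02f108e6a9649e13ed3547a14255dad6bdf7f01c5",
        "license": "cc-by-4.0",
    }),
    json.dumps({
        "content_id": "cid:url_sha256:267ad746f168458aa6aca730d82dd565ba0dbada0107317d2252d3b60d57fade",
        "license": "cc-by-sa-3.0",
    }),
    json.dumps({
        "content_id": "cid:url_sha256:8bad9a02f5b4b1e08e19a6417bd6fb03576c80a80deef4f4a1ca868eb9265e71",
        "license": "unknown",
    }),
]

for name, count in tally_licenses(sample).most_common():
    print(name, count)

# For a real file, pass the open file object instead of the sample list:
#   with open("receipts.ndjson", encoding="utf-8") as f:
#       print(tally_licenses(f))
```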

Let's build the governance layer AI deserves.

Repo: https://github.com/croviatrust/crovia-core-engine

(Real receipts extracted via Crovia Spider)

cid:url_sha256:c7cc5b0acf8330e51ffd1ed02f108e6a9649e13ed3547a14255dad6bdf7f01c5

License: cc-by-4.0 (unverified)

cid:url_sha256:267ad746f168458aa6aca730d82dd565ba0dbada0107317d2252d3b60d57fade

License: cc-by-sa-3.0 (unverified)

cid:url_sha256:8bad9a02f5b4b1e08e19a6417bd6fb03576c80a80deef4f4a1ca868eb9265e71

License: unknown

Docs/Spec: docs/CROVIA_SPIDER_RECEIPT_v1.md
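The cid:url_sha256:... prefix suggests each ID is the SHA-256 hex digest of the source URL. A minimal Python sketch of that reading follows, assuming the URL is hashed as raw UTF-8 bytes with no normalization. Both the hashing input and the helper names are assumptions; the spec is the authority on how IDs are actually derived.

```python
import hashlib
import re

CID_PREFIX = "cid:url_sha256:"
CID_PATTERN = re.compile(r"cid:url_sha256:[0-9a-f]{64}")


def url_cid(url: str) -> str:
    # Assumption: the ID is SHA-256 over the URL's UTF-8 bytes, unnormalized.
    return CID_PREFIX + hashlib.sha256(url.encode("utf-8")).hexdigest()


def is_valid_cid(cid: str) -> bool:
    # Format check only: the prefix plus 64 lowercase hex characters.
    return CID_PATTERN.fullmatch(cid) is not None


# Hypothetical URL, used only to show the shape of the output.
print(url_cid("https://example.com/image.jpg"))

# A receipt ID quoted in this post passes the format check.
print(is_valid_cid(
    "cid:url_sha256:c7cc5b0acf8330e51ffd1ed02f108e6a9649e13ed3547a14255dad6bdf7f01c5"
))
```

If the derivation matches, anyone holding the original URL can recompute the ID and confirm that a receipt refers to it without re-crawling anything.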

#AIGovernance
