Show HN: I got frustrated with SMILES, so I built an alternative

  • Posted 4 hours ago by sangeet01
https://github.com/sangeet01/script
Hi HN, I'm an undergrad in Nepal.

For the last 35 years, computational chemistry and AI drug discovery have relied on SMILES to represent molecules. It was great for the 1980s, but today it is a massive bottleneck. SMILES is not canonical by default (one molecule can be written as many valid strings), its stereochemistry parsing is fragile, and it breaks down completely on organometallics, alloys, and polymers. To parse it reliably, you basically need a 300MB C++ dependency (RDKit) that relies on decades of hard-coded heuristics.

I got frustrated and realized that representing matter isn't a graph theory problem—it’s a linguistics problem.

To fix it, I built SCRIPT (Structural Chemical Representation in Plain Text). I based the core parser on the generative linguistics of Pāṇini’s Sanskrit grammar. Instead of treating a molecule as a string of dumb nodes, SCRIPT treats it as a language of Roots, States (Vibhakti), and Relationships (Sandhi).

I just released V3 today for Pi Day.

How it works & what it fixes:

• Aromaticity without the mess: SMILES uses lowercase letters (c1ccccc1), which causes endless parsing ambiguity. SCRIPT uses an Anubandha (governance marker) on the ring closure. C1CCCCC&6: explicitly tells the parser that the last 6 atoms in the DFS path are resonant.

• Vāk Order Stereochemistry: In SCRIPT, chirality is intrinsically resolved using the Depth-First Search sequence order as the native coordinate frame, making it mathematically order-invariant.

• Organometallics & Materials: Because of the grammar design, SCRIPT natively supports haptic bonds (*5), fractional alloys (Ti<~0.9>N<~0.1>), crystal phases ([[Rutile]] Ti(O)2), and stochastic polymers ({[CC]}n).

• RDKit-Independent: The core engine uses a pure Python Lark grammar. It catches 6-valent carbons during parsing, generates a 100% native round-trip, and hits 95.9% RDKit InChI parity without relying on RDKit's C++ backend.
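
The "&6" marker from the aromaticity bullet can be illustrated with a few lines of plain Python. This is only a rough sketch under stated assumptions, not SCRIPT's actual Lark grammar: the function name read_resonance_marker is hypothetical, the input is assumed to be an unbranched string (so the written order is the DFS order), ring-closure digits are skipped, and a trailing colon is treated as insignificant.

```python
import re

# Hypothetical toy reader for the "&N" marker; NOT SCRIPT's real parser.
ATOM = re.compile(r"[A-Z][a-z]?")  # element symbols such as C, N, Cl


def read_resonance_marker(text):
    """Return (atoms, resonant_indices) for a linear string like 'C1CCCCC&6'."""
    text = text.rstrip(":")  # assumption: a trailing colon carries no meaning here
    body, sep, count = text.partition("&")
    atoms = ATOM.findall(body)  # ring-closure digits are ignored in this sketch
    if not sep:
        return atoms, []  # no marker, so nothing is flagged as resonant
    n = int(count)
    if n > len(atoms):
        raise ValueError(f"marker &{n} covers more atoms than the path holds")
    # The marker applies to the last n atoms visited, in written (DFS) order.
    start = len(atoms) - n
    return atoms, list(range(start, len(atoms)))


atoms, resonant = read_resonance_marker("C1CCCCC&6")
print(atoms, resonant)
```

The point of the design is that aromaticity becomes a single explicit annotation on the path instead of a property spread across the case of every atom symbol.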

Examples:

• Aspirin (SMILES): CC(=O)Oc1ccccc1C(=O)O (or many other valid strings)

• Aspirin (SCRIPT): CC(=O)OC1=CC=CC=C1C(=O)O (deterministic canonicalization)

• Cisplatin: Pt<sqp>(Cl)2(NH3)2@ (preserves square-planar geometry and cis-configuration)
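
To make the shape of the cisplatin string concrete, here is a small, hedged sketch that splits it into a metal center, a geometry tag, and ligand counts using only the standard library. The patterns and the name read_complex are hypothetical and cover only strings shaped like this one example. The trailing "@" is returned as an opaque suffix because the post does not spell out its exact meaning.

```python
import re

# Hypothetical patterns for strings shaped like the cisplatin example;
# they are NOT taken from SCRIPT's grammar.
CENTER = re.compile(r"(?P<metal>[A-Z][a-z]?)<(?P<geometry>[a-z]+)>")
LIGAND = re.compile(r"\((?P<formula>[A-Za-z0-9]+)\)(?P<count>\d*)")


def read_complex(text):
    """Split 'Metal<geometry>(ligand)n...' into its visible parts."""
    head = CENTER.match(text)
    if head is None:
        raise ValueError("expected Metal<geometry> at the start")
    ligands = {}
    pos = head.end()
    # Consume consecutive "(formula)count" groups; a missing count means 1.
    while (m := LIGAND.match(text, pos)):
        n = int(m["count"] or 1)
        ligands[m["formula"]] = ligands.get(m["formula"], 0) + n
        pos = m.end()
    return {
        "metal": head["metal"],
        "geometry": head["geometry"],
        "ligands": ligands,
        "suffix": text[pos:],  # anything left over, kept uninterpreted
    }


print(read_complex("Pt<sqp>(Cl)2(NH3)2@"))
```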

I'm just a daft undergrad splashing through code like a toddler (my wet-lab titrations are a mess, and yes, I've used my mouth to pipette). I would absolutely love your harshest technical feedback, especially from the parser nerds, chemoinformaticians, or anyone working in AI drug discovery. Happy to answer any questions about the grammar or the parser architecture!
