vtaya 4 hours ago

We got tired of vector search returning "nine three zero one" when searching for "Section 9-301" in legal docs. So we built something that learns what matters in YOUR domain.

The problem: pure vector search fails on domain-specific text. Legal sections, medical codes (CYP3A4), error codes (ORA-12545) all need exact matching, not "semantic similarity."

What we built:

- Feed it 20K tokens of your docs
- It learns your domain patterns (sections need exact match, concepts need semantic)
- Generates a tiny DSL with intent rules
- Routes queries through hybrid search (BM25 + vectors) with proper weights
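The fusion step in that last bullet can be sketched roughly like this: min-max normalize each retriever's scores, then blend with the learned keyword weight. This is a minimal illustration, not our actual implementation, and the function names are invented:

```python
# Hypothetical sketch of weighted hybrid fusion: normalize BM25 and vector
# scores to [0, 1], then blend with the keyword weight chosen for the query.
def hybrid_scores(bm25_scores, vector_scores, keyword_weight):
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    b = minmax(bm25_scores)
    v = minmax(vector_scores)
    # keyword_weight + semantic weight sum to 1.0
    return [keyword_weight * bi + (1 - keyword_weight) * vi for bi, vi in zip(b, v)]

# Three candidate docs; a section-citation query leans heavily on BM25
print(hybrid_scores([12.0, 4.0, 8.0], [0.2, 0.9, 0.5], keyword_weight=0.85))
```

Normalizing before blending matters because raw BM25 scores and cosine similarities live on completely different scales.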

Real numbers (UCC Article 9):

| Query | Vector-Only | Our Approach |
|---|---|---|
| "When does 9-301 NOT apply?" | 0.31 | 0.94 |
| "PMSI priority over blanket lien" | 0.22 | 0.91 |
| "Perfection methods for deposit accounts" | 0.19 | 0.96 |

Average precision: 0.24 → 0.94 (that's 3.9x, not "+47%" - my bad on the title)

How it works:

- Detects "§9-301" pattern → 85% keyword weight
- Detects "priority rules" → 60% semantic weight
- First match wins, normalized to sum=1.0
- ~200ms latency on 100K docs
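The routing logic is simple in spirit. Here's a minimal sketch of first-match-wins weight routing, assuming the learned rules boil down to (regex, keyword-weight) pairs; the rule table and names here are illustrative, not our generated DSL:

```python
import re

# Hypothetical learned rules: (pattern, keyword_weight). First match wins.
RULES = [
    (re.compile(r"§?\s*\d+-\d+"), 0.85),    # section citations → mostly keyword
    (re.compile(r"priority rules"), 0.40),  # conceptual phrasing → 60% semantic
]
DEFAULT_KEYWORD_WEIGHT = 0.5

def route(query: str) -> dict:
    """Return keyword/semantic weights for a query, normalized to sum to 1.0."""
    kw = DEFAULT_KEYWORD_WEIGHT
    for pattern, keyword_weight in RULES:
        if pattern.search(query):
            kw = keyword_weight
            break  # first match wins
    return {"keyword": kw, "semantic": 1.0 - kw}

print(route("When does 9-301 NOT apply?"))  # section pattern fires → 85% keyword
```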

What surprised us: each domain is wildly different. Medical wants exact enzyme codes but fuzzy symptoms. Legal wants exact sections but flexible concepts. Finance needs temporal awareness. One-size-fits-all search doesn't exist.

Current limits: English only, needs 50+ docs, no images/OCR yet.

We packaged this as CoderSwap.AI, but honestly we're more interested in whether others have tried corpus-driven search config. What worked/failed?

Questions:

- Where has pure vector search burned you the worst?
- Anyone else doing automatic pattern extraction from corpora?
- Is RRF still the best fusion method, or is there something better?
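For anyone unfamiliar with the RRF question: reciprocal rank fusion scores each document as the sum of 1/(k + rank) across the ranked lists being fused, ignoring raw scores entirely. A minimal sketch (k=60 is the commonly used constant from the original RRF paper):

```python
# Reciprocal Rank Fusion: fuse multiple ranked lists using only ranks.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns fused ids, best first."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]
dense_results = ["d2", "d3", "d1"]
print(rrf([bm25_results, dense_results]))  # → ['d2', 'd1', 'd3']
```

Its appeal is that it needs no score normalization, which is exactly the part weighted fusion gets wrong when the scales drift.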

Happy to share more benchmarks or implementation details.