Developer, Knowledge Management Advocate

Building a web search engine from scratch in two months with 3 billion neural embeddings

I stumbled upon this quite bonkers article about building a web search engine from scratch as a solo developer with relatively modest resources; I can absolutely recommend reading it:

Building a web search engine from scratch in two months with 3 billion neural embeddings

Summary (Generated):

  • Wilson Lin built a full web search engine from scratch in ~2 months, crawling ~280M pages and generating 3B SBERT embeddings on a 200-GPU cluster.
  • Lin focused on neural-embedding search, with smart HTML normalization + sentence-level chunking + contextual “statement chaining” so queries match meaning and intent, not just keywords.
  • The infra is highly optimized + cheap: custom crawler & RocksDB-based queues/KV, sharded HNSW / CoreNN vector DB, mTLS service mesh, hundreds of GPUs on low-cost providers (Runpod, Hetzner, Oracle).
  • The SERP emphasizes high-quality, low-SEO-spam content, knowledge panels (Wikipedia/Wikidata), and a light AI assistant for quick answers and reranking, but still feels like a classic fast search engine.
  • Biggest lessons: crawling + quality filtering are the hardest part, embeddings make very specific queries vastly better, and search + LLMs will likely coexist (LLMs shouldn’t memorize everything, but retrieve via dense indices).
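To make the chunking-plus-context idea above concrete, here is a minimal sketch (not Lin's actual code): sentences become chunks, each prefixed with its preceding sentence as chained context, then embedded and searched by cosine similarity. The `embed` function is a toy bag-of-words hasher standing in for an SBERT model, and the brute-force search stands in for the sharded HNSW index.

```python
import math
import re

def chunk_statements(text, chain=1):
    """Split text into sentence-level chunks, prepending up to `chain`
    preceding sentences as context ("statement chaining")."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - chain):i]
        chunks.append(" ".join(context + [sentence]))
    return chunks

def embed(text, dim=64):
    """Toy hashing embedder (stand-in for a real SBERT model):
    buckets tokens into a fixed-size vector, then L2-normalizes."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def search(query, chunks, top_k=3):
    """Brute-force cosine search (stand-in for an HNSW vector index)."""
    q = embed(query)
    scored = []
    for chunk in chunks:
        v = embed(chunk)
        scored.append((sum(a * b for a, b in zip(q, v)), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

At real scale the embedder runs on a GPU cluster and the index is sharded across machines, but the shape of the pipeline — chunk, chain context, embed, nearest-neighbor search — is the same.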
Tags: AI Infrastructure · Databases · search-engine · neural-embeddings · Technology