Developer, Knowledge Management Advocate

Building a web search engine from scratch in two months with 3 billion neural embeddings

I stumbled upon this quite bonkers article about building a web search engine from scratch as a solo developer with relatively modest resources, and I can absolutely recommend reading it:

Building a web search engine from scratch in two months with 3 billion neural embeddings

Summary (Generated):

  • Wilson Lin built a full web search engine from scratch in ~2 months, crawling ~280M pages and generating 3B SBERT embeddings on a cluster of ~200 GPUs.
  • Lin bet on neural-embedding search: careful HTML normalization, sentence-level chunking, and contextual “statement chaining” let queries match meaning and intent rather than just keywords.
  • The infrastructure is highly optimized and cheap: a custom crawler, RocksDB-based queues and KV stores, a sharded HNSW vector DB (CoreNN), an mTLS service mesh, and hundreds of GPUs on low-cost providers (Runpod, Hetzner, Oracle).
  • The SERP emphasizes high-quality, low-SEO-spam content, knowledge panels (Wikipedia/Wikidata), and a light AI assistant for quick answers and reranking, while still feeling like a classic, fast search engine.
  • Biggest lessons: crawling and quality filtering are the hardest parts, embeddings make very specific queries vastly better, and search and LLMs will likely coexist (LLMs shouldn’t memorize everything; they should retrieve via dense indices).
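To make the “sentence-level chunking + statement chaining” idea from the summary more concrete, here is a minimal Python sketch. The chaining strategy shown (prepending a page heading plus a sliding window of preceding sentences to each sentence before embedding) is my own illustrative assumption, not Lin’s exact implementation:

```python
def chain_statements(sentences, heading="", window=2):
    """Return one chunk per sentence, each prefixed with its context.

    Each chunk carries the page heading plus up to `window` preceding
    sentences, so a sentence like "It supports sharding." is embedded
    with enough context to resolve what "it" refers to.
    """
    chunks = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - window):i]
        parts = ([heading] if heading else []) + context + [sentence]
        chunks.append(" ".join(parts))
    return chunks


# Hypothetical example input, purely for illustration.
sentences = [
    "CoreNN is a vector database.",
    "It shards an HNSW index across machines.",
    "Queries return nearest neighbors by cosine similarity.",
]
chunks = chain_statements(sentences, heading="CoreNN overview")
for chunk in chunks:
    print(chunk)
```

In a pipeline like the one described, each chunk (rather than the bare sentence) would then be run through SBERT, and the resulting vectors inserted into the sharded HNSW index.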

Thank you for reading!
Written by Mia Heidenstedt.

AI Infrastructure Databases Search Engine Neural Embeddings Technology