Distributed Search Engine

A distributed search engine built to handle large-scale web crawling and indexing with horizontal scalability.

Python, Distributed Systems, Web Crawling, Information Retrieval, Elasticsearch

A fully functional distributed search engine that can crawl, index, and serve search results across multiple nodes.

Features

  • Distributed Crawling: Multi-node web crawler with duplicate detection
  • Inverted Index: Efficient indexing using a distributed inverted index
  • Ranking Algorithm: TF-IDF and PageRank-based ranking for relevant results
  • Fault Tolerance: Automatic failover and data replication
  • Horizontal Scaling: Add nodes dynamically to increase capacity
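
To make the indexing and ranking features concrete, here is a minimal single-process sketch of an inverted index with TF-IDF scoring. It is an illustration under stated assumptions, not the project's actual code: the names build_index and search are hypothetical, tokenization is plain whitespace splitting, and the PageRank component is left out entirely.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term count} (a tiny in-memory inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def search(index, num_docs, query):
    """Return doc ids ordered by summed TF-IDF over the query terms, best first."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)
```

In a distributed setting each indexer node would hold only a slice of the term-to-postings map, and a query server would merge the partial scores from each slice.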

Architecture

The system consists of:

  • Crawler Nodes: Responsible for fetching and parsing web pages
  • Indexer Nodes: Build and maintain distributed inverted indexes
  • Query Servers: Handle search queries and rank results
  • Coordinator: Manages cluster state and load balancing
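
One plausible way for a coordinator to spread index terms across indexer nodes, while still letting nodes join dynamically, is a consistent hash ring. The sketch below is an assumption about how such sharding could work, not a description of this project's implementation: the HashRing class, the replicas parameter, and the node names are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring mapping keys (e.g. index terms) to node names."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas  # virtual points per node, smooths the load
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # SHA-256 is used only to spread keys evenly, not for security.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        """Place `replicas` virtual points for the node on the ring."""
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._points:
            raise LookupError("ring has no nodes")
        i = bisect.bisect_left(self._points, self._hash(key))
        # Wrap around to the first point when the hash is past the last one.
        return self._owner[self._points[i % len(self._points)]]
```

The property that matters for horizontal scaling is that adding a node only reassigns keys to the new node; keys never shuffle between existing nodes, so most of the index stays where it is.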

Technical Highlights

  • Custom URL frontier and politeness policies
  • Distributed hash table for index sharding
  • Real-time indexing pipeline
  • RESTful API for search queries
  • Web interface for search results
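
As an illustration of the URL frontier and politeness policy, the sketch below keeps one FIFO queue per host and refuses to hand out a URL for a host until a minimum delay has passed since that host's last fetch. The Frontier class, its delay parameter, and the choice to pass the current time explicitly (so the example is deterministic) are assumptions made for this example; robots.txt handling and URL normalization are not covered.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """URL frontier: one FIFO queue per host plus a per-host politeness delay."""

    def __init__(self, delay=1.0):
        self.delay = delay   # minimum seconds between fetches to one host
        self._queues = {}    # host -> deque of pending URLs
        self._ready = []     # heap of (earliest_fetch_time, host)
        self._next_ok = {}   # host -> earliest time of its next allowed fetch
        self._seen = set()   # URLs already enqueued (exact-match dedup)

    def add(self, url, now):
        """Enqueue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc
        queue = self._queues.setdefault(host, deque())
        if not queue:
            # Host was idle: schedule it, but never before its politeness window.
            ready_at = max(now, self._next_ok.get(host, now))
            heapq.heappush(self._ready, (ready_at, host))
        queue.append(url)
        return True

    def next_url(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, host = heapq.heappop(self._ready)
        queue = self._queues[host]
        url = queue.popleft()
        self._next_ok[host] = now + self.delay
        if queue:
            # More work for this host: reschedule after the delay.
            heapq.heappush(self._ready, (self._next_ok[host], host))
        return url
```

A crawler loop would typically call next_url with a monotonic clock reading and sleep briefly when it returns None.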