Distributed Search Engine

A distributed search engine built to handle large-scale web crawling and indexing with horizontal scalability.

Python, Distributed Systems, Web Crawling, Information Retrieval, Elasticsearch

The engine crawls the web, indexes the fetched pages, and serves search results across multiple nodes.

Features

Distributed Crawling: Multi-node web crawler with duplicate detection
Inverted Index: Indexing with a distributed inverted index structure
Ranking Algorithm: TF-IDF and PageRank-based ranking for relevant results
Fault Tolerance: Automatic failover and data replication
Horizontal Scaling: Add nodes dynamically to increase capacity
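
To make the inverted index and TF-IDF ranking concrete, here is a minimal single-process sketch. It is an illustration under stated assumptions, not the project's actual code: the function names build_index and search, the whitespace tokenizer, and the sample documents are all hypothetical, and PageRank and distribution across nodes are left out.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (a toy inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms get a higher inverse document frequency.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Break score ties by doc_id so the ordering is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample documents.
docs = {
    "a": "distributed search engine",
    "b": "distributed crawler",
    "c": "search ranking with search signals",
}
index = build_index(docs)
results = search(index, len(docs), "search engine")
print([doc_id for doc_id, _ in results])  # ['a', 'c']
```

Document "a" ranks first because it matches the rare term "engine", which outweighs the extra occurrence of "search" in document "c".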

Architecture

The system consists of:

Crawler Nodes: Responsible for fetching and parsing web pages
Indexer Nodes: Build and maintain distributed inverted indexes
Query Servers: Handle search queries and rank results
Coordinator: Manages cluster state and load balancing
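
One way a coordinator could map index terms to indexer nodes, while still allowing nodes to be added dynamically, is a consistent hash ring. The sketch below is an assumption about how such routing might look, not a description of the project's implementation: the class HashRing, its methods, the virtual-node count, and the node names are hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring that assigns each term to one node."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes  # virtual nodes per physical node, for smoother balance
        self._ring = []       # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, term):
        """Return the node owning the first ring point at or after the term's hash."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        i = bisect.bisect_right(self._ring, (_hash(term), ""))
        return self._ring[i % len(self._ring)][1]  # wrap around past the last point


# Hypothetical node names.
ring = HashRing(["indexer-1", "indexer-2", "indexer-3"])
terms = [f"term{i}" for i in range(1000)]
before = {t: ring.node_for(t) for t in terms}

ring.add_node("indexer-4")
after = {t: ring.node_for(t) for t in terms}
moved = [t for t in terms if before[t] != after[t]]
print(f"{len(moved)} of {len(terms)} terms moved to the new node")
```

The point of this design is that adding a node only reassigns terms to the new node; terms never shuffle between the existing nodes, which keeps rebalancing cheap compared with a plain hash-modulo scheme.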

Technical Highlights

Custom URL frontier and politeness policies
Distributed hash table for index sharding
Real-time indexing pipeline
RESTful API for search queries
Web interface for search results
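
As an illustration of what a URL frontier with a politeness policy can look like, the sketch below keeps one queue per host and enforces a minimum delay between fetches to the same host. It is a simplified single-process example under assumptions: the class name Frontier, the fixed per-host delay, and the injectable clock are hypothetical, and robots.txt handling and URL prioritization are omitted.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """URL frontier with duplicate detection and a per-host crawl delay."""

    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay = delay                 # minimum seconds between fetches per host
        self.clock = clock                 # injectable so the delay can be tested
        self._seen = set()                 # every URL ever added
        self._queues = defaultdict(deque)  # host -> pending URLs
        self._next_ok = {}                 # host -> earliest time of next fetch
        self._ready = []                   # heap of (next_allowed_time, host)

    def add(self, url):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc.lower()
        queue = self._queues[host]
        if not queue:
            # A host sits in the heap only while it has pending URLs.
            when = self._next_ok.get(host, float("-inf"))
            heapq.heappush(self._ready, (when, host))
        queue.append(url)
        return True

    def next_url(self):
        """Return a URL whose host may be fetched now, or None if none is ready."""
        if not self._ready or self._ready[0][0] > self.clock():
            return None
        _, host = heapq.heappop(self._ready)
        queue = self._queues[host]
        url = queue.popleft()
        self._next_ok[host] = self.clock() + self.delay
        if queue:
            heapq.heappush(self._ready, (self._next_ok[host], host))
        return url


frontier = Frontier(delay=2.0)
frontier.add("https://a.example.com/1")
frontier.add("https://a.example.com/2")
frontier.add("https://b.example.com/1")
print(frontier.next_url())  # https://a.example.com/1
print(frontier.next_url())  # https://b.example.com/1
print(frontier.next_url())  # None: a.example.com is still in its delay window
```

In a multi-node crawler, each crawler node would typically own a disjoint set of hosts so that the per-host delay can be enforced locally; that partitioning is not shown here.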