Passion Projects
Distributed Search Engine
A distributed search engine built to handle large-scale web crawling and indexing with horizontal scalability.
Python, Distributed Systems, Web Crawling, Information Retrieval, Elasticsearch
A fully functional distributed search engine that crawls the web, indexes pages, and serves search results across multiple nodes.
Features
- Distributed Crawling: Multi-node web crawler with duplicate detection
- Inverted Index: Efficient indexing using a distributed inverted index structure
- Ranking Algorithm: TF-IDF and PageRank-based ranking for relevant results
- Fault Tolerance: Automatic failover and data replication
- Horizontal Scaling: Add nodes dynamically to increase capacity
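To make the inverted index and TF-IDF features concrete, here is a minimal single-process sketch. It is not the project's actual code: the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are all assumptions chosen for illustration, and PageRank is left out.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term_frequency} (hypothetical helper)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "a": "distributed search engine",
    "b": "distributed web crawler",
    "c": "search ranking with search signals",
}
index = build_inverted_index(docs)
results = search(index, len(docs), "search ranking")
print([doc_id for doc_id, _ in results])  # document "c" ranks above "a"
```

In a distributed setting each node would hold only part of `index`, and a query server would merge the partial scores; this sketch keeps everything in one dictionary for readability.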
Architecture
The system consists of:
- Crawler Nodes: Responsible for fetching and parsing web pages
- Indexer Nodes: Build and maintain distributed inverted indexes
- Query Servers: Handle search queries and rank results
- Coordinator: Manages cluster state and load balancing
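One way a coordinator could assign index terms to indexer nodes while letting nodes join dynamically is a consistent hash ring. The sketch below is an assumption about how that might look, not a description of this project's coordinator: the `HashRing` class, the node names, the virtual-node count, and the use of MD5 as the hash are all hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Stable integer hash of a string (MD5 used only for distribution)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring mapping terms to nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Several virtual points per node smooth out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, term):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First point clockwise from the term's hash, wrapping around.
        pos = bisect.bisect_left(self._ring, (_hash(term), ""))
        return self._ring[pos % len(self._ring)][1]


ring = HashRing(["indexer-1", "indexer-2", "indexer-3"])
terms = [f"term{i}" for i in range(200)]
before = {t: ring.node_for(t) for t in terms}

ring.add_node("indexer-4")
after = {t: ring.node_for(t) for t in terms}
moved = [t for t in terms if before[t] != after[t]]
print(f"{len(moved)} of {len(terms)} terms moved to the new node")
```

The useful property is that adding a node only moves terms onto the new node; terms never shuffle between the existing nodes, so most of the index stays where it is.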
Technical Highlights
- Custom URL frontier and politeness policies
- Distributed hash table for index sharding
- Real-time indexing pipeline
- RESTful API for search queries
- Web interface for search results
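As an illustration of what a URL frontier with a politeness policy can look like, the following sketch keeps one queue per host and enforces a minimum delay between fetches from the same host. It is a simplified assumption, not the project's implementation: the `PoliteFrontier` class, its method names, and the fixed per-host delay are hypothetical, duplicate detection is exact-match on the URL string, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host URL queues with a minimum delay between fetches."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self._queues = {}        # host -> deque of pending URLs
        self._ready = []         # heap of (earliest_fetch_time, host)
        self._next_allowed = {}  # host -> earliest time of next fetch
        self._seen = set()       # exact-match duplicate detection

    def add(self, url, now):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc
        queue = self._queues.get(host)
        if queue is None:
            queue = self._queues[host] = deque()
            ready_at = max(now, self._next_allowed.get(host, now))
            heapq.heappush(self._ready, (ready_at, host))
        queue.append(url)
        return True

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, else None."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, host = heapq.heappop(self._ready)
        queue = self._queues[host]
        url = queue.popleft()
        self._next_allowed[host] = now + self.delay
        if queue:
            heapq.heappush(self._ready, (now + self.delay, host))
        else:
            del self._queues[host]
        return url


frontier = PoliteFrontier(delay_seconds=2.0)
frontier.add("http://a.example/1", now=0.0)
frontier.add("http://a.example/2", now=0.0)
frontier.add("http://b.example/1", now=0.0)

first = frontier.next_url(now=0.0)    # a.example is fetched first
second = frontier.next_url(now=0.0)   # b.example, since a.example must wait
blocked = frontier.next_url(now=1.0)  # None: a.example is still cooling down
third = frontier.next_url(now=2.0)    # a.example is allowed again
```

A real crawler would typically pass `time.monotonic()` as `now`, honour `robots.txt`, and normalize URLs before the duplicate check; those pieces are omitted here.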