Basic System Design
Geliyoo Search Engine
1.1 Definition, Acronyms and Abbreviations
HDP     Hortonworks Data Platform
HDFS    Hadoop Distributed File System
Table 1: Definition, Acronyms and Abbreviations
Apache Nutch is an open source web crawler written in Java. With it, we can find
web page hyperlinks in an automated manner, reduce a great deal of maintenance work (for
example, checking for broken links), and create a copy of all the visited pages for searching over.
It is a highly scalable and relatively feature-rich crawler. It can easily crawl large numbers
of web pages and invert their links so that they can be crawled again. It provides easy
integration with Hadoop, Elasticsearch, and Apache Cassandra.
Elasticsearch is a search server based on Lucene. It provides a distributed,
multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON
documents. Elasticsearch is developed in Java and is released as open source under the terms
of the Apache License.
It integrates easily with Apache Nutch, which drives it to index the crawled web pages.
The indexed data is stored on the Elasticsearch file system.
Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is
perfect for managing large amounts of structured, semi-structured, and unstructured data
across multiple data centers and the cloud. Cassandra delivers continuous availability, linear
scalability, and operational simplicity across many commodity servers with no single point of
failure, along with a powerful dynamic data model designed for maximum flexibility and fast
response times. Cassandra offers robust support for clusters spanning multiple data centers,
with asynchronous masterless replication allowing low latency operations for all clients.
Hortonworks Data Platform (HDP) is an open source, fully tested and certified Apache™
Hadoop® data platform.
Hortonworks Data Platform is designed to facilitate integrating Apache Hadoop with an
enterprise’s existing data architectures. In other words, HDP is a bundle of all the components
that provide reliable access to Hadoop clustering.
The Apache Ambari project is aimed at making Hadoop management simpler by
developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari
provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
● Provision a Hadoop Cluster
○ Ambari provides a step-by-step wizard for installing Hadoop services across any
number of hosts.
○ Ambari handles configuration of Hadoop services for the cluster.
● Manage a Hadoop Cluster
○ Ambari provides central management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster.
● Monitor a Hadoop Cluster
○ Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.
○ Ambari leverages Ganglia for metrics collection.
○ Ambari leverages Nagios for system alerting and will send emails when your
attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
Ambari enables Application Developers and System Integrators to:
● Easily integrate Hadoop provisioning, management, and monitoring capabilities into their
own applications with the Ambari REST APIs.
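For example, the Ambari REST APIs can be exercised with any HTTP client. The sketch below is minimal and illustrative; the host name, cluster name, and the admin/admin credentials are assumptions, not values from this document:

    # List the clusters managed by an Ambari server.
    curl -u admin:admin http://ambari-host:8080/api/v1/clusters

    # Read the current state of a single service, e.g. HDFS.
    curl -u admin:admin http://ambari-host:8080/api/v1/clusters/mycluster/services/HDFS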
Basic System Flow:
Three procedures take place in the overall system, as described in the following sections,
and the overall system is divided into three clusters:
1. Hadoop Cluster
A small Hadoop cluster includes a single master and multiple worker nodes. The master
node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node
acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes
and compute-only worker nodes.
In a larger cluster, the HDFS is managed through a dedicated NameNode server to host
the file system index, and a secondary NameNode that can generate snapshots of the
NameNode's memory structures, thus preventing filesystem corruption and reducing loss of
data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where
the Hadoop MapReduce engine is deployed against an alternate file system, the NameNode,
secondary NameNode, and DataNode architecture of HDFS is replaced by the
file-system-specific equivalents.
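As a quick way to verify such a cluster and to stage input for the crawler, the standard Hadoop command line can be used. This is a minimal sketch; the seed file and target directory are placeholder names, not values from this document:

    # Report the live DataNodes and overall HDFS capacity of the cluster.
    hadoop dfsadmin -report

    # Upload a local seed list into HDFS for the Nutch inject step
    # ("seed.txt" and "/user/nutch/urls" are placeholder names).
    hadoop fs -mkdir /user/nutch/urls
    hadoop fs -put seed.txt /user/nutch/urls/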
2. Elasticsearch Cluster
Elasticsearch is distributed, which means that indices can be divided into shards and
each shard can have zero or more replicas. Each node hosts one or more shards and acts as
a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done
automatically.
A series of distinct Elasticsearch instances works in a coordinated manner
without much administrative intervention at all. Clustering Elasticsearch instances (or nodes)
provides data redundancy as well as data availability.
Elasticsearch stores indices on the nodes’ file systems in a distributed manner. Search
results are served from this stored index data.
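A minimal check that the nodes have actually formed one cluster (the host and port below are the Elasticsearch defaults, assumed here rather than taken from this document):

    # Ask any node for the overall cluster health; the status, node count,
    # and shard counts confirm that replication and rebalancing are working.
    curl http://localhost:9200/_cluster/health?pretty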
3. Cassandra Cluster
A Cassandra cluster contains one or more data centers, and each data center contains a
number of nodes. Cassandra stores the crawled data in a distributed manner, resulting in good
load balancing. Key features of Cassandra’s distributed architecture are specifically tailored for
multiple-data-center deployment, for redundancy, and for failover and disaster recovery.
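The layout of such a cluster can be inspected with Cassandra’s standard tooling, run on any node:

    # List each data center, its nodes, their state (Up/Normal), and the
    # share of the data each node owns.
    nodetool status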
Crawling is a continuous process. Injection is performed only once, when the seed URLs
are added; all other operations except inject are performed repeatedly, until the desired depth
of URLs has been reached (a complete cycle is sketched after the list below).
All of these operations hand their jobs to Hadoop, and Hadoop performs the tasks in
parallel by distributing them among the nodes.
The following operations are performed when crawling with Nutch:
1. Inject: The nutch inject command adds a list of seed URLs to the database for your
crawl. It reads URL seed files from an HDFS directory. URL validation rules can be
defined in Nutch and are checked during the inject and parse operations; URLs that fail
validation are rejected, while the rest are inserted into the database.
2. Generate: The nutch generate command takes the list of outlinks generated in a
previous cycle, promotes them to the fetch list, and returns a batch ID for this cycle. You
will need this batch ID for subsequent calls in this cycle. The number of top-scoring
URLs to select can be passed as an argument to this operation.
3. Fetch: The nutch fetch command crawls the pages listed in the column family and
writes their contents out into new columns. We need to pass in the batch ID from the
previous step. We can also pass the value ‘all’ instead of a batch ID if we want to fetch all URLs.
4. Parse: The nutch parse command loops through all the pages, analyzes the page
content to find outgoing links, and writes them out into another column family.
5. Update db: The nutch updatedb command takes the URL values from the previous stage
and places them into another column family, so that they can be fetched in the next crawl cycle.
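A minimal sketch of one crawl cycle as shell commands (the seed directory and the topN value are illustrative assumptions; the command names correspond to the operations listed above):

    # One-time seeding: read seed files from an HDFS directory.
    bin/nutch inject /user/nutch/urls

    # Repeat once per level of crawl depth:
    bin/nutch generate -topN 1000    # prints the batch ID for this cycle
    bin/nutch fetch <batchId>        # or -all to fetch every generated URL
    bin/nutch parse <batchId>
    bin/nutch updatedb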
Indexing is performed by Elasticsearch, which is configured with Nutch; Nutch is
responsible for firing the indexing operation.
The elasticindex command takes two mandatory arguments:
1. The first argument is the cluster name, and
2. The second is either a batch ID (obtained from the previous Nutch operations), “all”
(to index all non-indexed data), or “reindex” (to index all data again).
After the command is executed, Nutch hands the job to Hadoop, and Hadoop divides the
job into smaller tasks. Each task stores its indexed data on the file system of the Elasticsearch
cluster in a distributed manner.
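A minimal invocation sketch (the cluster name “geliyoo” is a placeholder, not a value taken from this document):

    # Index the pages fetched in one cycle into the Elasticsearch cluster.
    bin/nutch elasticindex geliyoo <batchId>

    # Alternatives: index all not-yet-indexed data, or rebuild the whole index.
    bin/nutch elasticindex geliyoo -all
    bin/nutch elasticindex geliyoo -reindex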
The indexed data is stored in the file systems of the nodes within the cluster. Elasticsearch
provides a full query DSL based on JSON for defining queries. In general, there are basic
queries such as term or prefix. There are also compound queries like the bool query. Queries
can also have filters associated with them, such as the filtered or constant_score queries, with
specific filter queries.
A query is passed to the Elasticsearch cluster, which matches the query parameters
against the index and returns the matching documents.
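As an illustration of the JSON query DSL (the index name “webpages” and the field names are assumptions, not values from this document), a filtered query can be sent to the RESTful interface of any node:

    # A full-text match combined with a term filter, using the
    # "filtered" compound query mentioned above.
    curl -XPOST 'http://localhost:9200/webpages/_search?pretty' -d '{
      "query": {
        "filtered": {
          "query":  { "match": { "content": "search engine" } },
          "filter": { "term":  { "lang": "en" } }
        }
      }
    }'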
We have already tested this system in an environment with one master node and two
worker nodes.