Arama Motoru Geliyoo'nun düzenlemesi hakkında.

GELIYOO.COM
Search Engine
Development
Project Description
[Type the abstract of the document here. The abstract is typically a short summary of the contents
of the document. Type the abstract of the document here. The abstract is typically a short
summary of the contents of the document.]
Buray Savas ANIL
3/1/2014

3 of 56
Table of Contents
Introduction: .......................................................................................................................................5
Phases of Development: .....................................................................................................................6
1. Component Selection..............................................................................................................7
Objective .....................................................................................................................................7
Components Considered.............................................................................................................8
1. Apache Nutch ..................................................................................................................8
2. Hadoop ..........................................................................................................................10
3. Apache Hortonworks.....................................................................................................12
4. MongoDB.......................................................................................................................13
5. HBase.............................................................................................................................14
6. Cassandra.......................................................................................................................15
7. Elastic Search.................................................................................................................17
8. Apache solr....................................................................................................................18
9. Spring MVC ....................................................................................................................19
10. Easy-Cassandra..............................................................................................................21
Conclusion.................................................................................................................................22
2. Architecture Design...............................................................................................................24
Objective ...................................................................................................................................24
System Design ...........................................................................................................................25
1. Web Server for Geliyoo.com : .......................................................................................26
2. Web Server for hosting WebService API: ......................................................................28
3. ElasticSearch Cluster:.....................................................................................................28
4. Hadoop Cluster:.............................................................................................................29
5. Cassandra Cluster ..........................................................................................................34
6. Horizontal Web Server Clustering .................................................................................35
Conclusion.....................................................................................................................................37
3. Component Configuration.....................................................................................................38
Objective ...................................................................................................................................38
Configuration Parameters.........................................................................................................39
Hortonworks Configuration ..................................................................................................39
Installing and running Ambari Server....................................................................................42

4 of 56
Nutch configuration on hadoop Master Node......................................................................44
Cassandra Configuration.......................................................................................................46
ElasticSearch Configuration..................................................................................................47
Run Nutch jobs on Hadoop Master Node .............................................................................49
Conclusion.....................................................................................................................................49
4. Development.........................................................................................................................50
Objective ...................................................................................................................................50
Development Completed..........................................................................................................51
Prototype ..............................................................................................................................51
Implementation :...................................................................................................................51
Future Development.................................................................................................................54
Content Searching.................................................................................................................54
Semantic Search....................................................................................................................55
Prototypes.............................................................................................................................56
Video search..........................................................................................................................56

5 of 56
Geliyoo Search Engine Project Documentation
Introduction:
We are developing a semantic search engine, which will be available for the general user for
searching the internet. Also, this product can be customized so that it can also be installed on a
company's intranet, which will help the users to search the documents & images that are made
available by the each individual and company as a whole to be available for general access.
Objective is to create semantic search engine. For searching we need data from different
websites. We need to crawl many websites for collecting data. Data is stored in a large data store.
For searching from this data store it needs to index those data. For this all process we need
different components for different task.
The semantic search engine development process has following three major components:
1. Crawling
Web crawling is harvesting web contents by visiting each website, and find its all outlinks
for fetching their content too. Crawler is continuous process for fetching web content up
to Nth depth of website. It is restricted by robot.txt from many sites. Web content mean
all text content, images, docs etc. In short Fetching all content from web that available in
website. We need tool for fetch all content , parse by their mime type and finds their
outlinks for same.
Crawlers are rather dumb processes that fetch content supplied by Web servers
answering (HTTP) requests of requested URIs. Crawlers get their URIs from a crawling
engine that’s feeded from different sources, including links extracted from previously
crawled Web documents.
2. Indexing
Indexing means making sense out of the retrieved contents, storing the processing results
in a document index. All harvested data must be indexed for searching from them.
3. Searching
We need component for getting efficient result of search query. For searching from large
indexed data, it must be processed in speedy manner and return all relevant results.

6 of 56
Phases of Development:
Search engine development follows the following phases of development :
1. Component Selection : Select the components that are useful for the implementation of
the search engine.
2. Architecture Design : Design architecture of the system that will allow both internet based
search as well as intranet based search.
3. Component Configuration: Configure the selected components as per our requirements
that will augment the searching process.
4. Development : We will develop a search engine web application and the remaining
components of the system that are not available with the current components.
5. Future Development : The tasks that still needs development are mentioned here.

7 of 56
1. Component Selection
Objective
There are many open source components available which can help us to develop the search
engines. Instead of creating everything from scratch we planned to used some open source
components and customize and extend them as per our requirements. This will save us lot of time
and money to recreate the same thing that has already been developed. For this we need to
figure out the right components that will match our requirements.

8 of 56
Components Considered
We go through many tools for achieve this project’s objective.
1. Apache Nutch
First component which we evaluated was Apache Nutch for crawling the website links.
Apache Nutch is an open source web crawler written in Java. By using it, we can find webpage
hyperlinks in an automated manner, reduce lots of maintenance work, for example checking
broken links, and create a copy of all the visited pages for searching over.
It is highly scalable and relatively feature rich crawler. It can easily crawl lots of web pages and
can find its invert links for crawl them again. It provides easy integration with Hadoop, Elastic
Search and Apache Cassandra.
Fig 1. Basic Workflow of Apache Nutch
List of Nutch Jobs
1. Inject
The nutch inject command allows you to add a list of seed URLs in database for
your crawl. It takes urls seed files. We can define url validation with nutch and it
will check with injection and parsing operation, urls which are not validates are
rejected while rest of urls are inserted in database.
 Command: bin/nutch inject <url_dir>

9 of 56
 Example: bin/nutch inject urls
2. Generate
The nutch generate command will take the list of outlinks generated from a
previous cycle and promote them to the fetch list and return a batch ID for this
cycle. You will need this batch ID for subsequent calls in this cycle. Number of top
URLs to be selected by passing top score as argument with this operation.
 Command: bin/nutch generate batch id or -all
 Example: bin/nutch generate -all
3. Fetch
The Nutch Fetch command will crawl the pages listed in the column family and
write out the contents into new columns. We need to pass in the batch ID from
the previous step. We can also pass ‘all’ value instead of batch id if we want to
fetch all url.
4. Parse
The nutch parse command will loop through all the pages, analyze the page
content to find outgoing links, and write them out in the another column family.
5. Updatedb
The nutch updatedb command takes the url values from the previous stage and
places it into the another column family, so they can be fetched in the next crawl
cycle.
Features
● Fetching and parsing are done separately by default, this reduces the risk of an
error corrupting the fetch parse stage of a crawl with Nutch
● Plugins have been overhauled as a direct result of removal of legacy Lucene
dependency for indexing and search.
● Easily configurable and movable
● We can create or add extra plugins for scale its functionality
● Validation rules are available for restrict other websites or contents.
● Tika parser plugin available for parsing all types of content types
● OPIC Scoring plugin or LinkRank plugin is used for calculation of webpage rank
with nutch.

10 of 56
2. Hadoop
Hadoop itself refers to the overall system that runs jobs in one or more machines parallel,
distributes tasks (pieces of these jobs) and stores data in a parallel and distributed fashion.
Hadoop cluster has Multiple Process node it include some master Node and some slave Node.
It has it own Filesystem it’s called Hdfs.the HDFS is managed through a dedicated NameNode
server to host the file system index, and a secondary NameNode that can generate snapshots
of the NameNode's memory structures, the HDFS manage Replication on one or more Machit.
so if Data loss from one Node it can recover from another Node itself.
Hadoop is easily configure with apache nutch. Hence all nutch crawling and indexing processes
are performed parallely in different nodes for decrease processes time. Nutch gives job to
hadoop for their operation and hadoop perform its job and return to nutch.
HDFS user manual Screen
Browse HDFS file System and HDFS storage information

11 of 56
Nutch running job information.

12 of 56
3. Apache Hortonworks
Hortonworks Data Platform (HDP) is open source, fully-tested and certified, Apache™
Hadoop® data platform.
Hortonworks Data Platform is designed for facilitates integrating Apache Hadoop with an
enterprise’s existing data architectures. We can say HDP is bunch of all components that
provides reliable access for hadoop clustering.
The Apache Ambari project is aimed at making Hadoop management simpler by developing
software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari
provides an intuitive, easy-to-use Hadoop management web UI backed by its REST APIs.
HDP saves our time for manages hadoop cluster by giving attractive web ui. We can easily
scale hadoop cluster by web application. We can also analyse performance and health of
hadoop job and cluster by different graphs. Like we can get details of Memory usage, network
usage, cluster load, cpu usage etc.

13 of 56
4. MongoDB
MongoDB is an open-source document database that provides high performance, high
availability, and automatic scaling.
A record in MongoDB is a document, which is a data structure composed of field and value
pairs. MongoDB documents are similar to JSON objects. The values of fields may include other
documents, arrays, and arrays of documents.
The advantages of using documents are:
● Documents (i.e. objects) correspond to native data types in many programming language.
● Embedded documents and arrays reduce need for expensive joins.
● Dynamic schema supports fluent polymorphism.
Features:
1. High Performance
MongoDB provides high performance data persistence. In particular,
 Support for embedded data models reduces I/O activity on database system.
 Indexes support faster queries and can include keys from embedded documents and
arrays.
2. High Availability
To provide high availability, MongoDB’s replication facility, called replica sets, provide:
 automatic failover.
 data redundancy.
A replica set is a group of MongoDB servers that maintain the same data set, providing
redundancy and increasing data availability.

14 of 56
5. HBase
HBase is a column-oriented database that’s an open-source implementation of Google’s Big
Table storage architecture. It can manage structured and semi-structured data and has some
built-in features such as scalability, versioning, compression and garbage collection. Since its
uses write-ahead logging and distributed configuration, it can provide fault-tolerance and
quick recovery from individual server failures. HBase built on top of Hadoop / HDFS and the
data stored in HBase can be manipulated using Hadoop’s MapReduce capabilities.
HBase Architecture:
The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown
below. Typically, the HBase cluster has one Master node, called HMaster and multiple Region
Servers called HRegionServer. Each Region Server contains multiple Regions. Regions Just like
in a Relational Database, data in HBase is stored in Tables and these Tables are stored in
Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These
Regions are assigned to Region Servers across the cluster.
HBase Components
1. HMaster
 Performing Administration
 Managing and Monitoring the Cluster
 Assigning Regions to the Region Servers
 Controlling the Load Balancing and Failover
2. HRegionServer
 Hosting and managing Regions
 Splitting the Regions automatically
 Handling the read/write requests
 Communicating with the Clients directly
Features
 Linear and modular scalability.
 Strictly consistent reads and writes.
 Automatic and configurable sharding of tables
 Automatic failover support between RegionServers.
 Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase
tables.
 Easy to use Java API for client access.
 Block cache and Bloom Filters for real-time queries.
 Query predicate push down via server side Filters
 Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary
data encoding options.

15 of 56
6. Cassandra
Cassandra is open source distributed database system that is designed for storing and
managing large amounts of data across commodity servers.Cassandra is designed to have
peer-to-peer symmetric nodes, instead of master or named nodes, to ensure there can never
be a single point of failure .Cassandra automatically partitions data across all the nodes in the
database cluster, we can add N number of node in cassandra.
Features
1. Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is
distributed across the cluster (so each node contains different data), but there is no master as
every node can service any request.
2. Supports replication and multi data center replication
Replication strategies are configurable. Cassandra is designed as a distributed system, for
deployment of large numbers of nodes across multiple data centers. Key features of
Cassandra’s distributed architecture are specifically tailored for multiple-data center
deployment, for redundancy, for failover and disaster recovery.
3. Scalability
Read and write throughput both increase linearly as new machines are added, with no
downtime or interruption to applications.
4. Fault-tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across
multiple data centers is supported. Failed nodes can be replaced with no downtime.
5. MapReduce support
Cassandra has Hadoop integration, with MapReduce support.
6. Query language
CQL (Cassandra Query Language) was introduced, a SQL-like alternative to the traditional RPC
interface. Language drivers are available for Java (JDBC).
Replication in Cassandra
Replication is the process of storing copies of data on multiple nodes to ensure reliability and
fault tolerance. When you create a keyspace in Cassandra, you must decide the replica
placement strategy: the number of replicas and how those replicas are distributed across
nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it
determine the physical location of nodes and their proximity to each other.

16 of 56
Replication Strategies:
1. Simple Strategy:
Simple Strategy is the default replica placement strategy when creating a keyspace using
Cassandra CLI. Simple Strategy places the first replica on a node determined by the
partitioner. Additional replicas are placed on the next nodes clockwise in the ring without
considering rack or data center location.
Fig: Simple Strategy diagram
2. Network Topology Strategy:
As the name indicates, this strategy is aware of the network topology (location of nodes
in racks, data centers etc.) and is much intelligent than Simple Strategy. This strategy is a
must if your Cassandra cluster spans multiple data centers and lets you specify how many
replicas you want per data center. It tries to distribute data among racks to minimize
failures. That is, when choosing nodes to store replicas, it will try to find a node on a
different rack.

17 of 56
7. Elastic Search
 Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-
capable full-text search engine with a RESTful web interface and schema-free JSON
documents. Elasticsearch is developed in Java and is released as open source under the
terms of the Apache License.
 It easily integrates with Apache Nutch and Nutch operates this for indexing web pages.
Indexed data will store on its file system.
 ElasticSearch is distributed, which means that indices can be divided into shards and each
shard can have zero or more replicas. Each node hosts one or more shards, and acts as a
coordinator to delegate operations to the correct shard(s). Rebalancing and routing are
done automatically.
 We have a series of distinct ElasticSearch instances work in a coordinated manner without
much administrative intervention at all. Clustering ElasticSearch instances (or nodes)
provides data redundancy as well as data availability.
 Indexed data is stored in the file system of nodes within cluster. Elasticsearch provides a
full query based on JSON to define queries. In general, there are basic queries such as
term or prefix. There are also compound queries like the bool query. Queries can also
have filters associated with them such as the filtered or constant_score queries, with
specific filter queries. Query is pass to elasticSearch cluster and it will match query
parameter and return result.
Features
 First, by having a rich RESTful HTTP API, it’s trivial to query elastic search with Ajax.
(elasticsearch further supports JavaScript developers with cross-origin resource
sharing by sending an Access-Control-Allow-Origin header to browsers.)
 Second, since elasticsearch stores schema-free documents serialized as JSON —
coming from “JavaScript Object Notation”, so obviously a native entity in JavaScript
code —, it can be used not only as a search engine, but also as a persistence engine.

18 of 56
8. Apache solr
 Apache Solr is an open source search platform built upon a Java library called Lucene.
 Solr is a popular search platform for Web sites because it can index and search multiple
sites and return recommendations for related content based on the search query’s
taxonomy. Solr is also a popular search platform for enterprise search because it can be
used to index and search documents and email attachments.
 Solr works with Hypertext Transfer Protocol (HTTP) Extensible Markup Language (XML). It
offers application program interfaces (APIs) for Javascript Object Notation (JSON), Python,
and Ruby. According to the Apache Lucene Project, Solr offers capabilities that have made
it popular with administrators including:
o Indexing in near real time
o Automated index replication
o Server statistics logging
o Automated failover and recovery
o Rich document parsing and indexing
o Multiple search indexes
o User-extensible caching
o Design for high-volume traffic
o Scalability, flexibility and extensibility
o Advanced full-text searching
o Geospatial searching
o Load-balanced querying

19 of 56
9. Spring MVC
Spring MVC is the web component of Spring’s framework. Spring Framework is a Java platform
that provides comprehensive infrastructure support for developing Java applications. Spring
handles the infrastructure so one can focus on his/her application. It provides a rich
functionality for building robust Web Applications. The Spring MVC Framework is architected
and designed in such a way that every piece of logic and functionality is highly configurable.
Following is the Request process lifecycle of Spring 3.0 MVC
*Here, User needs to define BeanNameUrlHandlerMapping / SimpleUrlHandlingMapping etc that inherits
HandlerMapping interface.
**Here, You can define multiple controllers like SimpleFormController/MultiActionController etc that ultimately
inherits Controller interface.
Features
 Spring enables developers to develop enterprise-class applications using POJOs. The
benefit of using only POJOs is that no need an EJB container product such as an
application server instead there is an option of using only a robust servlet container such
as Tomcat or some commercial product.
 Spring is organized in a modular fashion. Even though the number of packages and classes
are substantial, so need to worry only about needed ones and ignore the rest.

20 of 56
 Spring does not reinvent the wheel instead, it truly makes use of some of the existing
technologies like several ORM frameworks, logging frameworks, JEE, Quartz and JDK
timers, other view technologies.
 Testing an application written with Spring is simple because environment-dependent code
is moved into this framework. Furthermore, by using JavaBean-style POJOs, it becomes
easier to use dependency injection for injecting test data.
 Spring's web framework is a well-designed web MVC framework, which provides a great
alternative to web frameworks such as Struts or other over engineered or less popular
web frameworks.
 Spring provides a convenient API to translate technology-specific exceptions (thrown by
JDBC, Hibernate, or JDO, for example) into consistent, unchecked exceptions.
 Lightweight IoC containers tend to be lightweight, especially when compared to EJB
containers, for example. This is beneficial for developing and deploying applications on
computers with limited memory and CPU resources.
 Spring provides a consistent transaction management interface that can scale down to a
local transaction (using a single database, for example) and scale up to global transactions
(using JTA, for example).
 Spring has @Async annotation. Using this annotation one can run necessary processes
asynchronously. This feature is very useful for Geliyoo Search Engine to minimize the
search time.

21 of 56
10. Easy-Cassandra
We use cassandra ,which is nosql, to save and retrive data. So we need to make integration
between Spring MVC and Cassandra. For that we use easy-cassandra api.
Easy-Cassandra is a framework ORM API and a high client for Apache Cassandra in java.
Using this, it is possible to persist information from the Java Object in easy way.
To persist information, it adds some annotations at some fields and classes.
It works like an abstraction's tier in the Thrift, doing call for Cassandra.
The EasyCassandra uses the Thrift implementation and has like the main objective be one
simple ORM( Object relational manager).
Features
 An ORM easy to use in Cassandra.
 Only need is to use some Annotations in a class to persist informations.
 Persists many Java Objects in way extremely easy (e.g: all primitives types,
java.Lang.String, java.lang.BigDecimal, java.io.File, etc.).
 Compatible with CQL 3.0.
 In the Apache version 2.0 license.
 Supporting JPA 2.0 annotation.
 Work with multi-nodes.
 Complex rowkey (a key with tow or more keyrow).
 Map some collections (java.util.List, java.util.Set, java.util.Map).
 Find automatically the others clusters which do part of the same cluster.
 May use multiple keyspaces simultaneously.
 Integrations with Spring.

22 of 56
Conclusion
We had different options for NoSQL database and we compared them based on the features we
need for the development of this project. Following is the component feature compatibility table
Feature HBase MongoDB Cassandra
Hortonwork Suppot 0.96.4 No Support No Support
Developed language Java Java C++
Best used Hadoop is probably
still the best way to
run Map/Reduce jobs
on huge datasets.
Best if you use the
Hadoop/HDFS stack
already.
If you need dynamic
queries. If you prefer
to define indexes, not
map/reduce
functions. If you need
good performance on
a big DB. If you
wanted CouchDB, but
your data changes too
much, filling up disks.
When you write more
than you read (logging). If
every component of the
system must be in Java.
("No one gets fired for
choosing Apache's
stuff.")
Main point
Billions of rows X
millions of columns
Retains some friendly
properties of SQL.
(Query, index)
Best of BigTable and
Dynamo
Server-side scripts yes JavaScript no
Replication methods
selectable replication
factor
Master-slave
replication
selectable replication
factor
Consistency concepts Immediate Consistency
Eventual
Consistency,Immediate
Consistency
Eventual
Consistency,Immediate
Consistency
Nutch Support 0.90.4 2.22 2.2
hadoop Support 1.2.1 1.1.X 2.2

23 of 56
Apache Solr v/s Elastic search
 ElasticSearch was released specifically designed to make up for the lacking distributed
features of Solr. For this reason, it may find it easier and more intuitive to start up an
ElasticSearch cluster rather than a SolrCloud cluster
 ElasticSearch will automatically load balance and move shards to new nodes in the cluster.
This automatic shard rebalancing behavior does not exist in Solr.
 There was an issue in solr + nutch for making solr distributed, hence we choose elastic
search for its great features of distribution, searching query etc.
Final Component Selection :
We had gone through all above component and based on our requirements and their respective
features we have finalized the following components :
Service Selected Component
Parallel Processing Apache Hadoop
Crawling Apache Nutch
NoSQL Cassandra
Searching Elastic Search
MVC Spring MVC
ORM EasyCassandra

24 of 56
2. Architecture Design
Objective
To identify an architecture that will meet all the project requirements for the Geliyoo Search
engine development. The design will be based on the components that we selected and based on
the configurable items that they provide. Also, we need to consider the other non functional
factors for the development like number of requests per second, active users etc. Since there is a
fair chance that this site would have so much load we need to architect in order to come up with a
decent architecture.

26 of 56
1. Web Server for Geliyoo.com :
There are three parts of the web application that we propose to develop.
1. Super Admin Panel :
This panel will allow the super user to manage various settings of the system and also
perform functionality like adding urls ,scheduling the indexing and crawling of the urls,
manage users, etc.
2. User Admin Panel :
Admin panel will have allow the registered administrator user to add their sites they
propose to crawl, index and also see their results.
3. General Users
The general user, is a user who will be allowed to search various site indexed by
geliyoo search engine. They will be given an interface to search the web.
Since we expect that there would be too much load on this server we will have a cluster of
the webservers for load balancing and high availability.

28 of 56
2. Web Server for hosting WebService API:
This webserver will host Web Service API for searching and related functionalities. We have
bifurcated this with the admin panel functionalities, so as to manage the load for searching.
The web services will call the elastic search cluster's API to get the search results.
3. ElasticSearch Cluster:
Searching

29 of 56
Figure 2.3
Indexed data is stored in the file system of nodes within cluster. Elasticsearch provides a full
query based on JSON to define queries. In general, there are basic queries such as term or
prefix. There are also compound queries like the bool query. Queries can also have filters
associated with them such as the filtered or constant_score queries, with specific filter
queries.
Query is pass to elasticSearch cluster and it will match query parameter and return result.
4. Hadoop Cluster:
This is the most important part of the system. It will host all the services related to
crawling and indexing. It will also host the web services for the functionalities provided to
admin and super admin users.

30 of 56
Fig: Hadoop Cluster Diagram
Nutch (Crawling & Indexing ) :
For crawling and indexing we will use Nutch. Following is the current architecture of
Nutch Crawler
Basic Components:
Nutch Flow:

31 of 56
There are two procedure take place in overall system as per following:
1. Crawling :
Crawling is continuous process, injection is done by only once when injecting urls, but all
other operations expect inject is perform continuously until how depth we want to go in
urls.
This all operation gives their job to hadoop and hadoop will perform these tasks parallelly
by distributing their task among nodes.
Following operations are performed for crawling with nutch:
Inject
The nutch inject command allows you to add a list of seed URLs in database for your
crawl. It takes urls seed files from hdfs directory. We can define url validation with nutch
and it will check with injection and parsing operation, urls which are not validates are
rejected while rest of urls are inserted in database.
Generate
The nutch generate command will take the list of outlinks generated from a previous cycle
and promote them to the fetch list and return a batch ID for this cycle. You will need this
batch ID for subsequent calls in this cycle. Number of top URLs to be selected by passing
top score as argument with this operation.

32 of 56
Fetch
The Nutch Fetch command will crawl the pages listed in the column family and write out
the contents into new columns. We need to pass in the batch ID from the previous step.
We can also pass ‘all’ value instead of batch id if we want to fetch all url.
Parse
The nutch parse command will loop through all the pages, analyze the page content to
find outgoing links, and write them out in the another column family.
Update db
The nutch updatedb command takes the url values from the previous stage and places it
into the another column family, so they can be fetched in the next crawl cycle.
2. Indexing
Figure 2.2
Indexing can be done by elasticSearch which is configured with nutch, and nutch is
responsible for fire operation of indexing.
Elastic Index command takes two mandatory arguments:
● First argument is Cluster Name and
● Second is either of “batch Id” (which is get by previous operations of nutch), “all” (for all
non indexed data) or “reindex” (for doing again index of all data).
After executing command nutch will give job to hadoop and hadoop will divide job into
smaller tasks. Each task stores indexed data on to the file system of elasticSearch cluster
in distributed manner.

33 of 56
Hadoop:
A small Hadoop cluster includes a single master and multiple worker nodes. The master
node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker
node acts as both a DataNode and TaskTracker, though it is possible to have data-only
worker nodes and compute-only worker nodes.
In a larger cluster, the HDFS is managed through a dedicated NameNode server to host
the file system index, and a secondary NameNode that can generate snapshots of the
NameNode's memory structures, thus preventing file-system corruption and reducing loss
of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters
where the Hadoop MapReduce engine is deployed against an alternate file system, the
NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the
file-system-specific equivalent.

34 of 56
5. Cassandra Cluster
Cassandra cluster contains one or more data centers and each data center have number of
nodes. Cassandra stores crawled data as distributed manner resulting in a good load
balancing. Key features of Cassandra’s distributed architecture are specifically tailored for
multiple-data center deployment, for redundancy, for failover and disaster recovery.

35 of 56
6. Horizontal Web Server Clustering
Objective:
The circumstances may occur in which the machine ,on which Geliyoo Search Api or
Geliyoo web application deployed, will down or become slow because of heavy traffic. To
copup this circumstances we need to make tomcat server clustering. In which our Api and
Application will be deployed on multiple machines(at least more than one). So that if one
server in the cluster goes down, then other servers in the cluster should be able to take
over -- as transparently to the end user as possible.
Process:
Under Horizontal Clustering there can be any no of systems and on each system
we have one Tomcat server running.To make Horizontal tomcat clustering, we are using
Apache http server. The Apache Httpd Server runs on only one of the system and it
controls all the Tomcats running on other systems including the one which installed on the
same system.We are also using mod_jk as load balancer. mod_jk is an Apache module
used to connect the Tomcat servlet container with web servers such as Apache.

36 of 56
Apache http server and mod_jk can be used to balance server load across multiple Tomcat
instances, or divide Tomcat instances into various namespaces, managed by Apache http
server.
Requests hit the Apache server in front and are distributed to backend Tomcat containers
depending on load and availability.The clients know of only one IP (Apache) but the
requests are distributed over multiple containers.So this is in the case you deploy a kind of
distributed web application and you need it robust.
By using Apache HTTP as a front end you can let Apache HTTP act as a front door to your
content to multiple Apache Tomcat instances. If one of your Apache Tomcats fails, Apache
HTTP ignores it. The Apache Tomcats can then be each in a protected area and from a
security point of view, you only need to worry about the Apache HTTP server. Essentially,
Apache becomes a smart proxy server. you can load balance multiple instances of your
application behind Apache. This will allow you to handle more volume, and increase
stability in the event one of your instances goes down. Apache Tomcat uses Connector
components to allow communication between a Tomcat instance and another party, such
as a browser, server, or another Tomcat instance that is part of the same network.
Configuration of this involves enabling mod_jk in Apache, configuring a AJP connector in
your application server, and directing Apache to forward certain paths to the application
server via mod_jk.

37 of 56
The mod_jk connector allows HTTPD to communicate with Apache Tomcat instances over
the AJP protocol. AJP ,acronymn for Apache Jserv Protocol, is a wire protocol. It an
optimized version of the HTTP protocol to allow a standalone web server such as Apache
to talk to Tomcat. The idea is to let Apache serve the static content when possible, but
proxy the request to Tomcat for Tomcat related content.
Conclusion
We have test current environment in different different combination of urls and cluster
node. on this test combination. we have measure HDFS_BYTES_READ (Bytes),Virtual
memory (bytes),Physical memory (bytes) of cluster.

38 of 56
3. Component Configuration
Objective
We are using the lot of open source components for the purpose of creating this search engine.
The components like Hadoop, Nutch and Casandra needs to be configured to achieve what is
required for the purpose of developing the search engine.
After analysis, we have decided to configure best combination of clusters on OVH Dedicated
server and also on development Environment. We have decide to implement the following
o One Master node of Hadoop,
o 4 Slave Node of Hadoop and
o 1 Node of Cassandra .

39 of 56
Configuration Parameters
Hortonworks Configuration
1. Minimum Requirement
● Operation System
○ Red Hat Enterprise Linux (RHEL) v5.x or 6.x (64-bit)
○ CentOS v5.x or 6.x (64-bit)
○ Oracle Linux v5.x or 6.x (64-bit)
○ SUSE Linux Enterprise Server (SLES) 11, SP1 (64-bit)
● Browser Requirements
○ Windows (Vista, 7)
○ Internet Explorer 9.0 and higher (for Vista + Windows 7)
○ Firefox latest stable release
○ Safari latest stable release
○ Google Chrome latest stable release
○ Mac OS X (10.6 or later)
 Firefox latest stable release
 Safari latest stable release
 Google Chrome latest stable release
○ Linux (RHEL, CentOS, SLES, Oracle Linux)
 Firefox latest stable release
 Google Chrome latest stable release
● Software Requirements
○ yum
○ rpm
○ scp
○ curl
○ php_curl
○ wget
○ JDK Requirement
 Oracle JDK 1.6.0_31 64-bit
 Oracle JDK 1.7 64-bit
 Open JDK 7 64-bit
2. Set Up Password-less SSH
 Generate public and private SSH keys on the Ambari Server host.
o ssh-keygen
 Copy the SSH Public Key (.ssh/id_rsa.pub) to the root account on your target
o scp /root/.ssh/id_rsa.pub <username>@<hostname>:/root/.ssh

40 of 56
 Add the SSH Public Key to the authorized_keys file on your target hosts.
o cat id_rsa.pub >> authorized_keys
o .......................directory (to 700) and the authorized_keys file in that directory
(to 600) on the target hosts.
o chmod 700 ~/.ssh
o chmod 600 ~/.ssh/authorized_keys
 From the Ambari Server, make sure you can connect to each host in the cluster using
SSH.
o ssh root@{remote.target.host}
3. Enable ntp
 If not installed then install
o yum install ntp
o chkconfig ntpd on
o ntpdate 0.centos.pool.ntp.org
o service ntpd start
4. Check DNS
 Edit Host file
o Open host file on every host in your cluster
 vi /etc/hosts
o Add a line for each host in your cluster. The line should consist of the IP address
and the FQDN. For example:
 1.2.3.4 fully.qualified.domain.name
 Set Hostname
o Use the "hostname" command to set the hostname on each host in your cluster.
For example:
hostname fully.qualified.domain.name
o Confirm that the hostname is set by running the following command:
 hostname -f
 Edit the Network Configuration File
o Using a text editor, open the network configuration file on every host. This file is
used to set the desired network configuration for each host. For example:
 vi /etc/sysconfig/network
 Modify the HOSTNAME property to set the fully.qualified.domain.name.
NETWORKING=yes
NETWORKING_IPV6=yes
HOSTNAME=fully.qualified.domain.name
5. Configuring Iptables
 Temporary disable iptables

41 of 56
chkconfig iptables off
/etc/init.d/iptables stop
Note: You can restart iptables after setup is complete.
6. Disable SELinux and PackageKit and check the umask Value
● SELinux must be temporarily disabled for the Ambari setup to function. Run the following
command on each host in your cluster:
o setenforce 0
● On the RHEL/CentOS installation host, if PackageKit is installed, open
/etc/yum/pluginconf.d/refresh-packagekit.conf with a text editor and make this change:
o enabled=0
● Make sure umask is set to 022.

42 of 56
Installing and running Ambari Server
1. Log into the machine that serves the Ambari Server as root. You may login and sudo as su
if this is what your environment requires. This machine is the main installation host.
2. Download the the Ambari repository file and copy it to your repos.d.
Platform Access
RHEL, CentOS,
and Oracle
Linux 5
wget http://public-repo-1.hortonworks.com/ambari/centos5/1.x/updates/1.4.1.61/ambari.repo
cp ambari.repo /etc/yum.repos.d
RHEL, CentOS
and Oracle
Linux 6
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.1.61/ambari.repo
SLES 11 wget http://public-repo-1.hortonworks.com/ambari/suse11/1.x/updates/1.4.1.61/ambari.repo
Table I.2.1. Download the repo
3. Install ambari server on master
yum install ambari-server
4. Set up the Master Server
ambari-server setup
o If you have not temporarily disabled SELinux, you may get a warning. Enter ‘y’ to
continue.
o By default, Ambari Server runs under root. If you want to create a different user to
run the Ambari Server instead, or to assign a previously created user, select y at
Customize user account for ambari-server daemon and give the prompt the
username you want to use.
o If you have not temporarily disabled iptables you may get a warning. Enter y to
continue. See Configuring Ports for (2.x) or (1.x) for more information on the ports
that must be open and accessible.
o Agree to the Oracle JDK license when asked. You must accept this license to be
able to download the necessary JDK from Oracle. The JDK is installed during the
deploy phase.
Note: By default, Ambari Server setup will download and install Oracle JDK 1.6. If you
plan to download this JDK and install on all your hosts, or plan to use a different
version of the JDK, skip this step and see Setup Options for more information
o At Enter advanced database configuration:

43 of 56
 To use the default PostgreSQL database, named ambari, with the default
username and password (ambari/bigdata), enter n.
 To use an existing Oracle 11g r2 instance or to select your own database
name, username and password for either database, enter y.
 Select the database you want to use and provide any information required
by the prompts, including hostname, port, Service Name or SID,
username, and password.
o Setup completes
5. Start the Ambari Server
1) To start the Ambari Server:
o ambari-server start
2) To check the Ambari Server processes:
o ps -ef | grep Ambari
3) To stop the Ambari Server:
o ambari-server stop
6. Installing, Configuring and deploying cluster
1) Step 1: Point your browser to http://{main.install.hostname}:8080.
2) Step 2: Log in to the Ambari Server using the default username/password:
admin/admin.
3) Step 3: At welcome screen, type a name for the cluster you want to create in the text
box. No white spaces or special characters can be used in the name.
Select version of hdp and click on next.
4) Step 4: At Install option:
o Use the Target Hosts text box to enter your list of host names, one per line. You
can use ranges inside brackets to indicate larger sets of hosts. For example, for
host01.domain through host10.domain use host[01-10].domain
o If you want to let Ambari automatically install the Ambari Agent on all your hosts
using SSH, select Provide your SSH Private Key and either use the Choose File
button in the Host Registration Information section to find the private key file that
matches the public key you installed earlier on all your hosts or cut and paste the
key into the text box manually.
o Fill in the username for the SSH key you have selected. If you do not want to use
root, you must provide the username for an account that can execute sudo
without entering a password
o If you do not want Ambari to automatically install the Ambari Agents, select
Perform manual registration. See Appendix: Installing Ambari Agents Manually for
more information.
o Advanced Options

44 of 56
a) If you want to use a local software repository (for example, if your installation
does not have access to the Internet), check Use a Local Software Repository.
For more information on using a local repository see Optional: Configure the
Local Repositories
b) Click the Register and Confirm button to continue.
5) Step 5: Confirm hosts
If any hosts get warning, Click Click here to see the warnings to see a list of what was
checked and what caused the warning. On the same page you can get access to a
python script that can help you clear any issues you may encounter and let you run
Rerun Checks.
Python script for clear host:
python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py
6) When you are satisfied with the list of hosts, click Next.
7) Step 7: Choose services
8) Step 8: Assign masters
9) Step 9: Assign slaves and clients
10) Step 10: Customize Services
o Add property in hbase custom_site.xml
o hbase.data.umask.enable = true
o Add nagios password and email address for notification.
11) Step 11: Review it and install.
Nutch configuration on hadoop Master Node
● Download Nutch
○ wget http://www.eu.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
● Untar this file Nutch tar File
○ tar -vxf apache-nutch-2.2.1-src.tar.gz
● Export Nutch Class path
○ export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1
○ export PATH=$NUTCH_HOME/runtime/deploy/bin
● change /$NUTCH_HOME/conf as below
○ Add property in nutch-site.xml file
org.apache.gora.cassandra.store.CassandraStore
<property>
<name>org.apache.gora.cassandra.store.CassandraStore</name>
<value>hdfs://master:9001/hbase</value>
</property>
<property>
<name>http.agent.name</name>

45 of 56
<value>GeliyooBot</value>
</property>
<property>
<name>http.robots.agents</name>
<value>GeliyooBot.*</value>
</property>
○ add property in gora-cassandra.property file
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160
○ Add Dependency in $NUTCH_HOME/ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.3" conf="*-
>default" />
● go to nutch installation folder($NUTCH_HOME) and run
ant clean
ant runtime

46 of 56
Cassandra Configuration
● Download the DataStax Community tarball
curl -L http://downloads.datastax.com/community/dsc.tar.gz | tar xz
● Go to the install directory:
○ $ cd dsc-cassandra-2.0.x
● Start Cassandra Server
○ $ sudo bin/cassandra
● Verify that DataStax Community is running. From the install:
○ $ bin/nodetool status
Install GUI Client for Cassandra
● Download WSO2 Carbon Server
○ wget https://www.dropbox.com/s/m00uodj1ymkpdzb/wso2carbon-4.0.0-
SNAPSHOT.zip
● Extract zip File
● Start WSO2 Carbon Server
○ Go to $WSO2_HOME/bin
○ sh wso2server.sh -Ddisable.cassandra.server.startup=true
and log in with default username and password (admin, admin)
List Key Spaces.

47 of 56
ElasticSearch Configuration
● Download ElasticSearch
○ wget
https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-
0.19.4.tar.gz
● Untar file of ElasticSearch
○ tar -vxf elasticsearch-0.19.4.tar.gz
● Start the ElasticSearch server in the foreground
○ bin/elasticsearch -f
● User Interface of ElasticSearch
○ Index information
○ Index Data

49 of 56
Run Nutch jobs on Hadoop Master Node
● Create a directory in HDFS to upload the seed urls.
○ hadoop dfs -mkdir urls
-urls HDFS directory Name
● Create a text file with the seed URLs for the crawl.
○ hadoop dfs -put seed.txt urls
● Run Inject job
○ nutch inject urls
● Run generate job
○ nutch generate -topN N
● Run nutch fetch
○ nutch fetch -all
● Run nutch parse job
○ nutch parse -all
● Run nutch updatedb job
○ nutch updatedb
Conclusion
After configure all this framework. we achieved basic crawl and basic text search.Now we are
ready for crawl billions of urls and index it. after indexing this content to elasticSearch we can get
text Result as json format .we were use CURL for fetching data from
elasticsearch.when we pass some parameter using CURL. we were get json result it had
Content.url.Content Type,digest,

50 of 56
4. Development
Objective
Main goal of this Development is to implement intermediate api. this api is communicate with
Geliyoo Search engine. when user pass some Query to Geliyoo Search UI then pass this Query to
GeliyooSearchApi. Base on this Query this api get result from Elasticsearch and return back to
user.

51 of 56
Development Completed
Prototype
We focused on the user side of the application, i.e. the basic search engine development and
hence we decided to work on the prototype development of the same first. So for that we made
following two prototypes for this web application.
● Search Page
● Result Page
We will make more prototypes as we continue further development.
Implementation :
Implementation has basically four main parts:
1) Configuration of the selected components, which we covered in the previous topic,
2) The Web API development
3) The extension of these components that will allow extended searching (i.e. the search
that is not provided by these components)
4) The web application.
The Web Application :
With above prototypes, we implement basic search as below,

52 of 56
To search information for any word or text, user need to enter that text in search box as shown in
image.
Search functionality will start as soon as user enter any single character in the search box. Search
results will be displayed as above image.
Following things should be noticed in the above image
● Titles: Titles are links pointing to the urls containing information regarding searched word.
● Highlighted word: Words or text searched by user will be highlighted in the results.
● Pagination: Below on the screen there is a pagination. Each page will show 10 results on it.
So user can easily navigate between pages for desired results and no need to do much
scrolling on a single page.
● Search Box: Above on the screen there is a search box. User can edit the text or word he
searched for. He can also search for new word or text using this search box. So using this
search box, user need not to go back to search page for new search.

53 of 56
If there will no information for the word or text search by user, we will display message as above.
REST API
When user makes request for searching, crawling or indexing, it is call to web server api for
crawling which is deployed in another server, which server have hadoop master and connecting
with all hadoop slave. Nutch will manage this hadoop cluster by giving job for crawling and
indexing. This part is still remains and cover in future development.

54 of 56
Currently we are working on part as figure above from overall architecture. When user submit
query for search, web application calls to restful api. An API is responsible for web search based on
query. API builds query and call to elasticsearch cluster for searching result using jest client. We
have developed for web searching, a keywords is enters by users as query and get list of web urls
with small content of that site contains keywords with highlight as result. Each query is stored in
cassandra database for semantic search functionality.
We are working on image searching based on keywords. For this we need to crawl and index all
web images. Apache nutch 2.2 is unable to crawl images. We tried to add many parser plugins for
parsing images, gone through tika parser and modified code for crawling procedure and enable to
fetch images and parse it for create indexes.
Jest is a Java HTTP Rest client for ElasticSearch. As mention in above section ElasticSearch is an
Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Apache Lucene.
ElasticSearch already has a Java API which is also used by ElasticSearch internally, but Jest fills a
gap, it is the missing client for ElasticSearch Http Rest interface. Jest client is request elasticsearch
cluster for result. A json is returns from this api and then it is forwarded to web application from
where request is initiated.
As result of this search API returns total number of pages, total result and list of all founded web
site, their content and titles.
We deployed web application and web api in different server for load balancing of requests to
server.
Future Development
Content Searching
Currently we are working on basic search functionality. For that we crawl websites and save their
text contents. When user will search for any text or word, we will use these contents to get
results. So results are limited to the text contents of websites. No doubt, currently users are
allowed to search any text or word in any language with this functionality.
We are planning that once we fully achieve this basic search functionality, we will work on the
functionalities using which search on all the information of a content like name, text information,
metadata will be possible and user will also allow for specific search for the following categories.
 Image search
 Video search
 News search
 Sports search
 Audio search

55 of 56
 Forum search
 Blog search
 Wiki search
 Pdf search
 Ebay, Amazon, Twitter, iTunes search
Using these functionalities, user can make more specific search and get desired results more
faster. For that we will crawl whole websites with images, videos, news etc and save their
information like name, url, metadata and content-type. Now, when user will search for any text or
word, we will use this information to get search results. It means, because of these functionalities
search will be possible each and every information of a content, and user will get best results.
When user wants to make specific search, we will make this kind of search using content-type of
saved information to get the results. For e.g. user wants to search images only, we will use
content-type equal to image and go through our saved information for images only. It is important
to note that we will search for the images but we will search the entered text in images' name, url,
metadata etc for results.
Semantic Search
We will use "Semantic search" concept to improve our search functionality so that user will get
desired result more faster. Semantic search seeks to improve search accuracy by understanding
searcher intent and the contextual meaning of terms as they appear in the searchable dataspace.
Semantic search systems consider various points including context of search, location, intent,
variation of words, synonyms, generalized and specialized queries, concept matching and natural
language queries to provide relevant search results. We will save users url, country, browser, time
etc information and text for which user searching. When search for any information we will use his
passed searches and history to get more user specific results.

56 of 56
Prototypes
Protypes for some of future development may be as below :
Image search
Video search

Arama Motoru Geliyoo'nun düzenlemesi hakkında.

Recommended

Recommended

More Related Content

Similar to Arama Motoru Geliyoo'nun düzenlemesi hakkında.

Similar to Arama Motoru Geliyoo'nun düzenlemesi hakkında. (20)

Recently uploaded

Recently uploaded (20)

Arama Motoru Geliyoo'nun düzenlemesi hakkında.