About the revisions to the search engine Geliyoo

An article, written in English, about the latest changes to the search engine Geliyoo: the work carried out and some assessments of the most recent changes made to it.

Document Transcript

GELIYOO.COM Search Engine Development Project Description

Buray Savas ANIL, 3/1/2014
Table of Contents

Introduction
Phases of Development
1. Component Selection
   Objective
   Components Considered
      1. Apache Nutch
      2. Hadoop
      3. Apache Hortonworks
      4. MongoDB
      5. HBase
      6. Cassandra
      7. Elastic Search
      8. Apache Solr
      9. Spring MVC
      10. Easy-Cassandra
   Conclusion
2. Architecture Design
   Objective
   System Design
      1. Web Server for Geliyoo.com
      2. Web Server for hosting WebService API
      3. ElasticSearch Cluster
      4. Hadoop Cluster
      5. Cassandra Cluster
      6. Horizontal Web Server Clustering
   Conclusion
3. Component Configuration
   Objective
   Configuration Parameters
      Hortonworks Configuration
      Installing and running Ambari Server
      Nutch configuration on Hadoop Master Node
      Cassandra Configuration
      ElasticSearch Configuration
      Run Nutch jobs on Hadoop Master Node
   Conclusion
4. Development
   Objective
   Development Completed
      Prototype
      Implementation
   Future Development
      Content Searching
      Semantic Search
      Prototypes
      Video search
Geliyoo Search Engine Project Documentation

Introduction:

We are developing a semantic search engine that will be available to the general user for searching the internet. The product can also be customized for installation on a company's intranet, where it helps users search the documents and images made available for general access by each individual and by the company as a whole.

The objective is to create a semantic search engine. Searching requires data from many different websites, so we need to crawl them to collect that data. The data is stored in a large data store, and it must be indexed before it can be searched. This whole process requires different components for different tasks. The semantic search engine has the following three major components:

1. Crawling
Web crawling is harvesting web content by visiting each website and finding all of its outlinks so their content can be fetched too. Crawling is a continuous process that fetches web content up to the Nth depth of a website; many sites restrict it via robots.txt. Web content means all text content, images, documents, and so on: in short, everything available on a website. We need a tool that fetches all of this content, parses it by MIME type, and finds outlinks to repeat the process. Crawlers themselves are rather dumb processes that fetch content supplied by web servers answering HTTP requests for requested URIs. Crawlers get their URIs from a crawling engine that is fed from different sources, including links extracted from previously crawled web documents.

2. Indexing
Indexing means making sense of the retrieved content and storing the processing results in a document index. All harvested data must be indexed before it can be searched.

3. Searching
We need a component that returns efficient results for a search query. Searching over the large indexed data set must be fast and must return all relevant results.
Phases of Development:

Search engine development follows these phases:

1. Component Selection: Select the components that are useful for implementing the search engine.
2. Architecture Design: Design a system architecture that allows both internet-based and intranet-based search.
3. Component Configuration: Configure the selected components to our requirements so that they support the search process.
4. Development: Develop the search engine web application and the remaining parts of the system that the selected components do not provide.
5. Future Development: The tasks that still need development are listed here.
1. Component Selection

Objective

Many open source components are available that can help us develop a search engine. Instead of creating everything from scratch, we planned to use some of these components and to customize and extend them to our requirements. This saves a lot of the time and money it would take to recreate things that have already been developed. To do so, we need to identify the right components that match our requirements.
Components Considered

We went through many tools to achieve the project's objective.

1. Apache Nutch

The first component we evaluated was Apache Nutch, for crawling website links. Apache Nutch is an open source web crawler written in Java. Using it, we can find webpage hyperlinks in an automated manner, reduce a lot of maintenance work (for example, checking broken links), and create a copy of all the visited pages for searching over. It is a highly scalable and relatively feature-rich crawler. It can easily crawl a large number of web pages, find their inverted links, and crawl those again. It integrates easily with Hadoop, Elastic Search, and Apache Cassandra.

Fig 1. Basic Workflow of Apache Nutch

List of Nutch Jobs

1. Inject
The nutch inject command adds a list of seed URLs to the database for your crawl. It takes URL seed files. We can define URL validation rules with Nutch, which are checked during the inject and parse operations; URLs that do not validate are rejected, while the rest are inserted into the database.
 Command: bin/nutch inject <url_dir>
 Example: bin/nutch inject urls

2. Generate
The nutch generate command takes the list of outlinks produced in a previous cycle, promotes them to the fetch list, and returns a batch ID for this cycle. You will need this batch ID for subsequent calls in the cycle. The number of top URLs to select can be set by passing a top score as an argument to this operation.
 Command: bin/nutch generate <batch_id> or -all
 Example: bin/nutch generate -all

3. Fetch
The nutch fetch command crawls the pages listed in the column family and writes their contents out into new columns. We need to pass in the batch ID from the previous step, or the value 'all' if we want to fetch all URLs.

4. Parse
The nutch parse command loops through all the pages, analyzes the page content to find outgoing links, and writes them out to another column family.

5. Updatedb
The nutch updatedb command takes the URL values from the previous stage and places them into another column family, so they can be fetched in the next crawl cycle.

Features
● Fetching and parsing are done separately by default; this reduces the risk of an error corrupting the fetch or parse stage of a crawl.
● Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
● Easily configurable and portable.
● Extra plugins can be created or added to extend its functionality.
● Validation rules are available to restrict unwanted websites or content.
● A Tika parser plugin is available for parsing all content types.
● The OPIC scoring plugin or the LinkRank plugin is used to calculate webpage rank with Nutch.
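The job sequence above can be strung together into a small driver script. This is only an illustrative sketch, assuming a Nutch 2.x installation in the current directory and a local `urls` seed directory as in the examples above; the loop count of three is an arbitrary stand-in for the desired crawl depth.

```shell
#!/bin/sh
# Sketch of a Nutch 2.x crawl: inject once, then repeat the
# generate/fetch/parse/updatedb cycle once per link depth level.
# Paths and the depth value are examples, not the project's settings.

bin/nutch inject urls              # seed the database (done only once)

for depth in 1 2 3; do             # three link levels deep (hypothetical)
  bin/nutch generate -all          # promote outlinks to the fetch list
  bin/nutch fetch -all             # download the generated pages
  bin/nutch parse -all             # extract text and outgoing links
  bin/nutch updatedb -all          # feed new outlinks into the next cycle
done
```

Passing a stored batch ID instead of `-all` would restrict each step to the URLs generated in the current cycle, as described above.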
2. Hadoop

Hadoop refers to the overall system that runs jobs in parallel on one or more machines, distributes tasks (pieces of these jobs), and stores data in a parallel and distributed fashion. A Hadoop cluster has multiple processing nodes, including master nodes and slave nodes. It has its own filesystem, called HDFS. HDFS is managed through a dedicated NameNode server that hosts the filesystem index, plus a secondary NameNode that can generate snapshots of the NameNode's memory structures. HDFS manages replication across one or more machines, so if data is lost from one node it can be recovered from another node automatically.

Hadoop is easily configured to work with Apache Nutch. All Nutch crawling and indexing processes are then performed in parallel on different nodes to decrease processing time: Nutch submits its operations as jobs to Hadoop, Hadoop performs them, and the results are returned to Nutch.

HDFS user manual screen: browse the HDFS file system and HDFS storage information.
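When Nutch runs on a Hadoop cluster, its input (such as the seed list) lives in HDFS rather than on local disk. A minimal sketch of staging a seed file; the paths and file name are hypothetical, not taken from the project:

```shell
# Stage a local seed list into HDFS so a Hadoop-based Nutch job can
# read it. Paths are placeholders for illustration only.
hadoop fs -mkdir -p /user/nutch/urls
hadoop fs -put seed.txt /user/nutch/urls/
hadoop fs -ls /user/nutch/urls        # verify the upload
```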
Nutch running job information.
3. Apache Hortonworks

Hortonworks Data Platform (HDP) is an open source, fully tested and certified Apache™ Hadoop® data platform. It is designed to facilitate integrating Apache Hadoop with an enterprise's existing data architecture. We can say that HDP is a bundle of all the components that provide reliable access to Hadoop clustering.

The Apache Ambari project aims to make Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its REST APIs. HDP saves time in managing a Hadoop cluster by providing an attractive web UI: we can easily scale the cluster from the web application, and we can analyse the performance and health of Hadoop jobs and of the cluster through various graphs, such as memory usage, network usage, cluster load, and CPU usage.
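The same cluster information shown in the web UI can be fetched from Ambari's REST API. A sketch assuming Ambari's default port 8080 and default admin/admin credentials; the host name and cluster name are placeholders:

```shell
# List the clusters Ambari manages, then the hosts of one cluster.
# "ambari-host" and the cluster name "geliyoo" are example values.
curl -u admin:admin http://ambari-host:8080/api/v1/clusters
curl -u admin:admin http://ambari-host:8080/api/v1/clusters/geliyoo/hosts
```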
4. MongoDB

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. A record in MongoDB is a document: a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents. The advantages of using documents are:
● Documents (i.e. objects) correspond to native data types in many programming languages.
● Embedded documents and arrays reduce the need for expensive joins.
● Dynamic schemas support fluent polymorphism.

Features:
1. High Performance
MongoDB provides high-performance data persistence. In particular:
 Support for embedded data models reduces I/O activity on the database system.
 Indexes support faster queries and can include keys from embedded documents and arrays.
2. High Availability
To provide high availability, MongoDB's replication facility, called replica sets, provides:
 automatic failover
 data redundancy
A replica set is a group of MongoDB servers that maintain the same data set, providing redundancy and increasing data availability.
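The document model described above can be sketched in the mongo shell. The database, collection, and field names below are invented for illustration (MongoDB was evaluated for this project but not selected):

```shell
# Insert one document with an embedded array, then read it back.
# "crawl", "pages" and all field names are illustrative only.
mongo --eval '
  db = db.getSiblingDB("crawl");
  db.pages.insert({url: "http://example.com",
                   fetched: new Date(),
                   outlinks: ["http://example.com/about"]});
  printjson(db.pages.findOne({url: "http://example.com"}));
'
```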
5. HBase

HBase is a column-oriented database that is an open-source implementation of Google's BigTable storage architecture. It can manage structured and semi-structured data and has built-in features such as scalability, versioning, compression, and garbage collection. Since it uses write-ahead logging and a distributed configuration, it can provide fault tolerance and quick recovery from individual server failures. HBase is built on top of Hadoop/HDFS, and the data stored in HBase can be manipulated using Hadoop's MapReduce capabilities.

HBase Architecture:
The HBase physical architecture consists of servers in a master-slave relationship, as shown below. Typically, an HBase cluster has one master node, called the HMaster, and multiple region servers, called HRegionServers. Each region server contains multiple regions.

Regions
Just like in a relational database, data in HBase is stored in tables, and these tables are stored in regions. When a table becomes too big, it is partitioned into multiple regions, which are assigned to region servers across the cluster.

HBase Components
1. HMaster
 Performing administration
 Managing and monitoring the cluster
 Assigning regions to the region servers
 Controlling load balancing and failover
2. HRegionServer
 Hosting and managing regions
 Splitting regions automatically
 Handling read/write requests
 Communicating with clients directly

Features
 Linear and modular scalability.
 Strictly consistent reads and writes.
 Automatic and configurable sharding of tables.
 Automatic failover support between region servers.
 Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
 An easy-to-use Java API for client access.
 Block cache and Bloom filters for real-time queries.
 Query predicate push-down via server-side filters.
 A Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options.
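The table and column-family model can be sketched in the HBase shell. The layout below loosely echoes the "webpage" table idea used by Nutch's HBase backend, but the table name, families, row key, and values here are illustrative, not the project's actual schema:

```shell
# Create a table with two column families, write one row, read it,
# and scan a few rows. All names and values are examples.
echo "create 'webpage', 'content', 'outlinks'
put 'webpage', 'com.example:http/', 'content:raw', '<html>...</html>'
get 'webpage', 'com.example:http/'
scan 'webpage', {LIMIT => 5}" | hbase shell
```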
6. Cassandra

Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers. Cassandra is designed with peer-to-peer symmetric nodes, instead of master or named nodes, to ensure there can never be a single point of failure. Cassandra automatically partitions data across all the nodes in the database cluster, and we can add any number of nodes.

Features
1. Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.
2. Replication, including multi-data-center replication
Replication strategies are configurable. Cassandra is designed as a distributed system for deploying large numbers of nodes across multiple data centers. Key features of Cassandra's distributed architecture are specifically tailored for multi-data-center deployment: redundancy, failover, and disaster recovery.
3. Scalability
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
4. Fault tolerance
Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
5. MapReduce support
Cassandra has Hadoop integration, with MapReduce support.
6. Query language
CQL (Cassandra Query Language) is a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC).

Replication in Cassandra
Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. When you create a keyspace in Cassandra, you must choose the replica placement strategy: the number of replicas and how those replicas are distributed across the nodes in the cluster.
The replication strategy relies on the cluster-configured snitch to help it determine the physical location of nodes and their proximity to each other.
Replication Strategies:
1. Simple Strategy: SimpleStrategy is the default replica placement strategy when creating a keyspace using the Cassandra CLI. It places the first replica on a node determined by the partitioner; additional replicas are placed on the next nodes clockwise in the ring, without considering rack or data center location.

Fig: Simple Strategy diagram

2. Network Topology Strategy: As the name indicates, this strategy is aware of the network topology (the location of nodes in racks, data centers, etc.) and is much more intelligent than SimpleStrategy. It is a must if your Cassandra cluster spans multiple data centers, and it lets you specify how many replicas you want per data center. It tries to distribute data among racks to minimize failures: when choosing nodes to store replicas, it will try to find a node on a different rack.
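The strategy is fixed per keyspace at creation time. A cqlsh sketch showing both strategies; the keyspace names, data-center names, and replica counts are invented examples, not the project's configuration:

```shell
# Create one keyspace per strategy. SimpleStrategy ignores topology;
# NetworkTopologyStrategy takes a replica count per data center.
cqlsh -e "
CREATE KEYSPACE crawl_simple
  WITH replication = {'class': 'SimpleStrategy',
                      'replication_factor': 3};

CREATE KEYSPACE crawl_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'dc1': 3, 'dc2': 2};"
```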
7. Elastic Search
 Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
 It integrates easily with Apache Nutch, which uses it for indexing web pages. Indexed data is stored on its file system.
 Elasticsearch is distributed, which means that indices can be divided into shards, and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically.
 A series of distinct Elasticsearch instances (or nodes) works in a coordinated manner without much administrative intervention at all. Clustering Elasticsearch nodes provides data redundancy as well as data availability.
 Indexed data is stored in the file system of the nodes within the cluster. Elasticsearch provides a full JSON-based query language. In general, there are basic queries such as term or prefix. There are also compound queries, like the bool query. Queries can also have filters associated with them, such as the filtered or constant_score queries, with specific filter queries. A query is passed to the Elasticsearch cluster, which matches the query parameters and returns the results.

Features
 First, because it has a rich RESTful HTTP API, it is trivial to query Elasticsearch with Ajax. (Elasticsearch further supports JavaScript developers with cross-origin resource sharing by sending an Access-Control-Allow-Origin header to browsers.)
 Second, since Elasticsearch stores schema-free documents serialized as JSON ("JavaScript Object Notation", and so obviously a native entity in JavaScript code), it can be used not only as a search engine, but also as a persistence engine.
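A query of the kind described above can be issued over the REST API. This is a hedged sketch against a local Elasticsearch node of that era; the index name and field names are placeholders, not the project's schema:

```shell
# A bool query combining a term clause with a prefix clause, as
# described in the text. Index "webpages" and its fields are examples.
curl -XPOST 'http://localhost:9200/webpages/_search' -d '{
  "query": {
    "bool": {
      "must":   [ { "term":   { "content": "geliyoo" } } ],
      "should": [ { "prefix": { "title": "search" } } ]
    }
  }
}'
```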
8. Apache Solr
 Apache Solr is an open source search platform built on a Java library called Lucene.
 Solr is a popular search platform for websites because it can index and search multiple sites and return recommendations for related content based on the search query's taxonomy. It is also a popular platform for enterprise search because it can index and search documents and email attachments.
 Solr works over Hypertext Transfer Protocol (HTTP) with Extensible Markup Language (XML), and it offers application program interfaces (APIs) for JavaScript Object Notation (JSON), Python, and Ruby. According to the Apache Lucene project, Solr offers capabilities that have made it popular with administrators, including:
o Indexing in near real time
o Automated index replication
o Server statistics logging
o Automated failover and recovery
o Rich document parsing and indexing
o Multiple search indexes
o User-extensible caching
o Design for high-volume traffic
o Scalability, flexibility and extensibility
o Advanced full-text searching
o Geospatial searching
o Load-balanced querying
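For comparison with the Elasticsearch request style, a Solr query is a plain HTTP GET against a core's select handler. The core name and field names here are placeholders (Solr was evaluated for this project but not selected):

```shell
# Query Solr's select handler for up to 10 JSON-formatted results.
# Core "webpages" and field "content" are example names.
curl 'http://localhost:8983/solr/webpages/select?q=content:geliyoo&wt=json&rows=10'
```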
9. Spring MVC
Spring MVC is the web component of the Spring Framework. The Spring Framework is a Java platform that provides comprehensive infrastructure support for developing Java applications: Spring handles the infrastructure so that one can focus on the application. It provides rich functionality for building robust web applications. The Spring MVC framework is architected and designed so that every piece of logic and functionality is highly configurable.

The following is the request-processing lifecycle of Spring 3.0 MVC.
* Here, the user needs to define a BeanNameUrlHandlerMapping / SimpleUrlHandlerMapping, etc., which implements the HandlerMapping interface.
** Here, you can define multiple controllers, such as SimpleFormController / MultiActionController, etc., which ultimately implement the Controller interface.

Features
 Spring enables developers to build enterprise-class applications using POJOs. The benefit of using only POJOs is that no EJB container product (such as an application server) is needed; a robust servlet container such as Tomcat, or a commercial product, is enough.
 Spring is organized in a modular fashion. Even though the number of packages and classes is substantial, one only needs to worry about the ones needed and can ignore the rest.
 Spring does not reinvent the wheel; instead, it makes use of existing technologies: several ORM frameworks, logging frameworks, JEE, Quartz and JDK timers, and other view technologies.
 Testing an application written with Spring is simple, because environment-dependent code is moved into the framework. Furthermore, by using JavaBean-style POJOs, it becomes easier to use dependency injection for injecting test data.
 Spring's web framework is a well-designed web MVC framework, which provides a great alternative to web frameworks such as Struts and other over-engineered or less popular frameworks.
 Spring provides a convenient API to translate technology-specific exceptions (thrown by JDBC, Hibernate, or JDO, for example) into consistent, unchecked exceptions.
 IoC containers tend to be lightweight, especially when compared to EJB containers, for example. This is beneficial for developing and deploying applications on computers with limited memory and CPU resources.
 Spring provides a consistent transaction management interface that can scale down to a local transaction (using a single database, for example) and scale up to global transactions (using JTA, for example).
 Spring has the @Async annotation, which lets the necessary processes run asynchronously. This feature is very useful for the Geliyoo search engine, to minimize search time.
10. Easy-Cassandra
We use Cassandra, which is NoSQL, to save and retrieve data, so we need an integration between Spring MVC and Cassandra. For that we use the Easy-Cassandra API. Easy-Cassandra is an ORM framework and high-level client for Apache Cassandra in Java. Using it, information from Java objects can be persisted in an easy way: to persist information, you add annotations to certain fields and classes. It works as an abstraction tier over Thrift, making the calls to Cassandra. Easy-Cassandra uses the Thrift implementation, and its main objective is to be a simple ORM (object-relational mapper).

Features
 An ORM that is easy to use with Cassandra.
 Only some annotations on a class are needed to persist information.
 Persists many Java objects in an extremely easy way (e.g. all primitive types, java.lang.String, java.math.BigDecimal, java.io.File, etc.).
 Compatible with CQL 3.0.
 Under the Apache version 2.0 license.
 Supports JPA 2.0 annotations.
 Works with multiple nodes.
 Complex row keys (a key composed of two or more columns).
 Maps some collections (java.util.List, java.util.Set, java.util.Map).
 Automatically finds the other nodes that form part of the same cluster.
 May use multiple keyspaces simultaneously.
 Integrates with Spring.
Conclusion
We had different options for the NoSQL database, and we compared them based on the features we need for the development of this project. The following is the component feature compatibility comparison.

Hortonworks support
  HBase: 0.96.4 | MongoDB: no support | Cassandra: no support

Implementation language
  HBase: Java | MongoDB: C++ | Cassandra: Java

Best used
  HBase: Hadoop is probably still the best way to run MapReduce jobs on huge datasets; best if you use the Hadoop/HDFS stack already.
  MongoDB: if you need dynamic queries; if you prefer to define indexes, not map/reduce functions; if you need good performance on a big DB; if you wanted CouchDB, but your data changes too much, filling up disks.
  Cassandra: when you write more than you read (logging); if every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

Main point
  HBase: billions of rows X millions of columns.
  MongoDB: retains some friendly properties of SQL (query, index).
  Cassandra: best of BigTable and Dynamo.

Server-side scripts
  HBase: yes | MongoDB: JavaScript | Cassandra: no

Replication method
  HBase: selectable replication factor | MongoDB: master-slave replication | Cassandra: selectable replication factor

Consistency concepts
  HBase: immediate consistency | MongoDB: eventual or immediate consistency | Cassandra: eventual or immediate consistency

Nutch support
  HBase: 0.90.4 | MongoDB: 2.22 | Cassandra: 2.2

Hadoop support
  HBase: 1.2.1 | MongoDB: 1.1.X | Cassandra: 2.2
Apache Solr vs. Elastic Search
 Elasticsearch was released specifically to make up for the lacking distributed features of Solr. For this reason, it can be easier and more intuitive to start up an Elasticsearch cluster than a SolrCloud cluster.
 Elasticsearch will automatically load-balance and move shards to new nodes in the cluster. This automatic shard rebalancing behavior does not exist in Solr.
 There was an issue with Solr + Nutch in making Solr distributed, hence we chose Elastic Search for its strong distribution and search query features.

Final Component Selection:
We went through all of the above components and, based on our requirements and their respective features, finalized the following:

Parallel processing: Apache Hadoop
Crawling: Apache Nutch
NoSQL: Cassandra
Searching: Elastic Search
MVC: Spring MVC
ORM: EasyCassandra
2. Architecture Design

Objective
To identify an architecture that will meet all the project requirements for the Geliyoo search engine development. The design will be based on the components we selected and on the configurable items they provide. We also need to consider non-functional factors such as the number of requests per second and the number of active users. Since there is a fair chance that this site will be under heavy load, the architecture must be designed to handle it.
System Design
1. Web Server for Geliyoo.com:

There are three parts to the web application that we propose to develop.
1. Super Admin Panel: This panel will allow the super user to manage various settings of the system and perform functions such as adding URLs, scheduling the crawling and indexing of those URLs, managing users, etc.
2. User Admin Panel: This panel will allow registered administrator users to add the sites they propose to crawl and index, and to see their results.
3. General Users: A general user is a user who is allowed to search the various sites indexed by the Geliyoo search engine. They will be given an interface for searching the web.

Since we expect heavy load on this server, we will have a cluster of web servers for load balancing and high availability.
  • 27. 27 of 56
  • 28. 28 of 56 2. Web Server hosting the Web Service API: This web server hosts the web service API for searching and related functionality. We have separated it from the admin panel functionality in order to isolate the search load. The web services call the ElasticSearch cluster's API to get the search results. 3. ElasticSearch Cluster: Searching
  • 29. 29 of 56 Figure 2.3 Indexed data is stored in the file systems of the nodes within the cluster. ElasticSearch provides a full JSON-based query DSL. In general, there are basic queries such as term or prefix. There are also compound queries like the bool query. Queries can also have filters associated with them, such as the filtered or constant_score queries, with specific filter clauses. The query is passed to the ElasticSearch cluster, which matches the query parameters and returns the result. 4. Hadoop Cluster: This is the most important part of the system. It hosts all the services related to crawling and indexing, as well as the web services for the functionality provided to admin and super admin users.
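As a concrete illustration of the query DSL described above, the sketch below builds a filtered-query body. This is a hedged example: the field names `content` and `lang` are assumptions for illustration, not the project's actual schema, and the curl line only shows how such a body would be submitted to a cluster.

```shell
#!/bin/sh
# Build a "filtered" query body of the kind described above.
# The field names "content" and "lang" are illustrative assumptions.
build_query() {
    term="$1"
    cat <<EOF
{
  "query": {
    "filtered": {
      "query":  { "term": { "content": "$term" } },
      "filter": { "term": { "lang": "en" } }
    }
  }
}
EOF
}

# Against a live cluster the body would be posted with, e.g.:
#   build_query geliyoo | curl -s -XPOST 'http://localhost:9200/_search' -d @-
build_query geliyoo
```

The filtered form matches the query types the text mentions; the filter clause is cached by ElasticSearch and is cheaper than scoring on every request.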
  • 30. 30 of 56 Fig: Hadoop Cluster Diagram Nutch (Crawling & Indexing): For crawling and indexing we will use Nutch. The following shows the current architecture of the Nutch crawler. Basic Components: Nutch Flow:
  • 31. 31 of 56 Two procedures take place in the overall system: 1. Crawling: Crawling is a continuous process. Injection is performed only once, to seed the URLs; all of the other operations are repeated until the desired crawl depth is reached. Each operation submits its job to Hadoop, which performs the tasks in parallel by distributing them among the nodes. The following operations are performed when crawling with Nutch: Inject The nutch inject command adds a list of seed URLs to the database for your crawl. It takes the URL seed files from an HDFS directory. URL validation rules can be defined in Nutch; they are applied during the inject and parse operations, and URLs that fail validation are rejected while the rest are inserted into the database. Generate The nutch generate command takes the list of outlinks generated by a previous cycle, promotes them to the fetch list, and returns a batch ID for this cycle. You will need this batch ID for subsequent calls in this cycle. The number of top-scoring URLs to select is passed as an argument to this operation.
  • 32. 32 of 56 Fetch The nutch fetch command crawls the pages listed in the column family and writes the contents out into new columns. We need to pass in the batch ID from the previous step; we can also pass the value 'all' instead of a batch ID if we want to fetch all URLs. Parse The nutch parse command loops through all the pages, analyzes the page content to find outgoing links, and writes them out to another column family. Update db The nutch updatedb command takes the URL values from the previous stage and places them into another column family, so they can be fetched in the next crawl cycle. 2. Indexing Figure 2.2 Indexing is done by ElasticSearch, which is configured with Nutch; Nutch is responsible for triggering the indexing operation. The elasticindex command takes two mandatory arguments: ● the first argument is the cluster name, and ● the second is either a batch ID (returned by the previous Nutch operations), 'all' (to index all non-indexed data), or 'reindex' (to re-index all data). After the command is executed, Nutch submits the job to Hadoop, which divides it into smaller tasks. Each task stores indexed data on the file systems of the ElasticSearch cluster in a distributed manner.
  • 33. 33 of 56 Hadoop: A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate file system, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the file-system-specific equivalent.
  • 34. 34 of 56 5. Cassandra Cluster A Cassandra cluster contains one or more data centers, and each data center has a number of nodes. Cassandra stores the crawled data in a distributed manner, resulting in good load balancing. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data-center deployment, redundancy, failover, and disaster recovery.
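The multi-data-center replication described above is declared per keyspace. A minimal sketch follows; the keyspace name `webpage`, the data-center names `dc1`/`dc2`, and the replication factors are illustrative assumptions, not the project's actual schema (in this stack, Gora normally creates the keyspace for Nutch):

```shell
#!/bin/sh
# Emit a CQL sketch for a crawl keyspace replicated across two data
# centers. All names and replication factors here are assumptions;
# on a live node the output would be piped into cqlsh.
emit_keyspace_cql() {
    cat <<'EOF'
CREATE KEYSPACE webpage
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 2, 'dc2': 2
  };
EOF
}

# Usage against a live cluster (not run here):
#   emit_keyspace_cql | cqlsh cassandra-node1
emit_keyspace_cql
```

NetworkTopologyStrategy is what gives the per-data-center redundancy and failover behavior the text refers to; SimpleStrategy would ignore data-center boundaries.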
  • 35. 35 of 56 6. Horizontal Web Server Clustering Objective: A machine on which the Geliyoo Search API or the Geliyoo web application is deployed may go down or become slow under heavy traffic. To cope with this, we set up Tomcat server clustering, in which our API and application are deployed on multiple machines (at least more than one), so that if one server in the cluster goes down, the other servers in the cluster can take over -- as transparently to the end user as possible. Process: Under horizontal clustering there can be any number of systems, and on each system one Tomcat server is running. To build the horizontal Tomcat cluster, we use the Apache HTTP Server. The Apache httpd server runs on only one of the systems and controls all the Tomcats running on the other systems, including the one installed on the same system. We also use mod_jk as the load balancer; mod_jk is an Apache module used to connect the Tomcat servlet container with web servers such as Apache.
  • 36. 36 of 56 The Apache HTTP Server and mod_jk can be used to balance load across multiple Tomcat instances, or to divide Tomcat instances into various namespaces managed by the Apache HTTP Server. Requests hit the Apache server in front and are distributed to the backend Tomcat containers depending on load and availability. Clients know of only one IP address (Apache's), but the requests are distributed over multiple containers; this is what you want when you deploy a distributed web application that needs to be robust. By using Apache HTTP as a front end, you let it act as a front door to your content across multiple Apache Tomcat instances. If one of your Tomcats fails, Apache HTTP ignores it. The Tomcats can then each sit in a protected area, and from a security point of view you only need to worry about the Apache HTTP server. Essentially, Apache becomes a smart proxy server. You can load balance multiple instances of your application behind Apache, which allows you to handle more volume and increases stability in the event one of your instances goes down. Apache Tomcat uses Connector components to allow communication between a Tomcat instance and another party, such as a browser, a server, or another Tomcat instance that is part of the same network. Configuring this involves enabling mod_jk in Apache, configuring an AJP connector in your application server, and directing Apache to forward certain paths to the application server via mod_jk.
  • 37. 37 of 56 The mod_jk connector allows httpd to communicate with Apache Tomcat instances over the AJP protocol. AJP (Apache JServ Protocol) is a wire protocol: an optimized version of the HTTP protocol that allows a standalone web server such as Apache to talk to Tomcat. The idea is to let Apache serve static content when possible, but proxy the request to Tomcat for Tomcat-related content. Conclusion We tested the current environment with different combinations of URLs and cluster nodes, and for each combination we measured HDFS_BYTES_READ (bytes), virtual memory (bytes), and physical memory (bytes) across the cluster.
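The mod_jk setup described above is driven by a workers.properties file. A minimal sketch follows; the worker names `tomcat1`/`tomcat2`, the hostnames, the default AJP port 8009, and the balancer name `lb` are illustrative assumptions, not the project's actual topology:

```shell
#!/bin/sh
# Emit a minimal mod_jk workers.properties for a two-Tomcat cluster
# behind one Apache httpd. Host names, ports, and worker names are
# illustrative assumptions to be adapted to the real deployment.
emit_workers_properties() {
    cat <<'EOF'
worker.list=lb

worker.tomcat1.type=ajp13
worker.tomcat1.host=tomcat1
worker.tomcat1.port=8009

worker.tomcat2.type=ajp13
worker.tomcat2.host=tomcat2
worker.tomcat2.port=8009

worker.lb.type=lb
worker.lb.balance_workers=tomcat1,tomcat2
EOF
}

# In httpd.conf, requests would then be routed to the balancer with
# something like:  JkMount /* lb
emit_workers_properties
```

The `lb` worker is what gives the failover behavior: if one AJP backend stops responding, mod_jk routes new requests to the remaining workers.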
  • 38. 38 of 56 3. Component Configuration Objective We are using a number of open-source components to build this search engine. Components like Hadoop, Nutch and Cassandra need to be configured to achieve what is required for developing the search engine. After analysis, we decided to configure the best combination of clusters on the OVH dedicated server and also in the development environment. We decided to implement the following: o one Hadoop master node, o 4 Hadoop slave nodes, and o 1 Cassandra node.
  • 39. 39 of 56 Configuration Parameters Hortonworks Configuration 1. Minimum Requirement ● Operating System ○ Red Hat Enterprise Linux (RHEL) v5.x or 6.x (64-bit) ○ CentOS v5.x or 6.x (64-bit) ○ Oracle Linux v5.x or 6.x (64-bit) ○ SUSE Linux Enterprise Server (SLES) 11, SP1 (64-bit) ● Browser Requirements ○ Windows (Vista, 7) ○ Internet Explorer 9.0 and higher (for Vista + Windows 7) ○ Firefox latest stable release ○ Safari latest stable release ○ Google Chrome latest stable release ○ Mac OS X (10.6 or later)  Firefox latest stable release  Safari latest stable release  Google Chrome latest stable release ○ Linux (RHEL, CentOS, SLES, Oracle Linux)  Firefox latest stable release  Google Chrome latest stable release ● Software Requirements ○ yum ○ rpm ○ scp ○ curl ○ php_curl ○ wget ○ JDK Requirement  Oracle JDK 1.6.0_31 64-bit  Oracle JDK 1.7 64-bit  Open JDK 7 64-bit 2. Set Up Password-less SSH  Generate public and private SSH keys on the Ambari Server host. o ssh-keygen  Copy the SSH public key (.ssh/id_rsa.pub) to the root account on your target hosts. o scp /root/.ssh/id_rsa.pub <username>@<hostname>:/root/.ssh
  • 40. 40 of 56  Add the SSH Public Key to the authorized_keys file on your target hosts. o cat id_rsa.pub >> authorized_keys o .......................directory (to 700) and the authorized_keys file in that directory (to 600) on the target hosts. o chmod 700 ~/.ssh o chmod 600 ~/.ssh/authorized_keys  From the Ambari Server, make sure you can connect to each host in the cluster using SSH. o ssh root@{remote.target.host} 3. Enable ntp  If not installed then install o yum install ntp o chkconfig ntpd on o ntpdate 0.centos.pool.ntp.org o service ntpd start 4. Check DNS  Edit Host file o Open host file on every host in your cluster  vi /etc/hosts o Add a line for each host in your cluster. The line should consist of the IP address and the FQDN. For example:  1.2.3.4 fully.qualified.domain.name  Set Hostname o Use the "hostname" command to set the hostname on each host in your cluster. For example: hostname fully.qualified.domain.name o Confirm that the hostname is set by running the following command:  hostname -f  Edit the Network Configuration File o Using a text editor, open the network configuration file on every host. This file is used to set the desired network configuration for each host. For example:  vi /etc/sysconfig/network  Modify the HOSTNAME property to set the fully.qualified.domain.name. NETWORKING=yes NETWORKING_IPV6=yes HOSTNAME=fully.qualified.domain.name 5. Configuring Iptables  Temporary disable iptables
  • 41. 41 of 56 chkconfig iptables off /etc/init.d/iptables stop Note: You can restart iptables after setup is complete. 6. Disable SELinux and PackageKit and check the umask Value ● SELinux must be temporarily disabled for the Ambari setup to function. Run the following command on each host in your cluster: o setenforce 0 ● On the RHEL/CentOS installation host, if PackageKit is installed, open /etc/yum/pluginconf.d/refresh-packagekit.conf with a text editor and make this change: o enabled=0 ● Make sure umask is set to 022.
  • 42. 42 of 56 Installing and running Ambari Server 1. Log into the machine that serves the Ambari Server as root. You may log in and use sudo or su if this is what your environment requires. This machine is the main installation host. 2. Download the Ambari repository file and copy it to your repos.d. Platform Access RHEL, CentOS, and Oracle Linux 5 wget http://public-repo-1.hortonworks.com/ambari/centos5/1.x/updates/1.4.1.61/ambari.repo cp ambari.repo /etc/yum.repos.d RHEL, CentOS and Oracle Linux 6 wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.1.61/ambari.repo cp ambari.repo /etc/yum.repos.d SLES 11 wget http://public-repo-1.hortonworks.com/ambari/suse11/1.x/updates/1.4.1.61/ambari.repo cp ambari.repo /etc/yum.repos.d Table I.2.1. Download the repo 3. Install the Ambari server on the master yum install ambari-server 4. Set up the Master Server ambari-server setup o If you have not temporarily disabled SELinux, you may get a warning. Enter ‘y’ to continue. o By default, Ambari Server runs under root. If you want to create a different user to run the Ambari Server instead, or to assign a previously created user, select y at Customize user account for ambari-server daemon and give the prompt the username you want to use. o If you have not temporarily disabled iptables you may get a warning. Enter y to continue. See Configuring Ports for (2.x) or (1.x) for more information on the ports that must be open and accessible. o Agree to the Oracle JDK license when asked. You must accept this license to be able to download the necessary JDK from Oracle. The JDK is installed during the deploy phase. Note: By default, Ambari Server setup will download and install Oracle JDK 1.6. If you plan to download this JDK and install it on all your hosts, or plan to use a different version of the JDK, skip this step and see Setup Options for more information. o At Enter advanced database configuration:
  • 43. 43 of 56  To use the default PostgreSQL database, named ambari, with the default username and password (ambari/bigdata), enter n.  To use an existing Oracle 11g r2 instance or to select your own database name, username and password for either database, enter y.  Select the database you want to use and provide any information required by the prompts, including hostname, port, Service Name or SID, username, and password. o Setup completes 5. Start the Ambari Server 1) To start the Ambari Server: o ambari-server start 2) To check the Ambari Server processes: o ps -ef | grep Ambari 3) To stop the Ambari Server: o ambari-server stop 6. Installing, Configuring and deploying cluster 1) Step 1: Point your browser to http://{main.install.hostname}:8080. 2) Step 2: Log in to the Ambari Server using the default username/password: admin/admin. 3) Step 3: At welcome screen, type a name for the cluster you want to create in the text box. No white spaces or special characters can be used in the name. Select version of hdp and click on next. 4) Step 4: At Install option: o Use the Target Hosts text box to enter your list of host names, one per line. You can use ranges inside brackets to indicate larger sets of hosts. For example, for host01.domain through host10.domain use host[01-10].domain o If you want to let Ambari automatically install the Ambari Agent on all your hosts using SSH, select Provide your SSH Private Key and either use the Choose File button in the Host Registration Information section to find the private key file that matches the public key you installed earlier on all your hosts or cut and paste the key into the text box manually. o Fill in the username for the SSH key you have selected. If you do not want to use root, you must provide the username for an account that can execute sudo without entering a password o If you do not want Ambari to automatically install the Ambari Agents, select Perform manual registration. 
See Appendix: Installing Ambari Agents Manually for more information. o Advanced Options
  • 44. 44 of 56 a) If you want to use a local software repository (for example, if your installation does not have access to the Internet), check Use a Local Software Repository. For more information on using a local repository see Optional: Configure the Local Repositories b) Click the Register and Confirm button to continue. 5) Step 5: Confirm hosts. If any host gets a warning, click "Click here to see the warnings" to see a list of what was checked and what caused the warning. On the same page you can access a Python script that can help you clear any issues you encounter and then run Rerun Checks. Python script for cleaning a host: python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py 6) When you are satisfied with the list of hosts, click Next. 7) Step 7: Choose services 8) Step 8: Assign masters 9) Step 9: Assign slaves and clients 10) Step 10: Customize Services o Add property in hbase custom_site.xml o hbase.data.umask.enable = true o Add the Nagios password and email address for notifications. 11) Step 11: Review and install. Nutch configuration on the Hadoop Master Node ● Download Nutch ○ wget http://www.eu.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz ● Untar the Nutch source tarball ○ tar -vxf apache-nutch-2.2.1-src.tar.gz ● Export the Nutch paths ○ export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1 ○ export PATH=$PATH:$NUTCH_HOME/runtime/deploy/bin ● Edit the files under $NUTCH_HOME/conf as below ○ Add these properties to the nutch-site.xml file, setting the Gora data store to the Cassandra backend: <property> <name>storage.data.store.class</name> <value>org.apache.gora.cassandra.store.CassandraStore</value> </property> <property> <name>http.agent.name</name>
  • 45. 45 of 56 <value>GeliyooBot</value> </property> <property> <name>http.robots.agents</name> <value>GeliyooBot.*</value> </property> ○ Add these properties to the gora.properties file gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore gora.cassandrastore.servers=localhost:9160 ○ Add this dependency in $NUTCH_HOME/ivy/ivy.xml <dependency org="org.apache.gora" name="gora-cassandra" rev="0.3" conf="*->default" /> ● Go to the Nutch installation folder ($NUTCH_HOME) and run: ant clean ant runtime
  • 46. 46 of 56 Cassandra Configuration ● Download the DataStax Community tarball curl -L http://downloads.datastax.com/community/dsc.tar.gz | tar xz ● Go to the install directory: ○ $ cd dsc-cassandra-2.0.x ● Start the Cassandra server ○ $ sudo bin/cassandra ● Verify that DataStax Community is running. From the install directory: ○ $ bin/nodetool status Install GUI Client for Cassandra ● Download WSO2 Carbon Server ○ wget https://www.dropbox.com/s/m00uodj1ymkpdzb/wso2carbon-4.0.0-SNAPSHOT.zip ● Extract the zip file ● Start WSO2 Carbon Server ○ Go to $WSO2_HOME/bin ○ sh wso2server.sh -Ddisable.cassandra.server.startup=true and log in with the default username and password (admin, admin). List the key spaces.
  • 47. 47 of 56 ElasticSearch Configuration ● Download ElasticSearch ○ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.19.4.tar.gz ● Untar the ElasticSearch file ○ tar -vxf elasticsearch-0.19.4.tar.gz ● Start the ElasticSearch server in the foreground ○ bin/elasticsearch -f ● User Interface of ElasticSearch ○ Index information ○ Index Data
  • 48. 48 of 56
  • 49. 49 of 56 Run Nutch jobs on the Hadoop Master Node ● Create a directory in HDFS to upload the seed URLs (urls is the HDFS directory name). ○ hadoop dfs -mkdir urls ● Create a text file with the seed URLs for the crawl and upload it. ○ hadoop dfs -put seed.txt urls ● Run the inject job ○ nutch inject urls ● Run the generate job ○ nutch generate -topN N ● Run the fetch job ○ nutch fetch -all ● Run the parse job ○ nutch parse -all ● Run the updatedb job ○ nutch updatedb Conclusion After configuring all of these frameworks, we achieved basic crawling and basic text search. We are now ready to crawl billions of URLs and index them. After indexing this content into ElasticSearch, we can retrieve text results in JSON format. We used curl to fetch data from ElasticSearch; passing query parameters via curl returns a JSON result containing fields such as content, url, content type, and digest.
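The job sequence above repeats once per crawl depth, as described in the crawling section. A minimal sketch of that loop as a shell script follows; the depth count, the topN value, and the use of `-all` batches are assumptions for illustration, not the project's actual settings:

```shell
#!/bin/sh
# Repeat the Nutch crawl cycle for a fixed number of depths.
# NUTCH can be overridden (e.g. NUTCH=echo for a dry run); the depth
# and topN values shown in the usage note are illustrative only.
crawl_cycle() {
    depth="$1"
    topn="$2"
    i=1
    while [ "$i" -le "$depth" ]; do
        ${NUTCH:-nutch} generate -topN "$topn"   # promote outlinks to the fetch list
        ${NUTCH:-nutch} fetch -all               # download the generated batch
        ${NUTCH:-nutch} parse -all               # extract text and outlinks
        ${NUTCH:-nutch} updatedb                 # feed new URLs into the next cycle
        i=$((i + 1))
    done
}

# One-time seeding, then (for example) two rounds of depth:
#   nutch inject urls
#   crawl_cycle 2 1000
```

Injection stays outside the loop, matching the note earlier in the document that inject runs only once while the other operations repeat until the desired depth.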
  • 50. 50 of 56 4. Development Objective The main goal of this development phase is to implement an intermediate API that communicates with the Geliyoo search engine. When a user submits a query through the Geliyoo search UI, the query is passed to the GeliyooSearchApi. Based on this query, the API fetches results from ElasticSearch and returns them to the user.
  • 51. 51 of 56 Development Completed Prototype We focused on the user side of the application, i.e. the basic search engine, and hence decided to work on its prototype first. For that we made the following two prototypes of this web application: ● Search Page ● Result Page We will make more prototypes as development continues. Implementation: The implementation has four main parts: 1) configuration of the selected components, which we covered in the previous topic, 2) the Web API development, 3) the extension of these components to allow extended searching (i.e. searches these components do not provide out of the box), and 4) the web application. The Web Application: With the above prototypes, we implemented basic search as below.
  • 52. 52 of 56 To search for any word or text, the user enters that text in the search box as shown in the image. The search starts as soon as the user enters a single character in the search box. Search results are displayed as in the image above. Note the following in the image: ● Titles: Titles are links pointing to the URLs containing information about the searched word. ● Highlighted words: The words or text searched by the user are highlighted in the results. ● Pagination: At the bottom of the screen there is pagination; each page shows 10 results, so the user can easily navigate between pages for the desired results without much scrolling on a single page. ● Search Box: At the top of the screen there is a search box. The user can edit the text he searched for or search for a new word, so there is no need to go back to the search page for a new search.
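The 10-results-per-page pagination described above maps naturally onto the from/size parameters of an ElasticSearch search request. The page-offset arithmetic can be sketched as follows (the parameter names match the ES search API; the function itself is an illustration, not project code):

```shell
#!/bin/sh
# Translate a 1-based page number into the from/size pair an
# ElasticSearch search request would use, assuming 10 hits per page.
page_to_from_size() {
    page="$1"
    size=10
    from=$(( (page - 1) * size ))
    printf '"from": %d, "size": %d\n' "$from" "$size"
}

page_to_from_size 3   # → "from": 20, "size": 10
```

Page 1 therefore starts at offset 0, page 2 at offset 10, and so on, which is exactly how the result page's pagination links would parameterize the search call.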
  • 53. 53 of 56 If there is no information for the word or text searched by the user, we display a message as shown above. REST API When a user makes a request for crawling or indexing, it goes to the web server API for crawling, which is deployed on another server; that server hosts the Hadoop master and connects to all the Hadoop slaves. Nutch manages this Hadoop cluster by submitting the crawling and indexing jobs to it. This part still remains, and will be covered in future development.
  • 54. 54 of 56 Currently we are working on the part of the overall architecture shown in the figure above. When a user submits a search query, the web application calls the RESTful API. The API is responsible for web search based on the query: it builds the query and calls the ElasticSearch cluster for results using the Jest client. For web search, the user enters keywords as a query and receives a list of web URLs, each with a snippet of the site's content in which the keywords are highlighted. Each query is stored in the Cassandra database for the semantic search functionality. We are also working on image search by keyword. For this we need to crawl and index all web images; Apache Nutch 2.2 is unable to crawl images, so we tried adding several parser plugins, went through the Tika parser, and modified the crawl code to fetch images and parse them for indexing. Jest is a Java HTTP REST client for ElasticSearch. As mentioned in the section above, ElasticSearch is an open-source (Apache 2), distributed, RESTful search engine built on top of Apache Lucene. ElasticSearch already has a Java API, which it also uses internally, but Jest fills a gap: it is the missing client for the ElasticSearch HTTP REST interface. The Jest client requests results from the ElasticSearch cluster; the JSON it returns is forwarded to the web application that initiated the request. As its result, the search API returns the total number of pages, the total result count, and the list of all matching web sites with their content and titles. We deployed the web application and the web API on different servers to balance the request load. Future Development Content Searching Currently we are working on the basic search functionality. For that we crawl websites and save their text contents. When the user searches for any text or word, we use these contents to get results, so results are limited to the text contents of websites.
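To illustrate the shape of the JSON result described above, here is a hedged sketch: the sample response below is invented for illustration (the field names follow ElasticSearch's usual hits structure, but the URLs, titles, and totals are not real project data), and the url fields are pulled out with standard text tools:

```shell
#!/bin/sh
# A fabricated sample of an ElasticSearch-style hits response; the
# URLs and totals are illustrative only, not real crawl results.
sample_response='{"hits":{"total":2,"hits":[
  {"_source":{"url":"http://example.org/a","title":"Page A"}},
  {"_source":{"url":"http://example.org/b","title":"Page B"}}]}}'

# Extract the url fields from the response with grep/sed.
extract_urls() {
    printf '%s\n' "$1" | grep -o '"url":"[^"]*"' | sed 's/"url":"\(.*\)"/\1/'
}

extract_urls "$sample_response"
```

In the real system the Jest client would do this field extraction in Java; the point here is only the response shape the API consumes (a total plus a list of per-hit sources carrying url, title, and content).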
No doubt, users are already able to search any text or word in any language with this functionality. We are planning that once we fully achieve this basic search functionality, we will work on functionality that makes all the information of a piece of content searchable, such as its name, text, and metadata, and that also allows the user to run specific searches in the following categories.  Image search  Video search  News search  Sports search  Audio search
  • 55. 55 of 56  Forum search  Blog search  Wiki search  Pdf search  Ebay, Amazon, Twitter, iTunes search Using these functionalities, user can make more specific search and get desired results more faster. For that we will crawl whole websites with images, videos, news etc and save their information like name, url, metadata and content-type. Now, when user will search for any text or word, we will use this information to get search results. It means, because of these functionalities search will be possible each and every information of a content, and user will get best results. When user wants to make specific search, we will make this kind of search using content-type of saved information to get the results. For e.g. user wants to search images only, we will use content-type equal to image and go through our saved information for images only. It is important to note that we will search for the images but we will search the entered text in images' name, url, metadata etc for results. Semantic Search We will use "Semantic search" concept to improve our search functionality so that user will get desired result more faster. Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace. Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. We will save users url, country, browser, time etc information and text for which user searching. When search for any information we will use his passed searches and history to get more user specific results.
  • 56. 56 of 56 Prototypes Prototypes for some of the future development may look as below: Image search Video search