Building a distributed search system with Hadoop and Lucene
 

Presentation Transcript

    • Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi, Academic Year 2012-2013.
    • Outline • Big Data Problem • Map and Reduce approach: Apache Hadoop • Distributing a Lucene index using Hadoop • Measuring Performance • Conclusion
    • “Big Data”. This work analyzes the technological challenge of managing and administering information whose global volume is on the order of terabytes (10^12 bytes) or petabytes (10^15 bytes) and whose growth rate is exponential. • Facebook processes 2.5 billion content items per day. • YouTube: 72 hours of video uploaded per minute. • Twitter: 50 million tweets per day.
    • Multitier architecture vs cloud computing. [Diagram: a classic multitier architecture (client, front-end servers, database servers, all real-time processing) contrasted with a cloud architecture (client, front-end servers, cloud back end, real-time processing plus asynchronous data analysis).]
    • Apache Hadoop architecture. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.
    • HDFS: the distributed file system. Files are stored as sets of (large) blocks; the default block size is 64 MB (the ext4 default is 4 KB). Blocks are replicated for durability and availability. The namespace is managed by a single name node, while actual data transfer happens directly between the client and the data nodes (pros and cons of this decision?). [Diagram: the name node maps foo.txt to blocks 3, 9, 6 and bar.data to blocks 2, 4; the client asks for block #2 of foo.txt, learns it is block 9, and reads it directly from one of the data nodes holding a replica.]
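To make the read path concrete, here is a minimal sketch of a client reading a file through the HDFS Java API: the name node resolves block locations, and the bytes stream directly from the data nodes. The cluster address and file path are illustrative, not taken from the slides.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name node address; the client contacts it only for
        // block locations, then streams data straight from the data nodes.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/foo.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```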
    • Map and Reduce. The computation takes a set of input key/value pairs and produces a set of output key/value pairs.
    • Recap: the Map Reduce approach. [Diagram: input data is split across mappers, which emit intermediate (key, value) pairs; "the shuffle" routes all pairs with the same key to one reducer, and the reducers write the output data.]
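The canonical instance of this model is word count, which the performance section later measures. A minimal sketch against the Hadoop Java API (input/output paths come from the command line; the job wiring is the standard pattern, not code from the slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: (offset, line) -> (word, 1) for every word in the line.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, count); the shuffle has already
    // grouped all pairs for the same word onto one reducer.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```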
    • Map and Reduce: where it applies. • Distributed "grep" • Count of URL access frequency • Reverse web-link graph • Term-vector per host • Reducing an n-level graph into a redundant hash table (a sketch of the grep pattern follows below).
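The first pattern in the list reduces to a map-only job: each mapper emits just the lines matching a pattern, and no reduce phase is needed. A minimal sketch; the configuration key and default pattern are hypothetical.

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Distributed grep: emit only the input lines matching the configured pattern.
public class GrepMapper extends Mapper<Object, Text, Text, NullWritable> {
    private Pattern pattern;

    @Override
    protected void setup(Context ctx) {
        // "grep.pattern" is an illustrative configuration key.
        pattern = Pattern.compile(ctx.getConfiguration().get("grep.pattern", "ERROR"));
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        if (pattern.matcher(value.toString()).find()) {
            ctx.write(value, NullWritable.get());
        }
    }
}
```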
    • Implementation: distributing a Lucene index using Map and Reduce. The scope of the implementation is to: 1. populate a distributed Lucene index using the HDFS cluster; 2. distribute queries and retrieve results using Map and Reduce.
    • Apache Lucene: indexing. Apache Lucene is the de facto standard for textual search in the open source community. [Diagram: a Lucene Document is a set of Field(type) -> Value pairs.]
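A minimal indexing sketch against the Lucene Java API, mirroring the Document/Field structure on the slide. The index path, field names, and content are illustrative, and the constructors shown match Lucene 5+ rather than the 3.x/4.x releases contemporary with the talk.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Indexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/index")); // illustrative path
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // A document is a set of typed field -> value pairs.
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));  // stored, not analyzed
            doc.add(new TextField("body", "full text to analyze and search", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```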
    • Apache Lucene: searching. In Lucene each document is a vector; a measure of relevance is the value of the angle θ between the document vector and the query vector.
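The relevance measure the slide describes is the standard vector-space cosine: a document d is more relevant to query q as cos θ = V(q)·V(d) / (|V(q)||V(d)|) approaches 1. A minimal search sketch over the index built above, with the same Lucene-version caveat as before:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class Searcher {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Parse a free-text query against the "body" field.
            Query query = new QueryParser("body", new StandardAnalyzer()).parse("text AND search");
            TopDocs hits = searcher.search(query, 10); // top 10 by relevance score
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(sd.score + " -> " + doc.get("id"));
            }
        }
    }
}
```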
    • Distributing Lucene indexes using Hadoop. Indexing: a Lucene indexer job reads the PDF document archive; the map phase creates and populates each index (Index 1, Index 2, Index 3) on the HDFS cluster, and there is no reduce phase. Searching: a Lucene search job takes a search filter (a list of Lucene restrictions); the map phase queries the indexes, and the reduce phase (map, sort, combine, reduce) merges and orders the result set (a sketch of this job's shape follows below).
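The transcript does not include the job's source code, so the following is only a sketch of the shape the slide describes: each map task searches one index shard and emits scored hits under a single key, and one reduce task merges and orders them. Every class name, key, and record format here is hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedSearch {

    // Map phase: each task receives the path of one Lucene index shard,
    // runs the configured query against it, and emits "score\tdocId" hits.
    public static class SearchMapper extends Mapper<Object, Text, Text, Text> {
        private static final Text ALL = new Text("results"); // single key -> single reducer

        @Override
        protected void map(Object key, Text shardPath, Context ctx)
                throws IOException, InterruptedException {
            String query = ctx.getConfiguration().get("search.query"); // illustrative key
            for (String hit : searchShard(shardPath.toString(), query)) {
                ctx.write(ALL, new Text(hit));
            }
        }

        private List<String> searchShard(String shard, String query) {
            // Placeholder for the Lucene search shown earlier, run against
            // a shard fetched from the HDFS cluster.
            return new ArrayList<>();
        }
    }

    // Reduce phase: merge the partial result sets and order them by score.
    public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> hits = new ArrayList<>();
            for (Text v : values) hits.add(v.toString());
            hits.sort((a, b) -> Double.compare(
                    Double.parseDouble(b.split("\t")[0]),
                    Double.parseDouble(a.split("\t")[0])));
            for (String h : hits) ctx.write(new Text(h), new Text(""));
        }
    }
}
```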
    • Measuring Performance. The entire execution time can be formally defined by the first formula on the slide, and a single Map (or Reduce) phase by the second, where α is the percentage of reduce tasks still ongoing after map phase completion. [The formulas appeared as images and are not reproduced in the transcript.]
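Since the formulas were lost in extraction, here is a hypothetical reconstruction, consistent with the definition of α and with standard MapReduce cost models, but not necessarily the author's exact notation: total job time is the submission overhead plus the map phase plus the fraction of the reduce phase that does not overlap the maps, and a single phase runs its tasks in waves over the available slots.

```latex
% Hypothetical reconstruction; the original formulas did not survive extraction.
T_{\mathrm{job}} = T_{\mathrm{submission}} + T_{\mathrm{map}} + \alpha \, T_{\mathrm{reduce}}

% A single Map (or Reduce) phase: ceil(tasks / slots) waves of mean task time.
T_{\mathrm{phase}} = \left\lceil \frac{N_{\mathrm{tasks}}}{N_{\mathrm{slots}}} \right\rceil \cdot \bar{t}_{\mathrm{task}}
```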
    • Measuring Performance.

| Data nodes | CPU of the nodes | RAM available | Name nodes | Number of files | Total bytes read | Job submission cost | Total job time |
|---|---|---|---|---|---|---|---|
| 2 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1330 | 2.043 GB | 1 min 5 sec | 24 min 35 sec |
| 3 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1330 | 2.043 GB | 1 min 21 sec | 12 min 10 sec |
| 4 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1330 | 2.043 GB | 1 min 40 sec | 8 min 22 sec |
| 1 (no Hadoop) | Intel i7 CPU 2.67 GHz | 4 GB | 0 | 1330 | 2.043 GB | 0 | 10 min 11 sec |

With 4 or more data nodes the Hadoop infrastructure setup cost is compensated.
    • Measuring Performance (Word Count). Having a single big file speeds Hadoop up considerably: performance is determined not so much by the quantity of data as by how many splits are added to the HDFS.

| Data nodes | CPU of the nodes | RAM available | Name nodes | Number of files | Total bytes read | Total job time |
|---|---|---|---|---|---|---|
| 3 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1 | 942 MB | 3 min 18 sec |
| 4 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1 | 942 MB | 2 min 17 sec |
| 1 (no Hadoop) | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1 | 942 MB | 4 min 27 sec |
    • Job Detail Page. [Screenshot: the Hadoop job detail page, showing the tasks queue and the tasks currently running.]
    • Conclusion.
      What: • Analysis of the current status of open source technologies • Analysis of the potential applications for the web • Implemented a full working Hadoop architecture • Designed a web portal based on that architecture.
      Objectives: • Explore the Map and Reduce approach to analyze unstructured data • Measure performance and understand the Apache Hadoop framework.
      Outcomes: • Setup of the entire architecture in my company environment (Innovation Engineering) • Main benefits in the indexing phase • Poor impact on the search side (for standard query formats) • In general, major benefits when the HDFS is populated by a relatively small number of big (GB-scale) files.