Nov 2011 HUG: Blur - Lucene on Hadoop


Published in: Technology

  1. Blur - Lucene on Hadoop
     Aaron McCurry
  2. Aaron McCurry
     • Programming Java for 10+ years
     • Working with BigData for 3 years
     • Using Lucene for 3 years
     • Using Hadoop for 2 years
     • B.S. in Computer Engineering from Virginia Tech
     • Sr. Software Engineer at Near Infinity Corporation
     • Developing Blur for 1.5 years
  3. Agenda
     • Definition of Blur and the key benefits
     • Blur architecture and its component parts
     • Query capabilities
     • Challenges that Blur had to overcome
  4. What is Blur?
     A distributed search capability built on top of Hadoop and Lucene
     • Built specifically for Big Data
     • Scalability, redundancy, and performance baked in from the start
     • Leverages all the goodness built into the Hadoop and Lucene stack
     Blur uses the Apache 2.0 license
  5. Why should I use Blur?
     Benefit            Description
     Scalable           Store, index and search massive amounts of data
     Fast               Performance similar to standard Lucene implementation
     Durable            Stores data updates in write ahead log (WAL) in case of node failure
     Failover           Auto-detects node failure and re-assigns indexes to surviving nodes
     Query flexibility  Provides all the standard Lucene queries plus joining data
  6. Blur Data Model
     • Blur stores information in Tables that contain Rows
     • Rows contain Records
     • Records exist in column families (used for grouping information)
     • Records contain Columns
     • Columns contain a name / value pairing (stored as Strings)
     NOTE: Columns with the same name can exist in the same Record
  7. Blur Data Model in JSON
     {
       rowid : "",
       records : [
         {
           recordid : "324182347",
           family : "messages",
           columns : [
             { name : "to", value : "" },
             { name : "to", value : "" },
             { name : "subject", value : "important!" },
             { name : "body", value : "This is a very important email...." }
           ]
         },
         {
           recordid : "234123412",
           family : "contacts",
           columns : [
             { name : "name", value : "Jon Doe" },
             { name : "email", value : "" }
           ]
         }
       ]
     }
  8. Blur Data Model
     Table
       RowID =
         RecordID = 324182347, family = messages
           to:
           to:
           subject: important!
           body: This is a...
         RecordID = 234123412, family = contacts
           name: Jon Doe
           email:
       RowID = ...
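
The data model on the slides above can be sketched in a few lines of Python. This is only an illustration of the Table / Row / Record / Column nesting, not Blur's actual classes: the class names, the `values` and `family` helpers, the row id `row-1`, and the `example.com` addresses are all hypothetical (the slide's own row id and addresses did not survive in the transcript). Columns are kept in a list so that the same column name can appear more than once in a Record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    value: str  # the slides say values are stored as Strings


@dataclass
class Record:
    record_id: str
    family: str  # column family, used for grouping information
    # A list, not a dict: the same column name may appear more than once.
    columns: List[Column] = field(default_factory=list)

    def values(self, name: str) -> List[str]:
        """Return every value stored under the given column name."""
        return [c.value for c in self.columns if c.name == name]


@dataclass
class Row:
    row_id: str
    records: List[Record] = field(default_factory=list)

    def family(self, family: str) -> List[Record]:
        """Return the records that belong to one column family."""
        return [r for r in self.records if r.family == family]


def example_row() -> Row:
    """Build a row shaped like the JSON slide (ids and addresses are made up)."""
    message = Record("324182347", "messages", [
        Column("to", "a@example.com"),  # hypothetical address
        Column("to", "b@example.com"),  # second column with the same name
        Column("subject", "important!"),
    ])
    contact = Record("234123412", "contacts", [Column("name", "Jon Doe")])
    return Row("row-1", [message, contact])
```

Calling `example_row().family("messages")[0].values("to")` returns both recipients, which is the case the NOTE on slide 6 describes.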
  9. Blur Architecture
     Component   Purpose
     Lucene      Perform actual search duties
     HDFS        Store Lucene indexes
     MapReduce   Uses Hadoop's MR to index data
     Thrift      All inter-process communications
     ZooKeeper   Manages system state and stores metadata
  10. Blur uses Two Types of Server Processes
      • Controller: orchestrates communication between all of the shard servers for queries. Uses: HDFS, Thrift and ZooKeeper
      • Shard server: responsible for performing searches for each shard and returning results to the controller. Uses: same as controller plus Lucene
  11. Blur Architecture in Practice
  12. Why Lucene for Search?
      Key        Benefit
      Features   Stable, performant with robust features like NRT, GIS, new Analyzers, etc.
      Adoption   Seems like everyone is using it -- Lucene directly, Solr, Elastic Search
      Community  Very active open source project
      API        Easy to extend (analyzers, directories, etc.)
      Future     Levenshtein Automaton (4.0), Flexible indexing (4.0)
  13. HDFS for Storage
      • Index data is stored in HDFS
      • Data updates are written to a Write Ahead Log (WAL) before being indexed into the appropriate Lucene index
      • Sync is called on the WAL before the call returns for durability (can be disabled during mutation calls)
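
The write-ahead-log ordering described on this slide (log first, sync, then index) can be sketched as follows. This is a minimal illustration under stated assumptions, not Blur's code: a local file and `os.fsync` stand in for the HDFS log and its sync call, a plain dict stands in for the Lucene index, and the `WalIndex` class, its `sync` flag, and the `demo` helper are hypothetical names.

```python
import json
import os
import tempfile


class WalIndex:
    """Toy index that logs every update before applying it."""

    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.docs = {}  # stand-in for a Lucene index
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        """Rebuild the in-memory state from the log after a restart."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.docs[entry["id"]] = entry["doc"]

    def update(self, doc_id: str, doc: dict, sync: bool = True) -> None:
        # 1. Append the update to the log first.
        self._wal.write(json.dumps({"id": doc_id, "doc": doc}) + "\n")
        if sync:
            # 2. Force the entry to stable storage before the call returns.
            self._wal.flush()
            os.fsync(self._wal.fileno())
        # 3. Only then apply it to the index.
        self.docs[doc_id] = doc

    def close(self) -> None:
        self._wal.close()


def demo() -> dict:
    """Write one update, drop the index, and recover it from the log."""
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    index = WalIndex(path)
    index.update("1", {"subject": "important!"})
    index.close()  # the in-memory index is now gone
    recovered = WalIndex(path)  # replays the log
    docs = dict(recovered.docs)
    recovered.close()
    return docs
```

Passing `sync=False` mirrors the slide's remark that the sync can be disabled per mutation call: the update returns sooner, but an entry that has not reached stable storage can be lost if the node fails.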
  14. ZooKeeper for Meta Data and State
      • Shard server state is stored in ZooKeeper
      • Table meta data along with Lucene writer locks are stored under the table node
      • All online controllers are also registered in ZooKeeper
  15. Blur Query
      Blur uses the standard Lucene query syntax:
        +doe)
      Blur also allows for cross column family intersection queries:
        +doe)
      This in effect gives you a join-like query, because messages and contacts are stored in different Records. Example: find messages where the message was sent to Joe and Doe and where the user has a contact named Bill.
      Blur supports any Lucene query (limited to Java clients only)
  16. Challenges that Blur Solves
      • Reindexing of Datasets
      • Random Access Writes with Lucene
      • Random Access Latency with HDFS
      • JVM GC - LUCENE-2205 Lucene low memory patch
  17. Reindexing of Datasets
      Problem: To be able to reindex all of the data whenever needed, as fast as possible, without affecting the performance of existing online datasets.
  18. Reindexing of Datasets
      Solution: MapReduce to the rescue. Blur uses Hadoop to build the indexes and deliver them to HDFS. This allows the very CPU and I/O intensive computations to occur on the Hadoop cluster, where you probably have the most computing resources.
      The delivery of the indexes can be controlled to reduce the I/O impact on the running systems. The indexes can also be delivered to different HDFS instances for total separation of I/O.
  19. Random Access Writes w/Lucene
      Problem: Writes in Lucene and in HDFS share a common trait: once a file is closed for writing it is never modified; it is immutable. However, Lucene requires random access writes while the file is open for writes, and HDFS cannot natively support this type of operation.
  20. Random Access Writes w/Lucene
      Solution: A logically layered file whose writes are only appends. While a file is open for writes and a seek is called (one that actually needs to move to a new position), a new logical block is created that stores the logical position of the data, the real position of the data on disk, and the length of the data.
      When the file is opened for reading, the meta data about the logical blocks is read into memory and used during reads to calculate the real position of the data requested.
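
The layered-file idea on this slide can be sketched as below. It is a toy version under stated assumptions, not Blur's file format: an in-memory `io.BytesIO` stands in for an HDFS output stream, the block metadata is returned from `close()` rather than stored in the file, later blocks are assumed to take precedence over earlier ones where they overlap, and the `AppendOnlyWriter` and `LayeredReader` names are hypothetical.

```python
import io
from typing import List, Tuple

Block = Tuple[int, int, int]  # (logical position, real position, length)


class AppendOnlyWriter:
    """Accepts seeks, but only ever appends to the underlying stream."""

    def __init__(self, out: io.BytesIO):
        self._out = out  # assumed to be empty and never seeked
        self._logical = 0  # where the caller thinks it is writing
        self._real = 0  # bytes actually appended so far
        self._start_logical = 0  # start of the current logical block
        self._start_real = 0
        self.blocks: List[Block] = []

    def _end_block(self) -> None:
        length = self._real - self._start_real
        if length > 0:
            self.blocks.append((self._start_logical, self._start_real, length))
        self._start_logical = self._logical
        self._start_real = self._real

    def write(self, data: bytes) -> None:
        self._out.write(data)  # always an append
        self._logical += len(data)
        self._real += len(data)

    def seek(self, pos: int) -> None:
        if pos == self._logical:
            return  # no real move, so keep the current block
        self._end_block()
        self._logical = pos
        self._start_logical = pos  # the new block starts at the seek target

    def close(self) -> List[Block]:
        self._end_block()
        return self.blocks


class LayeredReader:
    """Maps logical reads onto real positions using the block metadata."""

    def __init__(self, data: bytes, blocks: List[Block]):
        self._data = data
        self._blocks = blocks

    def read(self, pos: int, length: int) -> bytes:
        result = bytearray(length)  # unwritten ranges read as zero bytes
        # Apply blocks in write order so later writes win on overlap.
        for logical, real, size in self._blocks:
            start = max(pos, logical)
            end = min(pos + length, logical + size)
            if start < end:
                src = real + (start - logical)
                result[start - pos:end - pos] = self._data[src:src + (end - start)]
        return bytes(result)
```

For example, writing a placeholder header `0000` followed by `body`, then seeking back to 0 and writing `0004`, leaves `0000body0004` in the stream, while a read of logical bytes 0 to 8 yields `0004body`.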
  21. Random Access Latency w/HDFS
      Problem: HDFS is not very good at random access reads. Great improvements have been made and more are coming in 0.23, but it still won't be enough to support low latency Lucene accesses. Lucene relies on file system caching or MMAP of index files for performance when executing queries on a single machine with a normal OS file system.
  22. Random Access Latency w/HDFS
      Solution: Add a Lucene Directory level block cache to store the hot blocks from the files that Lucene uses for searching. A concurrent LRU map stores the location of the blocks in pre-allocated slabs of memory. The slabs of memory are allocated at startup and in essence are used in place of an OS filesystem cache.
      A side benefit of this feature is that writing new data to the HDFS instance does not unload the hot blocks from memory, for example when a new table of data is being written to HDFS from a MapReduce job.
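
A block cache along these lines can be sketched as follows. This is a simplified illustration, not Blur's implementation: one `bytearray` stands in for the pre-allocated slabs, an `OrderedDict` guarded by a single lock stands in for the concurrent LRU map, and the `BlockCache` class, its `get` method, and the `load` callback (which would read a block from HDFS) are hypothetical names.

```python
import threading
from collections import OrderedDict
from typing import Callable, Tuple

Key = Tuple[str, int]  # (file name, block number)


class BlockCache:
    """LRU cache of fixed-size blocks stored in one pre-allocated slab."""

    def __init__(self, block_size: int, num_blocks: int):
        self.block_size = block_size
        # Allocated once up front and reused for the cache's lifetime.
        self._slab = bytearray(block_size * num_blocks)
        self._free = list(range(num_blocks))  # unused slot numbers
        self._slots: "OrderedDict[Key, int]" = OrderedDict()  # key -> slot, LRU first
        self._lock = threading.Lock()

    def _read(self, slot: int) -> bytes:
        start = slot * self.block_size
        return bytes(self._slab[start:start + self.block_size])

    def get(self, key: Key, load: Callable[[Key], bytes]) -> bytes:
        # Holding one lock across the load keeps the sketch simple;
        # a real cache would avoid blocking other readers on a miss.
        with self._lock:
            slot = self._slots.get(key)
            if slot is not None:
                self._slots.move_to_end(key)  # mark as most recently used
                return self._read(slot)
            data = load(key)  # miss: fetch the block from the slow store
            if len(data) != self.block_size:
                raise ValueError("blocks must be exactly block_size bytes")
            if self._free:
                slot = self._free.pop()
            else:
                # Evict the least recently used block and reuse its slot.
                _, slot = self._slots.popitem(last=False)
            start = slot * self.block_size
            self._slab[start:start + self.block_size] = data
            self._slots[key] = slot
            return self._read(slot)
```

Because the map only records which slot of the slab holds each block, eviction never allocates or frees memory; it just overwrites a slot. The slab is also private to the cache, which is consistent with the slide's point that unrelated writes to HDFS do not push hot blocks out.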
  23. Future
      Blur is a new project, which means that there is a lot of work that needs to be done to make sure it is ready for "web scale", but I believe it can be done.
      Future Tasks:
      • More Performance Tuning
      • More Tests
      • More Documentation
      • Native GIS Queries
      • Incremental Updates from MapReduce
      • Index Splits
  24. Questions?
      Blur:
      Blur 0.1.rc1 now available
      Blog: