Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana
Transcript

  • 1. Searching Information Inside Hadoop Platform
Abinasha Karana, Director-Technology, Bizosys Technologies Pvt Ltd.
abinash@bizosys.com | www.bizosys.com
  • 2. To search a large dataset inside HDFS and HBase, at Bizosys we started with Map-Reduce and Lucene/Solr.
  • 3. Map-Reduce: what didn't work for us. Results did not come back in a mouse click.
  • 4. Lucene/Solr: what didn't work for us. It required vertical scaling, with manual sharding and subsequent resharding as the data grew.
  • 5. What we did: we built a new search engine for the Hadoop platform (HDFS and HBase).
  • 6. In the next few slides you will hear about my learning from designing, developing, and benchmarking a distributed, real-time search engine whose index is stored in and served out of HBase.
  • 7. Key learning:
    - Using SSD is a design decision.
    - Methods to reduce HBase table storage size.
    - Serving a request without accessing ALL region servers.
    - Methods to move processing near the data.
    - Byte block caching to lower network and I/O trips to HBase.
    - Configuration to balance network vs. CPU vs. I/O vs. memory.
  • 8. Using SSD is a design decision (1). SSD improved HSearch response time by 66% over SATA; however, SSD is costlier. In the HBase table schema design we considered "data access frequency", "data size", and "desired response time" for selective SSD deployment.
  • 9. Our SSD-friendly schema design. Access patterns: the Keyword table reads all matches for a query; the Document table reads ~10 docs per query.
    - Keyword + Document in 1 table: SSD deployment is all or none.
    - Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.
  • 10. Methods to reduce HBase table storage size (2). Storing a 4-byte cell requires >27 bytes in HBase. KeyValue layout:
    - Key length: 4 bytes
    - Value length: 4 bytes
    - Row length: 2 bytes
    - Row bytes: variable
    - Family length: 1 byte
    - Family bytes: variable
    - Qualifier bytes: variable
    - Timestamp: 8 bytes
    - Key type: 1 byte
    - Value bytes: variable
  • 11. ... and cut storage to 1/3rd:
    - Stored large cell values by merging cells.
    - Reduced the family name to 1 character.
    - Reduced the qualifier name to 1 character.
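The overhead from the KeyValue layout above can be tallied in a few lines. A minimal sketch in plain Java (no HBase dependency; the class and method names are illustrative) that adds up the fixed fields and shows why shrinking family and qualifier names matters:

```java
public class KeyValueSize {
    // Fixed-size fields of an HBase KeyValue, per the layout on slide 10:
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1)
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1; // 20 bytes

    static int cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // A 4-byte value with a 1-byte row key, 1-char family, 1-char qualifier:
        System.out.println(cellSize(1, 1, 1, 4));   // 27 bytes to store 4 bytes of data
        // The same value with a 10-char family and 10-char qualifier:
        System.out.println(cellSize(1, 10, 10, 4)); // 45 bytes
    }
}
```

With 1-character family and qualifier names the overhead is already the >27-byte minimum the slide quotes; every extra character in those names is paid again on every single cell.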
  • 12. Serving a request without accessing ALL region servers (3). Consider a 100-node HBase cluster where a single search request needs to access all of them: bad design, clogged network, no scaling.
  • 13. And our solution: the index table was divided on column family into separate tables. [Diagram: one table with Family A and Family B spread over row ranges 0-1M through 4-5M on machines 1-5; scanning "Family A" hits all 5 machines. After the split into Table A and Table B, scanning Table A hits only 3 machines.]
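The effect of the split can be illustrated with a toy region map. A hypothetical sketch in plain Java (names and numbers invented for illustration, not HSearch code) that counts how many region servers a scan touches, given each region's start key:

```java
import java.util.List;

public class RegionsTouched {
    /** Counts regions whose [start, nextStart) range overlaps [scanStart, scanEnd). */
    static int regionsTouched(List<Long> regionStarts, long scanStart, long scanEnd) {
        int touched = 0;
        for (int i = 0; i < regionStarts.size(); i++) {
            long regStart = regionStarts.get(i);
            long regEnd = (i + 1 < regionStarts.size())
                    ? regionStarts.get(i + 1) : Long.MAX_VALUE;
            if (regStart < scanEnd && scanStart < regEnd) touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        // Combined table: 5 regions of 1M rows each, one per machine.
        List<Long> combined = List.of(0L, 1_000_000L, 2_000_000L, 3_000_000L, 4_000_000L);
        // Scanning Family A inside the combined table touches all 5 machines.
        System.out.println(regionsTouched(combined, 0, 5_000_000L)); // 5
        // Family A split out as its own table fits in 3 regions on 3 machines.
        List<Long> tableA = List.of(0L, 1_000_000L, 2_000_000L);
        System.out.println(regionsTouched(tableA, 0, 3_000_000L)); // 3
    }
}
```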
  • 14. Methods to move processing near the data (4). Sent only filtered rows over the network, e.g. the rows matched for a keyword:

    public class TermFilter implements Filter {
        public ReturnCode filterKeyValue(KeyValue kv) {
            boolean isMatched = isFound(kv);
            if (isMatched) return ReturnCode.INCLUDE;
            return ReturnCode.NEXT_ROW;
        }
        ...
    }

  • 15. Sent only the relevant fields of a row, and only the relevant section of a field, over the network, e.g. computing the best-matching section from within a document for a given query:

    public class DocFilter implements Filter {
        public void filterRow(List<KeyValue> kvL) {
            byte[] val = extractNeededPiece(kvL);
            kvL.clear();
            kvL.add(new KeyValue(row, fam, qual, val));
        }
        ...
    }
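The filters above run inside the region server, so only matches cross the network. A self-contained sketch of that idea, assuming a toy stand-in for HBase cells (no HBase dependency; `Cell` and `termFilter` are invented names):

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSketch {
    // A toy stand-in for an HBase KeyValue: a row key and a value.
    record Cell(String row, String value) {}

    /** Server-side filtering: only matching rows would be shipped to the client. */
    static List<Cell> termFilter(List<Cell> region, String term) {
        return region.stream()
                .filter(c -> c.value().contains(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Cell> region = List.of(
                new Cell("doc1", "hadoop search engine"),
                new Cell("doc2", "relational database"),
                new Cell("doc3", "hadoop hbase index"));
        // Only the two "hadoop" rows leave the region server.
        System.out.println(termFilter(region, "hadoop").size()); // 2
    }
}
```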
  • 16. Byte block caching to lower network and I/O trips to HBase (5).
    - Object caching: with a growing number of objects we encountered 'Out of Memory' exceptions.
    - HBase commit: frequent flushing to HBase introduced network and I/O latencies.
    - Converting objects to intermediate byte blocks increased record processing by 20x in one batch.
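The byte-block idea can be sketched without HBase: pack many small records into one length-prefixed byte array, so a batch becomes one cached block and one put instead of many objects and many flushes. A minimal illustration (class and method names are my own, not HSearch's):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteBlock {
    /** Packs many small records into one length-prefixed byte block. */
    static byte[] pack(List<String> records) {
        int size = 0;
        for (String r : records) size += 4 + r.getBytes(StandardCharsets.UTF_8).length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (String r : records) {
            byte[] b = r.getBytes(StandardCharsets.UTF_8);
            buf.putInt(b.length).put(b);
        }
        return buf.array();
    }

    /** Reads the records back out of the block. */
    static List<String> unpack(byte[] block) {
        List<String> out = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(block);
        while (buf.hasRemaining()) {
            byte[] b = new byte[buf.getInt()];
            buf.get(b);
            out.add(new String(b, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = List.of("hadoop", "hbase", "hsearch");
        byte[] block = pack(records);
        // One block, one commit, instead of three objects / three flushes.
        System.out.println(unpack(block).equals(records)); // true
    }
}
```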
  • 17. Configuration to balance network vs. CPU vs. I/O vs. memory (6), in a single machine:
    - Disk I/O: block caching, compression
    - Memory / CPU: aggressive GC
    - Network: IPC caching, compression
  • 18. ... and its settings:
    - Network: increased IPC cache limits (hbase.client.scanner.caching)
    - CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap")
    - I/O: LZO index compression (inbuilt Oberhumer LZO or Intel IPP native LZO)
    - Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server.
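The two HBase properties named above live in hbase-site.xml. A minimal sketch of such a fragment; the values shown are illustrative placeholders, not the tuning actually used in the talk:

```xml
<configuration>
  <!-- Network: rows fetched per RPC during scans (the old default was 1) -->
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1000</value>
  </property>
  <!-- Memory: fraction of the region-server heap given to the HFile block cache -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.4</value>
  </property>
</configuration>
```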
  • 19. ... and parallelized to multiple machines:
    - HTable.batch (Gets, Puts, Deletes)
    - ParallelHTable (Scans)
    - FUTURE: coprocessors (HBase 0.92 release)
    - Allocating appropriate resources: dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count
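The ParallelHTable idea (scans split across row ranges and run concurrently) can be sketched with plain Java threads. A simplified stand-in where each "scan" is a stub that just counts its range; the class name and structure are my own, not the HSearch implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScanSketch {
    /** Splits [0, totalRows) into ranges and "scans" each range in parallel. */
    static long parallelCount(long totalRows, int partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        long chunk = totalRows / partitions;
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            final long start = i * chunk;
            final long end = (i == partitions - 1) ? totalRows : start + chunk;
            // In HSearch each range would be a separate HBase Scan;
            // here the stub "scan" just counts the rows in its range.
            futures.add(pool.submit(() -> end - start));
        }
        long total = 0;
        for (Future<Long> f : futures) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelCount(5_000_000L, 5)); // 5000000
    }
}
```

The handler-count settings above matter here: each parallel range scan holds an RPC handler on the server side, so the client-side thread pool should be sized against hbase.regionserver.handler.count.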
  • 22. HSearch benchmarks on AWS:
    - Amazon Large instances: 7.5 GB memory, 11 machines, each with a single 7.5K RPM SATA drive.
    - 100 million Wikipedia pages, 270 GB in total, completely indexed (including common stopwords): 10 million pages repeated 10 times; total indexing time 5 hours.
    - Search query response time: 1.5 s for a regular word; a common word such as "hill" found 1.6 million matches, sorted in 7 seconds.
  • 23. www.sourceforge.net/bizosyshsearch
    - Apache-licensed (2 versions released).
    - Distributed, real-time search.
    - Supports XML documents with rich search syntax and various filtration criteria such as document type and field type.
  • 24. References:
    - Initial performance reports: Bizosys HSearch, a NoSQL search engine, featured in Intel Cloud Builders Success Stories (July 2010).
    - HSearch is currently in use at http://www.10screens.com
    - More on HSearch: http://www.bizosys.com/blog/ and http://bizosyshsearch.sourceforge.net/
    - SSD product technical specifications: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
