Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana
Transcript

  • 1. Searching Information Inside Hadoop Platform
    Abinasha Karana
    Director-Technology
    Bizosys Technologies Pvt Ltd.
    abinash@bizosys.com
    www.bizosys.com
  • 2. To search a large dataset inside HDFS and HBase,
    at Bizosys we started with Map-Reduce and Lucene/Solr
  • 3. What didn’t work for us: Map-Reduce
    Results were not available within a mouse click.
  • 4. What didn’t work for us: Lucene/Solr
    It required vertical scaling, with manual sharding and subsequent resharding as data grew.
  • 5. What we did
    We built a new search for the Hadoop Platform (HDFS and HBase).
  • 6. In the next few slides you will hear about
    my learnings from designing, developing and benchmarking a distributed, real-time search engine whose index is stored in and served out of HBase
  • 7. Key Learning
    Using SSD is a design decision.
    Methods to reduce HBase table storage size
    Serving a request without accessing ALL region servers
    Methods to move processing near the data
    Byte block caching to lower network and I/O trips to HBase
    Configuration to balance network vs. CPU vs. I/O vs. memory
  • 8. Using SSD is a design decision (Key Learning 1)
    SSD improved HSearch response time by 66% over SATA. However, SSD is costlier.
    In the HBase table schema design we considered “Data Access Frequency”, “Data Size” and “Desired Response Time” for selective SSD deployment.
  • 9. .. Our SSD Friendly Schema Design
    Keyword: reads all for a query.
    Document: reads 10 docs / query.
    Keyword + Document in 1 table: SSD deployment is all or none.
    Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.
  • 10. Methods to reduce HBase table storage size (Key Learning 2)
    Layout of a stored cell (KeyValue):
    Key Length: 4 bytes
    Value Length: 4 bytes
    Row Length: 2 bytes
    Row Bytes: variable
    Family Length: 1 byte
    Family Bytes: variable
    Qualifier Bytes: variable
    Timestamp: 8 bytes
    Key Type: 1 byte
    Value Bytes: 4 bytes (in this example)
    Storing a 4 byte cell requires >27 bytes in HBase.
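
The arithmetic behind that figure can be sketched as follows. This is a minimal illustration, assuming the standard HBase KeyValue field widths; the class and method names are hypothetical, and HFile block, index and compression effects are ignored.

```java
// Hypothetical helper: estimates the serialized size of one HBase cell (KeyValue).
final class KeyValueSize {
    // Fixed-width fields of a KeyValue, in bytes.
    static final int KEY_LENGTH = 4;
    static final int VALUE_LENGTH = 4;
    static final int ROW_LENGTH = 2;
    static final int FAMILY_LENGTH = 1;
    static final int TIMESTAMP = 8;
    static final int KEY_TYPE = 1;

    // Sum of the fixed-width fields: 20 bytes per cell before any payload.
    static final int FIXED_OVERHEAD =
        KEY_LENGTH + VALUE_LENGTH + ROW_LENGTH + FAMILY_LENGTH + TIMESTAMP + KEY_TYPE;

    // Total bytes for one cell given the variable-length parts.
    static int cellBytes(int rowBytes, int familyBytes, int qualifierBytes, int valueBytes) {
        return FIXED_OVERHEAD + rowBytes + familyBytes + qualifierBytes + valueBytes;
    }

    public static void main(String[] args) {
        // Smallest realistic case: 1-byte row, family and qualifier, 4-byte value.
        System.out.println(cellBytes(1, 1, 1, 4));   // prints 27
        // Longer names inflate every single cell.
        System.out.println(cellBytes(16, 8, 8, 4));  // prints 56
    }
}
```

With one-byte row, family and qualifier names the total is exactly 27 bytes, so any longer row key pushes a 4-byte value past that.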
  • 11. .. to 1/3rd
    Stored large cell values by merging cells
    Reduced the Family name to 1 Character
    Reduced the Qualifier name to 1 Character
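
One way to picture the cell-merging step is the sketch below. It is an assumption-laden illustration, not HSearch's actual encoding: the class name, the fixed 4-byte value width and the 20-byte fixed overhead (the sum of the fixed-width KeyValue fields) are chosen for the example only.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: pack many small values into one cell value to pay the key overhead once.
final class MergedCells {
    static final int FIXED_OVERHEAD = 20; // fixed-width KeyValue fields, in bytes

    // Pack 4-byte values back to back into one byte block, stored as a single cell value.
    static byte[] merge(int[] values) {
        ByteBuffer block = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            block.putInt(v);
        }
        return block.array();
    }

    // Read the i-th 4-byte value back out of a merged block.
    static int valueAt(byte[] block, int i) {
        return ByteBuffer.wrap(block).getInt(i * Integer.BYTES);
    }

    // Bytes needed when every value is stored as its own cell.
    static long separateCellsBytes(int count, int rowBytes, int familyBytes, int qualifierBytes) {
        return (long) count * (FIXED_OVERHEAD + rowBytes + familyBytes + qualifierBytes + Integer.BYTES);
    }

    // Bytes needed when all values share one cell.
    static long mergedCellBytes(int count, int rowBytes, int familyBytes, int qualifierBytes) {
        return FIXED_OVERHEAD + rowBytes + familyBytes + qualifierBytes + (long) count * Integer.BYTES;
    }
}
```

For 100 values with an 8-byte row key and 1-character family and qualifier, this model gives 3,400 bytes as separate cells against 430 bytes merged. The saving in a real table depends on key sizes, the number of values merged and compression.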
  • 12. Serving a request without accessing ALL region servers (Key Learning 3)
    Consider a 100-node HBase cluster where a single search request needs to access every node.
    Bad Design..
    Clogged Network..
    No scaling
  • 13. And our solution… (Key Learning 3)
    The index table was divided by column family into separate tables.
    [Diagram: Before, one table holds Family A and Family B, with row ranges 0-1 M through 4-5 M spread across Machines 1 to 5; scanning “Family A” hits 5 machines. After, Table A and Table B are separate tables; scanning Table A hits 3 machines.]
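  • 14. Methods to move processing near the data (Key Learning 4)
    • Sent filtered rows over the network.
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.filter.Filter;

    public class TermFilter implements Filter {
        // Runs on the region server: keep a cell only if it matches the term.
        public ReturnCode filterKeyValue(KeyValue kv) {
            boolean isMatched = isFound(kv); // isFound: matching helper, not shown on the slide
            if (isMatched) return ReturnCode.INCLUDE;
            return ReturnCode.NEXT_ROW; // no match: skip the rest of this row
        }
        // … (remaining Filter methods are not shown on the slide)
    }

    E.g. Matched rows for a keyword
    • Sent relevant fields of a row over the network.
    • 15. Sent the relevant section of a field over the network.
    import java.util.List;

    public class DocFilter implements Filter {
        // Runs on the region server: replace the row's cells with only the needed piece.
        public void filterRow(List<KeyValue> kvL) {
            byte[] val = extractNeededPiece(kvL);
            kvL.clear();
            // row, fam and qual are assumed fields of the filter; the slide leaves the qualifier argument blank
            kvL.add(new KeyValue(row, fam, qual, val));
        }
        // ….
    }
    E.g. Computing a best match section from within a document for a given query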
  • 16. Byte block caching to lower network and I/O trips to HBase (Key Learning 5)
    Object caching: with a growing number of objects we encountered ‘Out of Memory’ exceptions.
    HBase commit: frequent flushing to HBase introduced network and I/O latencies.

    Converting objects to intermediate byte blocks increased the number of records processed in one batch by 20x.
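
A rough sketch of the byte-block idea follows. It is not the HSearch implementation: the record shape (a document id plus a term frequency), the class name and the in-memory list standing in for HBase are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: buffer records as raw bytes and hand them on one block at a time.
final class ByteBlockBuffer {
    static final int RECORD_BYTES = 2 * Integer.BYTES; // docId + termFrequency

    private final ByteBuffer block;
    // Stand-in for the store; a real sink would issue one HBase write per flushed block.
    final List<byte[]> flushedBlocks = new ArrayList<>();

    ByteBlockBuffer(int blockBytes) {
        if (blockBytes < RECORD_BYTES) {
            throw new IllegalArgumentException("block must hold at least one record");
        }
        block = ByteBuffer.allocate(blockBytes);
    }

    // Append one record as 8 raw bytes instead of keeping a Java object alive per record.
    void add(int docId, int termFrequency) {
        if (block.remaining() < RECORD_BYTES) {
            flush(); // block is full: emit it before writing the next record
        }
        block.putInt(docId).putInt(termFrequency);
    }

    // Emit the bytes written so far as one block and start a fresh one.
    void flush() {
        if (block.position() == 0) {
            return; // nothing buffered
        }
        flushedBlocks.add(Arrays.copyOf(block.array(), block.position()));
        block.clear();
    }
}
```

The buffer holds one preallocated array rather than one object per record, and the store sees one write per block rather than one per record, which is the trade-off the slide describes.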
  • 17. Configuration to balance Network vs. CPU vs. I/O vs. Memory (Key Learning 6)
    In a Single Machine
    [Diagram: Disk I/O, Memory, CPU and Network, linked by the tuning levers Block Caching, Compression, Aggressive GC and IPC Caching.]
  • 18. … and its settings
    Network: Increased IPC cache limits (hbase.client.scanner.caching)
    CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap")
    I/O: LZO index compression (“Inbuilt oberhumer LZO” or “Intel IPP native LZO”)
    Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server.
  • 19. .. and parallelized across multiple machines
    • HTable.batch (Get, Put, Deletes)
    • 20. ParallelHTable (Scans)
    • 21. Future: coprocessors (HBase 0.92 release).
    Allocated appropriate resources through dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count
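
The idea of running scans in parallel can be sketched with the standard library alone. This is only an illustration of the pattern, not the ParallelHTable class named above: the class name, the numeric row keys and the `scanRange` callback (which stands in for a real HBase scan over a row range) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

// Hypothetical sketch: split a row range into sub-ranges and scan them concurrently.
final class ParallelRangeScan {
    // Scans [start, end) in up to `parts` contiguous slices; results keep slice order.
    static <T> List<T> scan(long start, long end, int parts,
                            BiFunction<Long, Long, List<T>> scanRange) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            long step = Math.max(1, (end - start + parts - 1) / parts);
            List<Future<List<T>>> pending = new ArrayList<>();
            for (long lo = start; lo < end; lo += step) {
                final long from = lo;
                final long to = Math.min(end, lo + step);
                pending.add(pool.submit(() -> scanRange.apply(from, to)));
            }
            List<T> results = new ArrayList<>();
            for (Future<List<T>> slice : pending) {
                results.addAll(slice.get()); // blocks until that slice has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("slice scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real deployment the slices would normally follow region boundaries, so that each worker talks to one region server.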
  • 22. HSearch Benchmarks on AWS
    11 Amazon Large instances (7.5 GB memory each), each with a single 7.5K SATA drive
    100 million Wikipedia pages, 270 GB in total, completely indexed (including common stopwords)
    10 million pages repeated 10 times (total indexing time: 5 hours)
    Search query response time:
    a regular word takes 1.5 sec
    a common word such as “hill” found 1.6 million matches, sorted in 7 seconds
  • 23. www.sourceforge.net/bizosyshsearch
    Apache-licensed (2 versions released)
    Distributed, real-time search
    Supports XML documents with rich search syntax and various filtration criteria such as document type, field type.
  • 24. References
    Initial performance reports: Bizosys HSearch, a NoSQL search engine, featured in Intel Cloud Builders Success Stories (July 2010)
    HSearch is currently in use at http://www.10screens.com
    More on hsearch
    http://www.bizosys.com/blog/
    http://bizosyshsearch.sourceforge.net/
    More on SSD Product Technical Specifications http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
