Geo-based Content Processing using HBase

Presented at Chicago Data Summit 2011 hosted by Cloudera.

Transcript

  • 1. Geo-based Content Processing using HBase
    Ravi Veeramachaneni
    NAVTEQ
  • 2. Agenda
  • 7. Problem
    • Ineffective to scale out
    • Cost: expensive Oracle license costs
    • Technology: inherent limitations of an RDBMS
    • Need to support flexible functionality
    • Need to decouple content from the map
    • Need flexibility to quickly add new content providers
    • Need to support community input
    • Need for real-time data availability
    • Content needs to be updated and delivered much faster than before
    • Unable to deliver better, richer, faster contextual content efficiently
    • Need to support customers with both connected and disconnected devices
  • 18. Our Data is Constantly Growing
    • Content breadth
    100s of millions of content records
    100s of content suppliers + community input
    • Content depth
    On average, a content record has 120 attributes
    Certain types of content have more than 400 attributes
    Content classified across 270+ categories
    • Our content is
    Sparse and unstructured
    Provided in multiple data formats
    Ingested, processed and delivered in transactional and batch modes
    Constantly growing (into the high terabytes)
  • 19. Our Approach for Content Processing Challenges
    • Provide a scalable content processing platform to handle spikes in content processing demand
    • Provide horizontal scalability (Hadoop/HBase)
    • Provide a business rules management system to adapt to changing processing needs
    • Flexible business rules based on supplier and content information
    • Flexible business flows and processing to meet SLAs
    • Provide high-value, high-quality content to customers quickly
    • Corroborate multiple sources to produce the best quality information
    • Utilize open source software, with commercial support, wherever applicable
  • 27. Content Processing Overview
    (Diagram: batch and real-time ingestion from Suppliers 1..n; Location ID and Permanent ID assignment; source and blended record management; real-time, on-demand publishing.)
  • 28. System Decomposition
  • 29. Why HBase?
    • HBase scales (runs on top of Hadoop)
    • HBase stores null values for free
    Saves both disk space and disk I/O time
    • HBase supports unstructured data through column families
    • HBase has built-in version management
    • HBase provides fast table scans for time ranges and fast key-based lookups (see the sketch after this list)
    • Tables make good MapReduce input
    Tables are sorted and have unique keys
    The reducer is often optional and a combiner is not needed
    • Strong community support and wide adoption
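
    A minimal sketch of the two access patterns mentioned above, assuming the HBase 0.90-era Java client API; the table name "content", the column family, and the row keys are illustrative assumptions, not NAVTEQ's actual schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AccessPatterns {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "content");   // illustrative table name

        // Fast key-based lookup; built-in versioning returns cell history for free.
        Get get = new Get(Bytes.toBytes("some-row-key"));
        get.setMaxVersions(3);
        Result row = table.get(get);
        System.out.println("cells in row: " + row.size());

        // Fast scan restricted to a time range (here: cells written in the last hour),
        // which suits incremental content-processing runs.
        long now = System.currentTimeMillis();
        Scan scan = new Scan();
        scan.setTimeRange(now - 3600000L, now);
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          System.out.println(Bytes.toStringBinary(r.getRow()));
        }
        scanner.close();
        table.close();
      }
    }
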
  • 37. HBase @NAVTEQ
    • Late 2009: HBase 0.19.x (Apache)
    8-node VMware sandbox cluster
    Flaky, unstable, region server failures
    Switched to CDH; no Cloudera support
    • Early 2010: HBase 0.20.x (CDH2)
    10-node physical sandbox cluster
    Still had a lot of challenges: region server failures, META corruption
    Cluster expanded significantly with multiple environments
    Signed Cloudera support, but no official Cloudera support for HBase
    • Late 2010: HBase 0.89 (CDH3b3)
    More stable than any other version in the past
    Officially Cloudera-supported version
    Multiple teams at NAVTEQ exploring Hadoop/HBase
    • Current: HBase 0.90.1
    Waiting to move to the official CDH3 release
    Looking into Hive, Oozie, Lucene/Solr integration, Cloudera Enterprise and a few others
    Several initiatives to use HBase
  • 56. HBase @NAVTEQ
    Cluster environments: Development, Integration, Staging, Production
  • 64. Challenges/Lessons Learned
    • Database/schema design
    Transition to a column-oriented (flat) schema
    • Row-key design/implementation (see the sketch after this list)
    Sequential keys: suffer from uneven load distribution but use the block cache well; can be addressed by pre-splitting the regions
    Randomized keys (SHA-1 hash of key attributes): better distribution, but range scans suffer
    • Too many column families
    Initially we had about 30 or so, now reduced to 8
    • Compression
    LZO didn't work out well with CDH2, so we use the default block compression
    Need to revisit with CDH3
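
    A minimal sketch of the row-key ideas above, assuming the HBase 0.90-era client API; the helper, the table/family names, and the split points are illustrative assumptions, not the production schema:

    import java.security.MessageDigest;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesign {

      // Hypothetical helper: prefix the natural key with a SHA-1 hash of the key
      // attributes so writes spread evenly across regions (at the cost of range scans).
      static byte[] hashedRowKey(String supplierId, String contentId) throws Exception {
        byte[] natural = Bytes.toBytes(supplierId + "|" + contentId);
        byte[] hash = MessageDigest.getInstance("SHA-1").digest(natural);
        return Bytes.add(hash, natural);
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Pre-split the table over the hash key space so even a sequential-looking
        // ingest does not land on a single region server.
        HTableDescriptor desc = new HTableDescriptor("content");   // illustrative table name
        desc.addFamily(new HColumnDescriptor("attrs"));
        byte[][] splits = new byte[][] {
            new byte[] { (byte) 0x40 },
            new byte[] { (byte) 0x80 },
            new byte[] { (byte) 0xc0 }
        };
        admin.createTable(desc, splits);   // four regions to start with

        // Example key for a hypothetical supplier/content pair.
        System.out.println(Bytes.toStringBinary(hashedRowKey("supplier-42", "poi-0001")));
      }
    }
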
  • 72. Challenges/Lessons Learned
    • Serialization (a sketch follows this list)
    Avro didn't work well for us – deserialization issues
    Developed a configurable serialization mechanism that uses JSON for everything except the Date type
    • Secondary indexes
    Were using ITHBase and IHBase from contrib – they don't work well
    Redesigned the schema to remove the need for an index, though we still need one
    • Performance
    Many tunable parameters: Hadoop, HBase, OS, JVM, networking, hardware
    • Scalability
    Interfacing with real-time (interactive) systems from a batch-oriented system
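
    A minimal sketch of the JSON-except-Date idea, assuming the HBase 0.90-era client API and a Jackson ObjectMapper for the JSON step (the actual NAVTEQ mechanism is configurable and not shown here); table, family and qualifier names are illustrative:

    import java.util.Date;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.codehaus.jackson.map.ObjectMapper;

    public class JsonCellWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "content");   // illustrative table name

        Map<String, Object> attrs = new HashMap<String, Object>();
        attrs.put("name", "Deep Dish Pizzeria");
        attrs.put("category", "restaurant");

        Put put = new Put(Bytes.toBytes("some-row-key"));
        // Sparse attributes serialized as one JSON document in a cell...
        String json = new ObjectMapper().writeValueAsString(attrs);
        put.add(Bytes.toBytes("attrs"), Bytes.toBytes("json"), Bytes.toBytes(json));
        // ...while Date values are kept out of the JSON and stored as epoch millis,
        // so they stay unambiguous and byte-comparable.
        put.add(Bytes.toBytes("attrs"), Bytes.toBytes("lastUpdated"),
                Bytes.toBytes(new Date().getTime()));
        table.put(put);
        table.close();
      }
    }
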
  • 82. Challenges/Lessons Learned
    • Configuring HBase
    Configuration is key
    Many moving parts – typos, configs out of synchronization (a sanity-check sketch follows this list)
    • Operating system
    Raise the open-file limit (ulimit -n) to 32K or even higher
    Lower vm.swappiness, or set it to 0
    • HDFS
    Adjust the block size based on the use case
    Increase xceivers to 2047 (dfs.datanode.max.xcievers)
    Set the socket write timeout to 0 (dfs.datanode.socket.write.timeout)
    • HBase
    Needs more memory
    Don't co-locate ZooKeeper on DataNodes – run a separate ZK quorum
    No swapping – the JVM hates it
    GC pauses can cause timeouts or region server failures (see Todd Lipcon's article on avoiding full GCs)
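
    Because many of these settings silently fall back to defaults when a property name is mistyped or a node's config files are stale, printing the effective values on each node can help; this is only an illustrative sketch using the standard Hadoop/HBase Configuration API, and the expected values simply mirror the slide above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ConfigSanityCheck {
      public static void main(String[] args) {
        // Loads *-site.xml files from this node's classpath, i.e. the
        // configuration the process will actually run with.
        Configuration conf = HBaseConfiguration.create();

        check(conf, "dfs.datanode.max.xcievers", "2047");
        check(conf, "dfs.datanode.socket.write.timeout", "0");
        check(conf, "hbase.regionserver.handler.count", "256");
      }

      // Print the effective value so typos and out-of-sync nodes are easy to spot.
      static void check(Configuration conf, String key, String expected) {
        String actual = conf.get(key);
        System.out.println(key + " = " + actual
            + (expected.equals(actual) ? "" : "   (expected " + expected + ")"));
      }
    }
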
  • 86. Challenges/Lessons Learned
    • HBase
    Write (put) optimization (see Ryan Rawson's HUG8 presentation on HBase importing, and the sketch after this list)
    hbase.regionserver.global.memstore.upperLimit=0.3
    hbase.regionserver.global.memstore.lowerLimit=0.15
    hbase.regionserver.handler.count=256
    hbase.hregion.memstore.block.multiplier=8
    hbase.hstore.blockingStoreFiles=25
    Control store file / region size (hbase.hregion.max.filesize)
    • Security
    Introduced in CDH3b3 but still in flux; we need robust RBAC
    • Reliability
    NameNode is a SPOF
    HBase is sensitive to region server failures
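
    The properties above belong in hbase-site.xml on the region servers; on the client side, bulk puts also benefit from batching writes rather than issuing one RPC per put, as in this minimal sketch against the 0.90-era API (table and family names, buffer size and row count are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BulkPutClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "content");   // illustrative table name

        // Buffer puts on the client and flush them in batches instead of one RPC per put.
        table.setAutoFlush(false);
        table.setWriteBufferSize(8 * 1024 * 1024);    // 8 MB client write buffer

        for (int i = 0; i < 100000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(Bytes.toBytes("attrs"), Bytes.toBytes("value"), Bytes.toBytes("v" + i));
          table.put(put);                             // goes into the client-side buffer
        }

        table.flushCommits();                         // push any remaining buffered puts
        table.close();
      }
    }
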
  • 91. Why Cloudera?
    • Initially, we had on-site training classes
    • Cluster configurations were reviewed and recommendations were made
    • Tickets resolved within reasonable SLAs
    • Knowledgeable support team
    • Access to technology experts when needed
    • With Cloudera support for almost a year now
    • Our needs and demands are increasing; we are looking toward enterprise support
  • 98. Features Needed
    • Better operational tools for using Hadoop and HBase
    Job management, backup, restore, user provisioning, general administrative tasks, etc.
    • Support for secondary indexes
    • Full-text indexing and search (Lucene/Solr integration?)
    • HA support for the NameNode
    • Data replication for HA & DR
    • Best practices and supporting material
  • 105. Thank You
    Q & A