Slide 1: Geo-based Content Processing using HBase
Ravi Veeramachaneni, NAVTEQ
Presented at the Chicago Data Summit 2011, hosted by Cloudera.

Slide 2: Agenda
- Problem
- Solution
- Why HBase?
- Challenges
- Why we chose Cloudera?

Slide 3: Problem
- Ineffective to scale out
  - Cost: expensive Oracle license fees
  - Technology: inherent limitations of an RDBMS
- Need to support flexible functionality
  - Need to decouple content from the map
  - Need the flexibility to quickly add new content providers
  - Need to support community input
- Need for real-time data availability
  - Content needs to be updated and delivered much faster than before
  - Unable to deliver better, richer, faster contextual content more efficiently
  - Need to support customers with both connected and disconnected devices

Slide 4: Our Data is Constantly Growing
- Content breadth
  - 100s of millions of content records
  - 100s of content suppliers, plus community input
- Content depth
  - On average, a content record has 120 attributes
  - Certain types of content have more than 400 attributes
  - Content is classified across 270+ categories
- Our content is
  - Sparse and unstructured
  - Provided in multiple data formats
  - Ingested, processed, and delivered in transactional and batch modes
  - Constantly growing (into the high-TB range)

Slide 5: Our Approach to the Content Processing Challenges
- Provide a scalable content processing platform to handle spikes in content processing demand
  - Provide horizontal scalability (Hadoop/HBase)
- Provide a business rules management system to adapt to changing processing needs
  - Flexible business rules based on the supplier and content information
  - Flexible business flows and processing to meet SLAs
- Provide high-value, high-quality content to customers fast
  - Corroborate multiple sources to produce the best-quality information
- Utilize open source software, with commercial support, wherever applicable

Slide 6: Content Processing Overview
[Architecture diagram] Batch and real-time ingestion from Suppliers 1..n; Location ID and Permanent ID assignment; source and blended record management; publishing of Permanent IDs and Location IDs in real time, on demand.

Slide 7: System Decomposition
[System decomposition diagram]

Slide 8: Why HBase?
- HBase scales (runs on top of Hadoop)
- HBase stores null values for free
  - Saves both disk space and disk I/O time
- HBase supports unstructured data through column families
- HBase has built-in version management
- HBase provides fast table scans over time ranges and fast key-based lookups (see the sketch after this slide)
- HBase tables work as MapReduce data input
  - Tables are sorted and have unique keys
  - The reducer is often optional
  - A combiner is not needed
- Strong community support and wide adoption

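To make the two access patterns above concrete, here is a minimal client-side sketch of a time-range scan and a key-based lookup. It assumes the HBase 0.90-era Java client API that was current at the time of this talk; the table name poi_content, the column family attrs, and the row key are hypothetical placeholders, not names taken from the presentation.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AccessPatterns {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table holding content records, one row per record.
            HTable table = new HTable(conf, "poi_content");

            // 1) Table scan restricted to a time range: only cells written in the
            //    last hour are returned, using HBase's built-in cell timestamps.
            long now = System.currentTimeMillis();
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("attrs"));
            scan.setTimeRange(now - 3600 * 1000L, now);
            ResultScanner scanner = table.getScanner(scan);
            for (Result row : scanner) {
                System.out.println("recently updated row: " + Bytes.toStringBinary(row.getRow()));
            }
            scanner.close();

            // 2) Fast key-based lookup: a single Get by row key.
            Get get = new Get(Bytes.toBytes("some-record-key"));
            Result record = table.get(get);
            System.out.println("columns in record: " + record.size());

            table.close();
        }
    }
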
Slide 9: HBase @ NAVTEQ
- Late 2009: HBase 0.19.x (Apache)
  - 8-node VMware sandbox cluster
  - Flaky, unstable, RegionServer failures
  - Switched to CDH; no Cloudera support
- Early 2010: HBase 0.20.x (CDH2)
  - 10-node physical sandbox cluster
  - Still had a lot of challenges: RegionServer failures, META corruption
  - Cluster expanded significantly, with multiple environments
  - Signed Cloudera support, but no official Cloudera support for HBase yet
- Late 2010: HBase 0.89 (CDH3b3)
  - More stable than any previous version
  - Officially Cloudera-supported version
  - Multiple teams at NAVTEQ exploring Hadoop/HBase
- Current: HBase 0.90.1
  - Waiting to move to the official CDH3 release
  - Looking into Hive, Oozie, Lucene/Solr integration, Cloudera Enterprise, and a few others
  - Several initiatives to use HBase

Slide 10: HBase @ NAVTEQ
- Hardware / environment
  - Dell R410 servers
  - 64 GB RAM (ECC)
  - 4 x 2 TB disks (JBOD)
  - RHEL 5.4
  - 500+ CPU cores
  - 200+ TB configured disk capacity
- Multiple environments: development, integration, staging, production

Slide 11: Challenges / Lessons Learned
- Database/schema design
  - Transition to a column-oriented, flat schema
- Row-key design and implementation (see the sketch after this slide)
  - Sequential keys suffer from poor load distribution, but make good use of the block cache
    - Can be addressed by pre-splitting the regions
  - Randomized keys give better distribution, achieved by hashing the key attributes with SHA-1
    - Range scans suffer
- Too many column families
  - Initially we had about 30 or so; now reduced to 8
- Compression
  - LZO did not work out well with CDH2, so we use the default block compression
  - Need to revisit with CDH3

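As a rough illustration of the row-key approach described above, the sketch below hashes a record's key attributes with SHA-1 to form the row key and pre-splits the table into 16 regions at creation time. It assumes the HBase 0.90-era admin API; the chosen key attributes (supplier ID and supplier record ID), the table name poi_content, and the column family attrs are hypothetical, since the slide only states that keys are SHA-1 hashes of key attributes.

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesign {

        /** Build a row key by SHA-1 hashing the record's key attributes.
         *  Hashing spreads writes evenly across regions, at the cost of range scans. */
        public static byte[] rowKey(String supplierId, String supplierRecordId)
                throws NoSuchAlgorithmException {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(Bytes.toBytes(supplierId));
            sha1.update(Bytes.toBytes(supplierRecordId));
            return sha1.digest();   // 20-byte binary key
        }

        /** Create the table pre-split into 16 regions so hashed keys are
         *  distributed immediately instead of all writes hitting one region. */
        public static void createPreSplitTable(Configuration conf) throws IOException {
            byte[][] splits = new byte[15][];
            for (int i = 0; i < 15; i++) {
                splits[i] = new byte[] { (byte) ((i + 1) * 16) };  // 0x10, 0x20, ... 0xF0
            }
            HTableDescriptor desc = new HTableDescriptor("poi_content");
            desc.addFamily(new HColumnDescriptor("attrs"));
            new HBaseAdmin(conf).createTable(desc, splits);
        }

        public static void main(String[] args) throws Exception {
            createPreSplitTable(HBaseConfiguration.create());
            System.out.println(Bytes.toStringBinary(rowKey("supplier-42", "record-0001")));
        }
    }
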
Slide 12: Challenges / Lessons Learned
- Serialization
  - Avro did not work well for us (a deserialization issue)
  - Developed a configurable serialization mechanism that uses JSON for everything except Date values (see the sketch after this slide)
- Secondary indexes
  - Were using ITHBase and IHBase from contrib; they did not work well
  - Redesigned the schema to avoid the need for an index, though we still need one eventually
- Performance
  - Several tunable parameters: Hadoop, HBase, OS, JVM, networking, hardware
- Scalability
  - Interfacing a batch-oriented system with real-time (interactive) systems

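The slide does not say how the configurable JSON-except-Date serialization was implemented, so the sketch below is only one plausible shape for it: general attribute values go through a JSON mapper (Jackson is assumed here purely for illustration), while Date values are stored as raw 8-byte epoch-millis so they stay compact and sortable. The per-type configurability mentioned on the slide is omitted.

    import java.io.IOException;
    import java.util.Date;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Serializes attribute values into HBase cell bytes: JSON for general
     *  objects, but Dates as raw epoch-millis longs to avoid JSON date-format
     *  ambiguity and keep them byte-sortable. */
    public class CellSerializer {
        private final ObjectMapper mapper = new ObjectMapper();

        public byte[] serialize(Object value) throws IOException {
            if (value instanceof Date) {
                return Bytes.toBytes(((Date) value).getTime());   // 8-byte long
            }
            return mapper.writeValueAsBytes(value);               // UTF-8 JSON
        }

        public Date deserializeDate(byte[] cell) {
            return new Date(Bytes.toLong(cell));
        }

        public <T> T deserialize(byte[] cell, Class<T> type) throws IOException {
            return mapper.readValue(cell, type);
        }
    }
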
Slide 13: Challenges / Lessons Learned
- Configuring HBase
  - Configuration is key
  - Many moving parts: typos, files getting out of sync across nodes
- Operating system
  - Raise the number of open files (ulimit) to 32K or even higher
  - Lower vm.swappiness, or set it to 0
- HDFS
  - Adjust the block size based on the use case
  - Increase xceivers to 2047 (dfs.datanode.max.xcievers; note the property's historical spelling)
  - Set the socket write timeout to 0, i.e. disabled (dfs.datanode.socket.write.timeout)
- HBase (a configuration-check sketch follows this slide)
  - Needs more memory
  - Do not co-locate ZooKeeper with DataNodes; run a separate ZK quorum
  - No swapping; the JVM hates it
  - GC pauses can cause timeouts or RegionServer failures (see Todd Lipcon's article on avoiding full GCs)

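The HDFS settings above live in hdfs-site.xml on the cluster nodes, and the OS settings in limits.conf and sysctl, so they cannot be set from application code. As a small hedge against the "typos and out-of-sync files" problem mentioned on this slide, the sketch below merely reads the client-visible configuration and flags values that differ from the numbers quoted here; it is an illustrative check, not part of the presented system.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    /** Prints the effective values of the HDFS settings named on the slide, as
     *  seen from whatever *-site.xml files are on the classpath. The settings
     *  themselves belong in hdfs-site.xml on the DataNodes; this only checks
     *  that the client-visible configuration matches the recommended values. */
    public class ConfigCheck {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // Recommended on the slide: 2047 concurrent xceiver threads per DataNode.
            int xcievers = conf.getInt("dfs.datanode.max.xcievers", 256);
            System.out.println("dfs.datanode.max.xcievers = " + xcievers
                    + (xcievers < 2047 ? "  <-- consider raising to 2047" : ""));

            // Recommended on the slide: 0 (disables the DataNode socket write timeout).
            long writeTimeout = conf.getLong("dfs.datanode.socket.write.timeout", 480000L);
            System.out.println("dfs.datanode.socket.write.timeout = " + writeTimeout
                    + (writeTimeout != 0 ? "  <-- slide recommends 0" : ""));

            // HDFS block size is use-case dependent; just report it.
            System.out.println("dfs.block.size = "
                    + conf.getLong("dfs.block.size", 64L * 1024 * 1024));
        }
    }
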
Slide 14: Challenges / Lessons Learned
- HBase write (put) optimization (from Ryan Rawson's HUG8 presentation on HBase importing; a client-side sketch follows this slide)
  - hbase.regionserver.global.memstore.upperLimit = 0.3
  - hbase.regionserver.global.memstore.lowerLimit = 0.15
  - hbase.regionserver.handler.count = 256
  - hbase.hregion.memstore.block.multiplier = 8
  - hbase.hstore.blockingStoreFiles = 25
  - Control the number of store files via hbase.hregion.max.filesize
- Security
  - Introduced in CDH3b3 but still in flux; we need robust RBAC
- Reliability
  - The NameNode is a single point of failure
  - HBase is sensitive to RegionServer failures

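The parameters above are server-side settings in hbase-site.xml. The HUG8 material the slide cites also covered client-side import techniques; the sketch below shows one such commonly used technique of that era, buffering puts on the client by disabling auto-flush, which is assumed here rather than stated on the slide. It uses the HBase 0.90-era client API, and the table and column-family names are hypothetical.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BulkImport {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "poi_content");   // hypothetical table

            // Buffer puts on the client instead of issuing one RPC per put.
            table.setAutoFlush(false);
            table.setWriteBufferSize(12L * 1024 * 1024);       // 12 MB client write buffer

            List<Put> batch = new ArrayList<Put>();
            for (int i = 0; i < 100000; i++) {
                // A production import would use the hashed row key from the earlier sketch.
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("attrs"), Bytes.toBytes("name"), Bytes.toBytes("value-" + i));
                batch.add(put);
                if (batch.size() == 1000) {                    // send in chunks of 1,000 puts
                    table.put(batch);
                    batch.clear();
                }
            }
            table.put(batch);
            table.flushCommits();                              // push any puts still buffered
            table.close();
        }
    }
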
Slide 15: Why Cloudera?
- Initially, we had on-site training classes
- Cluster configurations were reviewed and recommendations were made
- Tickets are resolved within reasonable SLAs
- Knowledgeable support team
- Access to technology experts when needed
- We have had Cloudera support for almost a year now
- Our needs and demands are increasing, and we are looking toward enterprise support

Slide 16: Features Needed
- Better operational tools for using Hadoop and HBase
  - Job management, backup, restore, user provisioning, general administrative tasks, etc.
- Support for secondary indexes
- Full-text indexes and search (Lucene/Solr integration?)
- HA support for the NameNode
- Data replication for HA and DR
- Best practices and supporting material

Slide 17: Thank You
Q & A
