Chicago Data Summit: Geo-based Content Processing Using HBase

NAVTEQ uses Cloudera Distribution including Apache Hadoop (CDH) and HBase with Cloudera Enterprise support to process and store location content data. With HBase and its distributed and column-oriented architecture, NAVTEQ is able to process large amounts of data in a scalable and cost-effective way.

1. Geo-based Content Processing using HBase
   Ravi Veeramachaneni, NAVTEQ

2. Agenda
   - Problem
   - Solution
   - Why HBase?
   - Challenges
   - Why Cloudera?

3. Problem
   - Ineffective to scale out
   - Cost: expensive Oracle license fees
   - Technology: inherent limitations of an RDBMS
   - Need for real-time data availability
   - Content needs to be updated and delivered much faster than before
   - Unable to deliver better, richer, more contextual content efficiently
   - Need to support customers with both connected and disconnected devices
   - Need to support flexible functionality
   - Need to support both structured and unstructured data
   - Need to decouple content from the map
   - Need the flexibility to quickly add new content providers
   - Need to support community input

4. Our Data Is Constantly Growing
   - Content breadth
     - Hundreds of millions of content records
     - Hundreds of content suppliers, plus community input
   - Content depth
     - On average, a content record has 120 attributes
     - Certain types of content have more than 400 attributes
     - Content classified across 270+ categories
   - Our content is
     - Sparse and unstructured
     - Provided in multiple data formats
     - Ingested, processed, and delivered in transactional and batch modes
     - Constantly growing (into the high terabytes)

5. Why HBase?
   - HBase scales, because it runs on top of Hadoop
   - HBase stores null values for free, saving both disk space and disk I/O time
   - HBase supports unstructured data through column families
     - Individual values are called cells
     - Cells are addressed by a row key / column family / column qualifier / timestamp tuple (see the client sketch after this slide)
   - HBase has built-in version management
   - HBase provides fast table scans over time ranges and fast key-based lookups
   - HBase tables work well as MapReduce input
     - Tables are sorted and have unique keys, so the reducer is often optional and a combiner is not needed
   - Strong community support and wide adoption

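To make the data model concrete, here is a minimal sketch using the classic Java client API of that era (0.20/0.90): a put that addresses a cell by row key, column family, and qualifier; a key-based get; and a time-range scan. The table name "content", family "attr", and qualifier "name" are hypothetical placeholders, not NAVTEQ's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ContentClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "content");   // hypothetical table name

        // Each cell is addressed by row key / column family / qualifier / timestamp.
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("attr"), Bytes.toBytes("name"), Bytes.toBytes("Coffee Shop"));
        table.put(put);   // absent attributes are simply never written (nulls cost nothing)

        // Fast key-based lookup.
        Get get = new Get(Bytes.toBytes("row-0001"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("attr"), Bytes.toBytes("name"))));

        // Time-range scan: only cells written in the last hour.
        long now = System.currentTimeMillis();
        Scan scan = new Scan();
        scan.setTimeRange(now - 3600 * 1000L, now);
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}
```
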
6. Our Approach to Content Processing Challenges
   - Provide a scalable content processing platform to handle spikes in processing demand
     - Horizontal scalability through Hadoop/HBase
   - Provide a business rules management system to adapt to changing processing needs
     - Flexible business rules based on supplier and content information
     - Flexible business flows and processing to meet SLAs
   - Provide high-value, high-quality content to customers quickly
     - Corroborate multiple sources to produce the best-quality information
   - Use open source software wherever applicable
   - Use commercial support for open source solutions for peace of mind

7. Content Processing Overview
   [Architecture diagram: suppliers feed batch and transactional ingestion into source and blended record management, keyed by Location ID and Permanent ID, with real-time, on-demand publishing.]

8. System Decomposition
   [System decomposition diagram.]

9. HBase @ NAVTEQ
   - Started in late 2009 with HBase 0.19.x (Apache)
     - 8-node VMware sandbox cluster
     - Flaky and unstable, with region server failures
     - Switched to CDH, but without Cloudera support
   - Early 2010: HBase 0.20.x (CDH2)
     - 10-node physical sandbox cluster
     - Still had many challenges: region server failures, META corruption
     - Cluster expanded significantly, with multiple environments
     - Signed Cloudera support, though there was no official Cloudera support for HBase yet
   - Late 2010: HBase 0.89 (CDH3b3)
     - More stable than any previous version
     - Multiple teams at NAVTEQ exploring Hadoop/HBase
     - Officially Cloudera-supported version
   - Current
     - Waiting to move to the official CDH3 release
     - Looking into Hive, Oozie, Lucene/Solr integration, Cloudera Enterprise, and a few others
     - Several initiatives to use HBase

10. HBase @ NAVTEQ
    - Hardware / environment (per node)
      - Dual quad-core CPUs with hyper-threading
      - 64 GB ECC RAM
      - 4 x 2 TB disks (JBOD)
      - RHEL 5.4
    - 500+ CPU cores
    - 200+ TB configured disk capacity
    - Multiple environments: development, integration, staging, production

11. Challenges / Lessons Learned
    - Database/schema design
      - Transition to a column-oriented, flat schema
    - Row-key design and implementation (see the sketch after this slide)
      - Sequential keys (or UUIDs) suffer from uneven load distribution but make good use of the block cache; can be addressed by pre-splitting the regions
      - Randomized keys give better distribution but hurt range scans; achieved by hashing on key attributes with SHA-1
    - Too many column families
      - Performance was horrible
      - Initially we had about 100; now down to 8
    - Compression
      - LZO did not work out well with CDH2, so we use the default block compression
      - Need to revisit with CDH3

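A minimal sketch of the two row-key tactics named above, assuming a hypothetical table "content" with one family "attr": hashing the record's key attributes with SHA-1 to randomize distribution, and pre-splitting the table so that sequential keys or an initial bulk load don't pile onto a single region. The split points here simply divide the first byte of the key space; real split points depend on the actual key distribution.

```java
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {

    // Randomized row key: SHA-1 over the record's key attributes.
    // Spreads writes evenly across regions, at the cost of range scans.
    static byte[] hashedKey(String supplierId, String recordId) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(Bytes.toBytes(supplierId));
        sha1.update(Bytes.toBytes(recordId));
        return sha1.digest();   // 20-byte key
    }

    // Pre-splitting: create the table with region boundaries up front.
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("content");  // hypothetical name
        desc.addFamily(new HColumnDescriptor("attr"));

        // 16 regions, split on the first byte of the SHA-1 key space: 0x10, 0x20, ..., 0xf0.
        byte[][] splits = new byte[15][];
        for (int i = 0; i < 15; i++) {
            splits[i] = new byte[] { (byte) ((i + 1) * 16) };
        }
        admin.createTable(desc, splits);
    }
}
```
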
12. Challenges / Lessons Learned
    - Serialization (see the serializer sketch after this slide)
      - Everything in HBase is stored as a byte array, with the exception of the timestamp
      - Avro did not work well for us (deserialization issues)
      - Developed a configurable serialization mechanism that uses JSON for everything except the Date type
    - Indexing
      - Were using ITHBase and IHBase from contrib; they did not work well
      - Redesigned the schema to avoid the need for an index, though we still need one
    - Cluster configuration
      - Configuration is key
      - Many moving parts: typos, files out of sync
      - Looking into automating deployments with Puppet
    - Hadoop
      - Adjust the block size based on the use case

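A sketch of what a configurable serializer along those lines might look like; this is an illustration, not NAVTEQ's actual mechanism. It assumes Jackson's ObjectMapper for the JSON path and stores Date values as an 8-byte long so they stay compact and byte-order comparable.

```java
import java.util.Date;

import org.apache.hadoop.hbase.util.Bytes;
import org.codehaus.jackson.map.ObjectMapper;

// Values are serialized to JSON bytes, except Date, which is stored as a long.
public class ValueSerializer {

    private final ObjectMapper mapper = new ObjectMapper();

    public byte[] serialize(Object value) throws Exception {
        if (value instanceof Date) {
            return Bytes.toBytes(((Date) value).getTime());  // 8-byte epoch millis
        }
        return mapper.writeValueAsBytes(value);              // JSON for everything else
    }

    public <T> T deserialize(byte[] bytes, Class<T> type) throws Exception {
        if (Date.class.isAssignableFrom(type)) {
            return type.cast(new Date(Bytes.toLong(bytes)));
        }
        return mapper.readValue(bytes, type);
    }
}
```
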
13. Challenges / Lessons Learned
    - Cluster configuration
      - HBase
        - Needs more memory
        - Don't run ZooKeeper on the DataNodes; use a separate ZooKeeper quorum
        - No swapping; the JVM hates it
        - GC pauses can cause timeouts or region server failures (see the article posted by Todd Lipcon on avoiding full GCs)
        - Write (put) optimization (Ryan Rawson's HUG8 presentation on HBase importing); a client-side counterpart is sketched after this slide
          - hbase.regionserver.global.memstore.upperLimit=0.3
          - hbase.regionserver.global.memstore.lowerLimit=0.15
          - hbase.hstore.blockingStoreFiles=25
          - hbase.hregion.memstore.block.multiplier=8
          - hbase.regionserver.handler.count=30
      - Operating system
        - Raise the open-file limit (ulimit -n) to 32K or even higher
        - Lower vm.swappiness, or set it to 0

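The properties above are server-side settings that belong in hbase-site.xml on the region servers. As a client-side complement for write-heavy (put) jobs, a common tactic of that era was to buffer puts and flush them in batches rather than issuing one RPC per put. A minimal sketch follows; the table and family names are hypothetical, the 8 MB buffer is an arbitrary example, and skipping the WAL trades durability for speed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "content");     // hypothetical table name

        // Buffer puts on the client and send them in batches.
        table.setAutoFlush(false);
        table.setWriteBufferSize(8 * 1024 * 1024);      // 8 MB client write buffer

        for (int i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes(String.format("row-%08d", i)));
            put.add(Bytes.toBytes("attr"), Bytes.toBytes("name"), Bytes.toBytes("value-" + i));
            // Optionally skip the WAL for re-runnable bulk loads (risk of data loss on crash):
            // put.setWriteToWAL(false);
            table.put(put);                             // buffered until the write buffer fills
        }

        table.flushCommits();                           // push any remaining buffered puts
        table.close();
    }
}
```
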
14. Challenges / Lessons Learned
    - Performance
      - Many tunable parameters (Hadoop, HBase, OS, JVM, networking, hardware)
    - Scalability
      - Interacting with transactional (interactive) systems from a batch-oriented system
    - Security
      - Introduced in CDH3b3 but still in flux; we need robust RBAC
    - Reliability
      - The NameNode is a single point of failure
      - HBase is sensitive to region server failures

15. Why Cloudera?
    - Had a few on-site training classes
    - Cluster configurations were reviewed and recommendations made
    - Tickets resolved within the SLAs
    - Knowledgeable support team
    - Access to technology experts when needed
    - Have had Cloudera support for almost a year
    - Our needs and demands are increasing, and we are looking toward enterprise support

16. Features Needed
    - Better operational tools for Hadoop and HBase: job management, backup, restore, user provisioning, general administrative tasks, etc.
    - Support for secondary indexes
    - Indexing and searching (Lucene/Solr integration?)
    - HA support for the NameNode
    - Data replication for HA and DR
    - Best practices and support material

17. Thank You
    Q & A
