Geo-based content processing using HBase
Presented at Chicago Data Summit 2011 hosted by Cloudera.

Geo-based Content Processing using HBase: Presentation Transcript

  • 1. Geo-based Content Processing using HBase
    Ravi Veeramachaneni
    NAVTEQ
    1
  • 2. Agenda
    • Problem
    • Solution
    • Why HBase?
    • Challenges
    • Why we chose Cloudera?
    2
  • 7. Problem
    • Ineffective to scale out
    • Cost: expensive Oracle licensing
    • Technology: inherent limitations of RDBMS
    • Need to support flexible functionality
    • Need to decouple content from the map
    • Need flexibility to quickly add new content providers
    • Need to support community input
    • Need for real-time data availability
    • Content needs to be updated and delivered much faster than before
    • Unable to deliver richer, more contextual content quickly and efficiently
    • Need to support customers with both connected and disconnected devices
    3
  • 18. Our Data is Constantly Growing
    • Content breadth
    Hundreds of millions of content records
    Hundreds of content suppliers, plus community input
    • Content depth
    On average, a content record has 120 attributes
    Certain types of content have more than 400 attributes
    Content classified across 270+ categories
    • Our content is
    Sparse and unstructured
    Provided in multiple data formats
    Ingested, processed and delivered in both transactional and batch modes
    Constantly growing (well into the terabytes)
    4
  • 19. Our Approach for Content Processing Challenges
    • Provide a scalable content processing platform to handle spikes in processing demand
    • Horizontal scalability (Hadoop/HBase)
    • Provide a business rules management system to adapt to changing processing needs
    • Flexible business rules based on supplier and content information
    • Flexible business flows and processing to meet SLAs
    • Deliver high-value, high-quality content to customers fast
    • Corroborate multiple sources to produce the best-quality information
    • Use open source software with commercial support wherever applicable
    5
  • 27. Content Processing Overview
    [Architecture diagram: suppliers 1..n feed batch and real-time ingestion into source and blended record management, which assigns Location IDs and Permanent IDs; publishing is real-time and on-demand.]
    6
  • 28. System Decomposition
    7
  • 29. Why HBase?
    • HBase scales (runs on top of Hadoop)
    • HBase stores nothing for null values
    • Saves both disk space and disk I/O time
    • HBase supports unstructured data through column families
    • HBase has built-in version management
    • HBase provides fast table scans over time ranges and fast key-based lookups (see the sketch after this slide)
    • MapReduce data input
    • Tables are sorted and have unique keys
    The reducer is often optional
    A combiner is not needed
    • Strong community support and widening adoption
    8
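    A minimal sketch of the two access patterns called out above (key-based lookup and time-range scan), written against the HBase 0.90-era Java client. The table name and row key are illustrative assumptions, not NAVTEQ's actual schema.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Bytes;

        public class ContentLookup {
            public static void main(String[] args) throws IOException {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "content");        // hypothetical table name

                // Fast key-based lookup: fetch a single content record by row key.
                Result record = table.get(new Get(Bytes.toBytes("poi-0001")));  // hypothetical key
                System.out.println("cells returned: " + record.size());

                // Fast scan restricted to a time range: only cells written in the last hour are returned.
                long now = System.currentTimeMillis();
                Scan scan = new Scan();
                scan.setTimeRange(now - 3600 * 1000L, now);
                ResultScanner scanner = table.getScanner(scan);
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
                scanner.close();
                table.close();
            }
        }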
  • 37. HBase @NAVTEQ
    • Started in late 2009 with HBase 0.19.x (Apache)
    8-node VMware sandbox cluster
    Flaky, unstable, region server failures
    Switched to CDH; no Cloudera support yet
    • Early 2010, HBase 0.20.x (CDH2)
    10-node physical sandbox cluster
    Still had a lot of challenges: region server failures, META corruption
    Cluster expanded significantly, with multiple environments
    Signed Cloudera support, but no official Cloudera support for HBase
    • Late 2010, HBase 0.89 (CDH3b3)
    More stable than any version before it
    Multiple teams at NAVTEQ exploring Hadoop/HBase
    Officially Cloudera-supported version
    • Current: HBase 0.90.1
    Waiting to move to the official CDH3 release
    Looking into Hive, Oozie, Lucene/Solr integration, Cloudera Enterprise and a few others
    Several initiatives to use HBase
    9
  • 56. HBase @NAVTEQ
    • Hardware / Environment
    • Dell R410 servers
    • 64 GB ECC RAM
    • 4 x 2 TB disks (JBOD)
    • RHEL 5.4
    • 500+ CPU cores
    • 200+ TB configured disk capacity
    • Multiple environments
    DEVELOPMENT
    INTEGRATION
    STAGING
    PRODUCTION
    10
  • 64. Challenges/Lessons Learned
    • Database/schema design
    • Transition to a column-oriented, flat schema
    • Row-key design and implementation (see the sketch after this slide)
    • Sequential keys
    Suffer from poor load distribution, but make good use of the block cache
    Can be addressed by pre-splitting the regions
    • Randomized keys give better distribution
    Achieved by hashing the key attributes with SHA-1
    Range scans suffer
    • Too many column families
    • Initially we had about 30; now reduced to 8
    • Compression
    • LZO did not work out well with CDH2; using the default block compression
    • Need to revisit with CDH3
    11
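    The row-key trade-off above is easiest to see in code, so here is a minimal sketch of both options under stated assumptions: a content record is identified by a supplier ID plus a supplier-local record ID, and the SHA-1 hashing and split points are illustrative, not NAVTEQ's actual key scheme.

        import java.security.MessageDigest;
        import java.security.NoSuchAlgorithmException;
        import org.apache.hadoop.hbase.util.Bytes;

        public class RowKeys {
            // Hashed keys: SHA-1 over the key attributes spreads writes evenly
            // across regions, at the cost of losing meaningful range scans.
            public static byte[] hashedKey(String supplierId, String recordId)
                    throws NoSuchAlgorithmException {
                MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                return sha1.digest(Bytes.toBytes(supplierId + "/" + recordId));
            }

            // Sequential keys: keep natural ordering (good for the block cache and
            // range scans) and ease hotspotting by pre-splitting the table instead.
            // The boundaries below are made up for illustration; the table could be
            // created pre-split with
            //   new HBaseAdmin(conf).createTable(new HTableDescriptor("content"), splitPoints());
            public static byte[][] splitPoints() {
                return new byte[][] {
                    Bytes.toBytes("supplier-025"),
                    Bytes.toBytes("supplier-050"),
                    Bytes.toBytes("supplier-075"),
                };
            }
        }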
  • 72. Challenges/Lessons Learned
    • Serialization
    • Avro did not work well for us (deserialization issues)
    • Developed a configurable serialization mechanism that uses JSON for everything except Date types (see the sketch after this slide)
    • Secondary indexes
    • Were using ITHBase and IHBase from contrib; they did not work well
    • Redesigned the schema to avoid the need for an index
    • We still need one, though
    • Performance
    • Several tunable parameters
    Hadoop, HBase, OS, JVM, networking, hardware
    • Scalability
    • Interfacing real-time (interactive) systems with a batch-oriented system
    12
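    A sketch of what a JSON-except-Date serialization layer can look like when writing a sparse attribute map into HBase. The column family name is a made-up example, and Jackson 1.x is assumed as the JSON library; the slide does not say which one NAVTEQ actually used.

        import java.io.IOException;
        import java.util.Date;
        import java.util.Map;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.codehaus.jackson.map.ObjectMapper;   // Jackson 1.x, assumed available

        public class RecordSerializer {
            private static final ObjectMapper MAPPER = new ObjectMapper();
            private static final byte[] CF = Bytes.toBytes("attrs");   // hypothetical column family

            // One cell per attribute: absent attributes cost nothing (sparse rows),
            // Date values are stored as epoch millis, everything else as JSON.
            public static Put toPut(byte[] rowKey, Map<String, Object> attributes) throws IOException {
                Put put = new Put(rowKey);
                for (Map.Entry<String, Object> e : attributes.entrySet()) {
                    Object value = e.getValue();
                    byte[] cell = (value instanceof Date)
                            ? Bytes.toBytes(((Date) value).getTime())
                            : MAPPER.writeValueAsBytes(value);
                    put.add(CF, Bytes.toBytes(e.getKey()), cell);
                }
                return put;
            }
        }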
  • 82. Challenges/Lessons Learned
    • Configuring HBase
    • Configuration is key
    • Many moving parts: typos, configs drifting out of sync
    • Operating system
    Raise the open-file limit (ulimit) to 32K or higher
    Lower vm.swappiness, or set it to 0
    • HDFS
    Adjust the block size to the use case
    Increase xcievers to 2047 (dfs.datanode.max.xcievers)
    Set the socket write timeout to 0 (dfs.datanode.socket.write.timeout)
    • HBase
    Needs more memory
    Don't run ZooKeeper on the DataNodes; use a separate ZooKeeper quorum
    No swapping; the JVM hates it
    GC pauses can cause timeouts or region server failures (see Todd Lipcon's article on avoiding full GCs)
    13
  • 86. Challenges/Lessons Learned
    • HBase
    Write (put) optimization (see Ryan Rawson's HUG8 presentation on HBase importing); the settings below are restated in a sketch after this slide
    hbase.regionserver.global.memstore.upperLimit=0.3
    hbase.regionserver.global.memstore.lowerLimit=0.15
    hbase.regionserver.handler.count=256
    hbase.hregion.memstore.block.multiplier=8
    hbase.hstore.blockingStoreFiles=25
    Control the number of store files via region size (hbase.hregion.max.filesize)
    • Security
    • Introduced in CDH3b3 but still in flux; we need robust RBAC
    • Reliability
    • The NameNode is a single point of failure
    • HBase is sensitive to region server failures
    14
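    For readability, the write-path settings listed above collected into one place. They are region server settings and, in a real deployment, belong in hbase-site.xml on every region server; setting them on a client-side Configuration, as below, is only a self-contained way to show them together. The hbase.hregion.max.filesize value is an illustrative assumption, since the slide names only the property.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public class WriteTuning {
            public static Configuration tuned() {
                Configuration conf = HBaseConfiguration.create();
                // Global memstore thresholds (fraction of region server heap).
                conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.3f);
                conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.15f);
                // More RPC handler threads for concurrent puts.
                conf.setInt("hbase.regionserver.handler.count", 256);
                // Let memstores and store files grow further before writes are blocked.
                conf.setInt("hbase.hregion.memstore.block.multiplier", 8);
                conf.setInt("hbase.hstore.blockingStoreFiles", 25);
                // Region size controls how many regions and store files accumulate;
                // 4 GB here is an assumption, the slide names only the property.
                conf.setLong("hbase.hregion.max.filesize", 4L * 1024 * 1024 * 1024);
                return conf;
            }
        }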
  • 91. Why Cloudera?
    • Initially, we had on-site training classes
    • Cluster configurations were reviewed and recommendations made
    • Tickets resolved within reasonable SLAs
    • Knowledgeable support team
    • Access to technology experts when needed
    • We have been on Cloudera support for almost a year now
    • Our needs and demands are growing, and we are looking toward enterprise support
    15
  • 98. Features Needed
    • Better operational tools for Hadoop and HBase
    • Job management, backup, restore, user provisioning, general administrative tasks, etc.
    • Support for secondary indexes
    • Full-text indexing and search (Lucene/Solr integration?)
    • HA support for the NameNode
    • Data replication for HA and DR
    • Best practices and supporting material
    16
  • 105. Thank You
    Q & A
    17