Geo-based Content Processing using HBase
Presented at Chicago Data Summit 2011 hosted by Cloudera.

    Presentation Transcript

    • Geo-based Content Processing using HBase
      Ravi Veeramachaneni
    • Agenda
      • Problem
      • Solution
      • Why HBase?
      • Challenges
      • Why we chose Cloudera?
    • Problem
      • Ineffective to scale out
      • Cost: Expensive Oracle license costs
      • Technology: Inherent limitations of RDBMS
      • Need to support flexible functionality
      • Need to decouple content from the map
      • Need flexibility to quickly add new content providers
      • Need to support community input
      • Need for real-time data availability
      • Content needs to be updated and delivered much faster than before
      • Unable to deliver richer, more contextual content efficiently
      • Need to support both customers with connected and disconnected devices
    • Our Data is Constantly Growing
      • Content Breadth
      100s of millions of content records
      100s of content suppliers + community input
      • Content Depth
      On average, a content record has 120 attributes
      Certain types of content have more than 400 attributes
      Content classified across 270+ categories
      • Our content is
      Sparse and unstructured
      Provided in multiple data formats
      Ingested, processed and delivered in transactional and batch mode
      Constantly growing (into the high terabytes)
      • Provide scalable content processing platform to handle spikes in content processing demand
      • Providing for horizontal scalability (Hadoop/HBase)
      • Provide business rules management system to adapt to changing processing needs
      • Flexible business rules based on the supplier and content information
      • Flexible business flows and processing to meet SLA
      • Provide high value and high quality content to customers fast
      • Corroborate multiple sources to produce the best quality information
      • Utilize open source software wherever applicable with commercial support
      Our Approach for Content Processing Challenges
    • Content Processing Overview
      [Architecture diagram: batch and real-time ingestion from Suppliers 1 through n; Location ID and Permanent ID assignment; Source & Blended Record Management; real-time, on-demand delivery]
    • System Decomposition
      [System decomposition diagram]
    • Why HBase?
      • HBase Scales (runs on top of Hadoop)
      • HBase stores null values for free
      • Saves both disk space and disk IO time
      • HBase supports unstructured data through column families
      • HBase has built-in version management
      • HBase provides fast table scans for time ranges and fast key based lookups
      • MapReduce data input
      • Tables are sorted and have unique keys
      Reducer is often optional
      Combiner not needed
      • Strong community support and wider adoption
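The sorted-key and version properties above can be illustrated with a toy in-memory model. This is not the HBase client API, just a sketch of why sorted row keys make range scans cheap and how per-cell version management behaves; the class and key names are illustrative.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of an HBase-like table: rows kept sorted by key, each cell
// retaining multiple timestamped versions (newest first, capped at
// maxVersions). Illustrative only -- not the real HBase client API.
public class VersionedTable {
    private final NavigableMap<String, NavigableMap<Long, String>> rows = new TreeMap<>();
    private final int maxVersions;

    public VersionedTable(int maxVersions) { this.maxVersions = maxVersions; }

    public void put(String rowKey, long timestamp, String value) {
        NavigableMap<Long, String> versions =
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>(Comparator.reverseOrder()));
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.remove(versions.lastKey());   // evict the oldest version
        }
    }

    // Fast key-based lookup: newest version of the cell, or null if absent.
    public String get(String rowKey) {
        NavigableMap<Long, String> versions = rows.get(rowKey);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Fast range scan: because keys stay sorted, a scan is just a submap view.
    public SortedMap<String, NavigableMap<Long, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        VersionedTable t = new VersionedTable(3);
        t.put("poi#0001", 100L, "name=Cafe A");
        t.put("poi#0001", 200L, "name=Cafe A (renamed)");
        t.put("poi#0002", 100L, "name=Garage B");
        System.out.println(t.get("poi#0001"));   // newest version wins
        System.out.println(t.scan("poi#0001", "poi#0002").keySet());
    }
}
```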
    • HBase @NAVTEQ
      • Started in late 2009 with HBase 0.19.x (Apache)
      • 8-node VMware Sandbox Cluster
      • Flaky, unstable, RegionServer failures
      • Switched to CDH; no Cloudera support
      • Early 2010, HBase 0.20.x (CDH2)
      • 10-node physical Sandbox Cluster
      • Still had a lot of challenges: RegionServer failures, META corruption
      • Cluster expanded significantly with multiple environments
      • Signed Cloudera support; no official Cloudera support for HBase yet
      • Late 2010, HBase 0.89 (CDH3b3)
      • More stable than any version before it
      • Multiple teams at NAVTEQ exploring Hadoop/HBase
      • Officially Cloudera-supported version
      • Current: HBase 0.90.1
      • Waiting to move to the CDH3 official release
      • Looking into Hive, Oozie, Lucene/Solr integration, Cloudera Enterprise, and a few others
      • Several initiatives to use HBase
    • HBase @NAVTEQ
      • Hardware / Environment
      • DELL R410
      • 64GB RAM (ECC)
      • 4x2TB (JBOD)
      • RHEL 5.4
      • 500+ CPU cores
      • 200+ TB configured Disk Capacity
      • Multiple environments
      • Database/schema design
      • Transition to Column-oriented or flat schema
      • Row-key design/implementation
      • Sequential keys
      Suffer from poor write distribution but make good use of block caches
      Can be addressed by pre-splitting the regions
      • Randomized keys for better distribution
      Achieved through hashing on key attributes (SHA-1)
      Range scans suffer
      • Too many Column Families
      • Initially we had about 30 or so, now reduced to 8
      • Compression
      • LZO didn’t work out well with CDH2; using default block compression
      • Need to revisit with CDH3
      Challenges/Lessons Learned
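The hashed row-key approach above can be sketched as follows. The helper names and the supplier/content attributes are illustrative, not from the presentation; the point is that a hex-encoded SHA-1 key spreads writes evenly across pre-split regions, at the cost of range scans.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of SHA-1-hashed row keys: hash the natural key attributes so that
// writes distribute evenly across regions pre-split on the key space.
public class RowKeys {
    // Hex-encoded SHA-1 over the key attributes (e.g. supplier id + content id).
    public static String hashedKey(String... keyAttributes) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            for (String attr : keyAttributes) {
                sha1.update(attr.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 is always available", e);
        }
    }

    // With hex keys, pre-splitting is simple: bucket on the first hex digit,
    // so 16 regions each own one leading digit of the key space.
    public static int regionFor(String hashedKey, int numRegions) {
        int bucket = Character.digit(hashedKey.charAt(0), 16);
        return bucket * numRegions / 16;
    }

    public static void main(String[] args) {
        String key = hashedKey("supplier42", "poi-0001");
        System.out.println(key + " -> region " + regionFor(key, 16));
    }
}
```

The trade-off the slide names is visible here: two adjacent natural keys hash to unrelated row keys, so a range scan over the natural order is no longer possible.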
      • Serialization
      • Avro didn’t work well (deserialization issues)
      • Developed a configurable serialization mechanism that uses JSON, except for Date types
      • Secondary Indexes
      • Were using ITHBase and IHBase from contrib – they didn’t work well
      • Redesigned schema without need for index
      • We still need it though
      • Performance
      • Several tunable parameters
      Hadoop, HBase, OS, JVM, Networking, Hardware
      • Scalability
      • Interfacing with real-time (interactive) systems from a batch-oriented system
      Challenges/Lessons Learned
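A minimal sketch of the "JSON except Date" idea above: serialize a sparse record to JSON but store Date fields as epoch-millisecond longs, so they round-trip without locale or format ambiguity. This is hand-rolled for brevity and is not NAVTEQ's actual mechanism (which the slides don't show); a real implementation would use a JSON library with a custom Date codec.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative serializer: strings are quoted, numbers emitted as-is, and
// Dates written as epoch-millisecond longs instead of formatted strings.
public class RecordSerializer {
    public static String serialize(Map<String, Object> record) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : record.entrySet()) {
            if (!first) json.append(",");
            first = false;
            json.append("\"").append(e.getKey()).append("\":");
            Object v = e.getValue();
            if (v instanceof Date) {
                json.append(((Date) v).getTime());   // Date as epoch millis
            } else if (v instanceof Number) {
                json.append(v);
            } else {
                json.append("\"").append(v).append("\"");
            }
        }
        return json.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> poi = new LinkedHashMap<>();
        poi.put("name", "Cafe A");
        poi.put("category", 270);
        poi.put("lastModified", new Date(1302000000000L));
        System.out.println(serialize(poi));
        // {"name":"Cafe A","category":270,"lastModified":1302000000000}
    }
}
```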
      • Configuring HBase
      • Configuration is the key
      • Many moving parts – typos, configs out of sync
      • Operating System
      Raise the open-file limit (ulimit -n) to 32K or higher
      Lower vm.swappiness, or set it to 0
      • HDFS
      Adjust block size based on the use case
      Increase xcievers to 2047 (dfs.datanode.max.xcievers)
      Set the socket write timeout to 0 (dfs.datanode.socket.write.timeout)
      • HBase
      Needs more memory
      Don’t run ZooKeeper on DataNodes – use a separate ZK quorum
      No swapping – JVM hates it
      GC pauses could cause timeouts or RS failures (read article posted by Todd Lipcon on avoiding full GC)
      Challenges/Lessons Learned
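The HDFS settings above would typically land in hdfs-site.xml; a fragment using the values from the slide might look like this (tune for your own cluster):

```xml
<!-- hdfs-site.xml: DataNode settings called out above. -->
<property>
  <!-- Note: the key really is spelled "xcievers" in Hadoop. -->
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>  <!-- 0 disables the write timeout -->
</property>
```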
      • HBase
      Write (put) optimization (Ryan Rawson’s HUG8 presentation on HBase importing)
      Control number of store files (hbase.hregion.max.filesize)
      • Security
      • Introduced in CDH3b3 but in flux, need robust RBAC
      • Reliability
      • NameNode is a SPOF
      • HBase is sensitive to RegionServer failures
      Challenges/Lessons Learned
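The write-path knobs mentioned above live in hbase-site.xml; a sketch with illustrative values (not taken from the slide):

```xml
<!-- hbase-site.xml: region size bounds the store files a RegionServer
     carries per region; 1 GB here is illustrative, not a recommendation. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
<property>
  <!-- Larger client write buffer batches more puts per round trip. -->
  <name>hbase.client.write.buffer</name>
  <value>8388608</value>
</property>
```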
      • Initially, we had on-site training classes
      • Cluster configurations were reviewed and recommendations made
      • Tickets resolved within reasonable SLAs
      • Knowledgeable support team
      • Ability to have access to technology experts, if needed
      • With Cloudera support for almost a year now
      • Our needs and demands are increasing; looking toward enterprise support
      Why Cloudera?
      • Better operational tools for using Hadoop and HBase
      • Job management, backup, restore, user provisioning, general administrative tasks, etc.
      • Support for Secondary Indexes
      • Full-text Indexes and Searching (Lucene/Solr integration?)
      • HA support for NameNode
      • Need Data Replication for HA & DR
      • Best practices and supported material
      Features Needed
    • Thank You
      Q & A