Geo-based Content Processing using HBase

Presented at Chicago Data Summit 2011 hosted by Cloudera.


    Presentation Transcript

    • Geo-based Content Processing using HBase
      Ravi Veeramachaneni
      NAVTEQ
      1
    • Agenda
      • Problem
      • Solution
      • Why HBase?
      • Challenges
      • Why we chose Cloudera
      2
    • Problem
      • Ineffective to scale out
      • Cost: expensive Oracle license cost
      • Technology: Inherent limitations of RDBMS
      • Need to support flexible functionality
      • Need to decouple content from the map
      • Need flexibility to quickly add new content providers
      • Need to support community input
      • Need for real-time data availability
      • Content needs to be updated and delivered much faster than before
      • Unable to deliver better, richer contextual content faster and more efficiently
      • Need to support both customers with connected and disconnected devices
      3
    • Our Data is Constantly Growing
      • Content Breadth
      100s of millions of content records
      100s of content suppliers + community input
      • Content Depth
      On average, a content record has 120 attributes
      Certain types of content have more than 400 attributes
      Content classified across 270+ categories
      • Our content is
      Sparse and unstructured
      Provided in multiple data formats
      Ingested, processed and delivered in transactional and batch mode
      Constantly growing (into the high terabytes)
      4
      • Provide a scalable content processing platform to handle spikes in content-processing demand
      • Provide horizontal scalability (Hadoop/HBase)
      • Provide business rules management system to adapt to changing processing needs
      • Flexible business rules based on the supplier and content information
      • Flexible business flows and processing to meet SLA
      • Provide high value and high quality content to customers fast
      • Corroborate multiple sources to produce the best quality information
      • Utilize open source software wherever applicable with commercial support
      5
      Our Approach for Content Processing Challenges
    • 6
      Content Processing Overview
      [Architecture diagram: Suppliers 1..n feed batch and real-time ingestion; source & blended record management assigns Location IDs and Permanent IDs; publishing delivers content real-time and on demand.]
    • 7
      System Decomposition
    • Why HBase?
      • HBase Scales (runs on top of Hadoop)
      • HBase stores null values for free
      • Saves both disk space and disk IO time
      • HBase supports unstructured data through column families
      • HBase has built-in version management
      • HBase provides fast table scans for time ranges and fast key-based lookups (see the sketch after this slide)
      • Map Reduce data input
      • Tables are sorted and have unique keys
      A Reducer is often optional
      A Combiner is not needed
      • Strong community support and wider adoption
      8
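      The fast-lookup and time-range points above can be made concrete with a small sketch against the 0.90-era HBase Java client. This is an illustration only, not NAVTEQ's code: the table name "content", the column family "attrs", and the row key are hypothetical.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class ContentLookup {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "content");          // hypothetical table name

              // Fast key-based lookup: fetch one content record by row key.
              Get get = new Get(Bytes.toBytes("poi-00000001"));    // hypothetical row key
              Result row = table.get(get);
              byte[] name = row.getValue(Bytes.toBytes("attrs"), Bytes.toBytes("name"));
              System.out.println("name = " + Bytes.toString(name));

              // Fast time-range scan: only cells written in the last 24 hours.
              long now = System.currentTimeMillis();
              Scan scan = new Scan();
              scan.addFamily(Bytes.toBytes("attrs"));
              scan.setTimeRange(now - 24L * 60 * 60 * 1000, now);
              ResultScanner scanner = table.getScanner(scan);
              for (Result r : scanner) {
                  System.out.println(Bytes.toString(r.getRow()));
              }
              scanner.close();
              table.close();
          }
      }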
    • HBase @NAVTEQ
      • Started in late 2009 with HBase 0.19.x (Apache)
      • 8-node VMware sandbox cluster
      • Flaky, unstable, region server failures
      • Switched to CDH
      • No Cloudera support
      • Early 2010: HBase 0.20.x (CDH2)
      • 10-node physical sandbox cluster
      • Still had a lot of challenges: region server failures, META corruption
      • Cluster expanded significantly, with multiple environments
      • Signed Cloudera support
      • No official Cloudera support for HBase
      • Late 2010: HBase 0.89 (CDH3b3)
      • More stable than any previous version
      • Multiple teams at NAVTEQ exploring Hadoop/HBase
      • Officially Cloudera-supported version
      • Current: HBase 0.90.1
      • Waiting to move to the official CDH3 release
      • Looking into Hive, Oozie, Lucene/Solr integration, Cloudera Enterprise, and a few others
      • Several initiatives to use HBase
      9
    • HBase @NAVTEQ
      • Hardware / Environment
      • DELL R410
      • 64GB RAM (ECC)
      • 4x2TB (JBOD)
      • RHEL 5.4
      • 500+ CPU cores
      • 200+ TB configured Disk Capacity
      • Multiple environments
      DEVELOPMENT
      INTEGRATION
      STAGING
      PRODUCTION
      10
      • Database/schema design
      • Transition to a column-oriented, flat schema
      • Row-key design/implementation
      • Sequential keys
      Suffer from uneven load distribution (hot-spotting) but make good use of the block cache
      Can be addressed by pre-splitting the regions
      • Randomized keys give better load distribution
      Achieved by hashing the key attributes with SHA-1 (see the sketch after this slide)
      Hurts range scans
      • Too many Column Families
      • Initially we had about 30 or so, now reduced to 8
      • Compression
      • LZO didn’t work out well with CDH2; using default block compression
      • Need to revisit with CDH3
      11
      Challenges/Lessons Learned
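      As a rough illustration of the two row-key strategies above, the sketch below prefixes a natural key with a few bytes of its SHA-1 hash (good distribution, no range scans on the natural key) and pre-splits a table so that even sequential keys do not all land in one region. The table name, column family, and 16-way split are assumptions, not the actual NAVTEQ schema; the API is the 0.90-era client.

      import java.security.MessageDigest;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.HColumnDescriptor;
      import org.apache.hadoop.hbase.HTableDescriptor;
      import org.apache.hadoop.hbase.client.HBaseAdmin;
      import org.apache.hadoop.hbase.util.Bytes;

      public class RowKeyDesign {

          // Prefix the natural key with 4 bytes of its SHA-1 hash so writes spread
          // across regions; range scans over the natural key are sacrificed.
          static byte[] hashedRowKey(String supplierId, String recordId) throws Exception {
              byte[] natural = Bytes.toBytes(supplierId + "|" + recordId);
              byte[] digest = MessageDigest.getInstance("SHA-1").digest(natural);
              return Bytes.add(Bytes.head(digest, 4), natural);
          }

          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HBaseAdmin admin = new HBaseAdmin(conf);

              // Pre-split into 16 regions on the first key byte, so sequential
              // (or hash-prefixed) keys are distributed from the start.
              HTableDescriptor desc = new HTableDescriptor("content");   // hypothetical table
              desc.addFamily(new HColumnDescriptor("attrs"));            // hypothetical family
              byte[][] splits = new byte[15][];
              for (int i = 1; i <= 15; i++) {
                  splits[i - 1] = new byte[] { (byte) (i * 16) };
              }
              admin.createTable(desc, splits);
          }
      }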
      • Serialization
      • Avro didn’t work well (deserialization issues)
      • Developed a configurable serialization mechanism that uses JSON for everything except the Date type (see the sketch after this slide)
      • Secondary Indexes
      • Were using ITHBase and IHBase from contrib; neither worked well
      • Redesigned the schema to avoid the need for an index
      • We still need secondary indexes, though
      • Performance
      • Several tunable parameters
      Hadoop, HBase, OS, JVM, Networking, Hardware
      • Scalability
      • Interfacing with real-time (interactive) systems from a batch-oriented system
      12
      Challenges/Lessons Learned
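      The "JSON except Date" serializer is only described above; a minimal sketch of the idea might look like the following, with Jackson as an assumed JSON library and dates stored as 8-byte epoch millis. The actual NAVTEQ mechanism is not shown in the deck.

      import java.util.Date;
      import com.fasterxml.jackson.databind.ObjectMapper;
      import org.apache.hadoop.hbase.util.Bytes;

      // Sketch of a "JSON for everything except Date" cell serializer.
      public class CellSerializer {
          private static final ObjectMapper MAPPER = new ObjectMapper();

          static byte[] serialize(Object value) throws Exception {
              if (value instanceof Date) {
                  // Dates as raw epoch millis: sortable and free of JSON
                  // date-format / deserialization ambiguity.
                  return Bytes.toBytes(((Date) value).getTime());
              }
              return MAPPER.writeValueAsBytes(value);   // everything else as JSON
          }

          static Object deserialize(byte[] bytes, Class<?> type) throws Exception {
              if (Date.class.isAssignableFrom(type)) {
                  return new Date(Bytes.toLong(bytes));
              }
              return MAPPER.readValue(bytes, type);
          }
      }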
      • Configuring HBase
      • Configuration is key
      • Many moving parts invite typos and out-of-sync settings (a sketch of the HDFS overrides follows this slide)
      • Operating System
      Raise the number of open files (ulimit -n) to 32K or even higher
      Lower vm.swappiness, ideally to 0
      • HDFS
      Adjust block size based on the use case
      Increase xceivers to 2047 (dfs.datanode.max.xceivers)
      Set socket timeout to 0 (dfs.datanode.socket.write.timeout)
      • HBase
      Needs more memory
      Don’t run ZooKeeper on DataNodes; use a separate ZK quorum
      No swapping; the JVM hates it
      GC pauses can cause timeouts or region server failures (see Todd Lipcon’s articles on avoiding full GC pauses)
      13
      Challenges/Lessons Learned
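      The HDFS settings above live in hdfs-site.xml on the DataNodes; the OS-level ones are set outside Hadoop (nofile in /etc/security/limits.conf, vm.swappiness via sysctl). As a small sketch, the snippet below uses Hadoop's Configuration class to emit an hdfs-site.xml-style fragment containing just the overrides from this slide; the values are the slide's, adjust for your own cluster.

      import org.apache.hadoop.conf.Configuration;

      public class HdfsOverrides {
          public static void main(String[] args) throws Exception {
              // Start from an empty Configuration so only our overrides are emitted.
              Configuration conf = new Configuration(false);

              // hdfs-site.xml (DataNode side), values from the slide
              conf.set("dfs.datanode.max.xceivers", "2047");
              conf.set("dfs.datanode.socket.write.timeout", "0");

              // Prints an XML fragment suitable for pasting into hdfs-site.xml.
              conf.writeXml(System.out);
          }
      }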
      • HBase
      Write (put) optimization, per Ryan Rawson’s HUG8 presentation on HBase importing (a client-side batching sketch follows this slide)
      hbase.regionserver.global.memstore.upperLimit=0.3
      hbase.regionserver.global.memstore.lowerLimit=0.15
      hbase.regionserver.handler.count=256
      hbase.hregion.memstore.block.multiplier=8
      hbase.hstore.blockingStoreFiles=25
      Control number of store files (hbase.hregion.max.filesize)
      • Security
      • Introduced in CDH3b3 but in flux, need robust RBAC
      • Reliability
      • The NameNode is a SPOF
      • HBase is sensitive to region server failures
      14
      Challenges/Lessons Learned
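      The memstore, handler, and store-file settings above are server-side values for hbase-site.xml. On the client side, a commonly paired technique in the 0.90-era API for heavy import jobs is to batch puts through the write buffer, sketched below; this is an illustrative assumption, not something shown in the deck, and the table name, family, and sizes are hypothetical.

      import java.util.ArrayList;
      import java.util.List;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.util.Bytes;

      public class BulkWriter {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "content");      // hypothetical table name
              table.setAutoFlush(false);                       // buffer puts client-side
              table.setWriteBufferSize(8 * 1024 * 1024);       // flush roughly every 8 MB

              List<Put> batch = new ArrayList<Put>();
              for (int i = 0; i < 100000; i++) {
                  Put put = new Put(Bytes.toBytes(String.format("poi-%08d", i)));
                  put.add(Bytes.toBytes("attrs"), Bytes.toBytes("name"),
                          Bytes.toBytes("record " + i));
                  batch.add(put);
                  if (batch.size() == 1000) {                  // send in chunks of 1,000
                      table.put(batch);
                      batch.clear();
                  }
              }
              table.put(batch);
              table.flushCommits();                            // push any buffered puts
              table.close();
          }
      }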
      • Initially, we had on-site training classes
      • Cluster configurations were reviewed and recommendations were made
      • Tickets resolved within reasonable SLAs
      • Knowledgeable support team
      • Ability to have access to technology experts, if needed
      • With Cloudera support for almost a year now
      • Our needs and demands are increasing; looking toward enterprise support
      15
      Why Cloudera?
      • Better operational tools for using Hadoop and HBase
      • Job management, backup, restore, user provisioning, general administrative tasks, etc.
      • Support for Secondary Indexes
      • Full-text Indexes and Searching (Lucene/Solr integration?)
      • HA support for the NameNode
      • Need Data Replication for HA & DR
      • Best practices and supported material
      16
      Features Needed
    • 17
      Thank You
      Q & A