HBase At Meetup
Overview of HBase at Meetup

  • Gary,

    When using columns to store your secondary indexes, how do you handle a group with a very old timeline (older than the TTL)? Do you rebuild the secondary index, or do you access the feed items directly?
HBase At Meetup: Presentation Transcript

  • HBase @ Meetup Gary Helmling – Lead SW Engineer
  • The Problem circa Jan 2009
    • Groups doing great things, but how do you find it all?
      • Wait until the next event
      • Click around (a lot)
    • Wanted to show what's happening in groups
      • Discussions, photos, new members, RSVPs, etc.
      • But requires 10 different queries!
  • The Solution
    • Show activity from all your groups in one place
      • real-time updates
      • better discovery of what's going on
      • find new ways to participate and get to know your groups
  • Challenges
    • Normalized schema
      • Each type of activity requires querying a separate table
        • already wasn't scaling at the group level
    • Query efficiency
      • Activity occurs at group level
      • Members can be in hundreds of groups
      • For member home page we need activity from all groups ordered by most recent
        • N subqueries by group ID merged back by descending timestamp
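The "N subqueries merged back by descending timestamp" step can be pictured as a k-way merge. The sketch below is only an illustration of that cost, not Meetup's code: the `Activity` fields and the `mergeNewestFirst` name are hypothetical, and each per-group list is assumed to arrive already sorted newest first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class FeedMerge {
    /** One activity row; the fields are illustrative, not Meetup's schema. */
    static final class Activity {
        final long groupId;
        final long timestamp;
        final String type;
        Activity(long groupId, long timestamp, String type) {
            this.groupId = groupId;
            this.timestamp = timestamp;
            this.type = type;
        }
    }

    /** Cursor over one group's result list (assumed sorted newest first). */
    private static final class Cursor {
        final List<Activity> rows;
        int pos = 0;
        Cursor(List<Activity> rows) { this.rows = rows; }
        Activity head() { return rows.get(pos); }
    }

    /** Merges the per-group lists into a single newest-first page of at most `limit` items. */
    static List<Activity> mergeNewestFirst(List<List<Activity>> perGroup, int limit) {
        // Max-heap on the timestamp of each cursor's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingLong((Cursor c) -> c.head().timestamp).reversed());
        for (List<Activity> rows : perGroup) {
            if (!rows.isEmpty()) heap.add(new Cursor(rows));
        }
        List<Activity> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor c = heap.poll();
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.rows.size()) heap.add(c); // re-insert with its next head
        }
        return merged;
    }
}
```

With hundreds of groups per member, every page load pays for one query per group before this merge can even start, which is the scaling problem the slide describes.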
  • Options
    • De-normalize MySQL
      • Stuff different activity types into a common table (with different fields for different types of activity)
      • Duplicate entity data (or we're still doing N queries)
      • Start to lose a lot of the benefits of RDBMS
      • Query efficiency still a problem
      • Single system scaling limit
    • Something new
      • the Cloud
        • Google App Engine
        • Amazon SimpleDB
      • Hadoop/HBase
      • CouchDB
      • MongoDB
      • Voldemort
      • Cassandra
  • Why HBase?
    • We own infrastructure, no usage limits
    • Data model
      • Semi-structured data in HBase (easily handles multiple types in same table)
      • Time-series ordered
      • Scaling is built in (just add more servers)
      • But extra indexing is DIY
    • Very active developer community
    • Established, mature project (in relative terms!)
    • Matches our own toolset (java/linux based)
  • What is HBase?
    • Clone of Google's BigTable
      • Distributed (automatic partitioning)
      • Column-oriented
      • Semi-structured (columns can be added just by inserting)
      • Built-in versioning
    • Not an RDBMS
      • No joins
      • No SQL
      • Data usually not normalized
      • Transactions & built-in secondary indexes available (as contrib) but immature
    • Need to think differently about how you structure data
      • Denormalize your data where necessary
      • Structure data & row keys around common access
  • What is HBase? Data Storage
    • Table
      • Regions, defined by row [start key, end key)
        • Store, 1 per family
          • 1+ Store Files (HFile format on HDFS)
    • (table, rowkey, family, column, timestamp) = value
    • Everything is byte[]
    • Rows are ordered sequentially by key
    • Special tables: -ROOT-, .META.
      • Tell clients where to find user data
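The coordinate scheme `(table, rowkey, family, column, timestamp) = value` can be pictured as nested sorted maps. The following is a toy in-memory model of one table, not the HBase API: the `CellModel` class is hypothetical, and it uses `String` keys for readability where HBase stores `byte[]`.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CellModel {
    // rowkey -> "family:column" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new TreeMap<>();

    /** Stores one version of a cell; older versions of the same cell are kept. */
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String getLatest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    /** First row key in sorted order, showing that rows are ordered by key. */
    String firstRowKey() {
        return rows.isEmpty() ? null : rows.firstKey();
    }
}
```

Adding a new column is just another `put` with a new column name, which is the "semi-structured" property mentioned earlier.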
  • HBase Architecture (diagram courtesy of Lars George; see http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html for more detail)
  • What is HBase? Data Access
    • Random access (Gets)
      • by rowkey only
    • Sequential reads (Scans)
      • starting row key
      • where you stop is as important as where you start
        • ending row key (optional)
        • server-side filter (optional)
    • Writes (Puts)
      • No insert vs. update distinction
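The three access patterns above can be mimicked with a sorted map. This is a stand-in to show the semantics only, not the HBase client: `SortedRowsSketch` is a hypothetical name, and `String` ordering matches byte ordering here only because the keys are plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class SortedRowsSketch {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    /** Put: writes the row whether or not it already exists (no insert vs. update). */
    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /** Get: random access by exact row key only. */
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    /** Scan: values in key order from startKey (inclusive) up to stopKey (exclusive). */
    List<String> scan(String startKey, String stopKey) {
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).values());
    }
}
```

For example, scanning from `"ch1-"` to `"ch1."` covers exactly the rows whose keys start with `ch1-`, because `.` is the next character after `-` in ASCII. Without a stop key or filter the scan would run on into the rows of the next group.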
  • How It Works: Storing activity data in HBase
    • FeedItem : stores activity data for all types
      • keyed by group and descending timestamp
        • ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>
      • each row only contains data for that type
      • example rows (families info: and content:)
        • Row key: ch1261585-ts9223...
          • info: item_type = chapter_greeting; target_greeting = 8104438
          • content: greeting = “Hi, Gary”
        • Row key: ch1261585-ts9223...
          • info: item_type = new_discussion; target_forum = 847743; target_thread = 7369603
          • content: title = “Improvements”; body = “When a discussion is created...”
    • MemberFeedIndex : index of FeedItem rows from all of a member's groups
      • one row per member (keyed by member ID)
      • columns store refs to FeedItem row keys for that member's groups
      • TTL of 2 months expires old index values
      • example row (family item:)
        • Row key: 4679998
          • item: ch176399-ts9223370788400750807-mem-10044424 = new_member
          • item: ch1261585-ts9223370787431124807-ptag-8525047 = photo_tag
          • ...
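A minimal sketch of how a FeedItem row key in the format above could be built. The `FeedItemKeys` class and `rowKey` method are hypothetical names, and the millisecond timestamp in the usage note is back-computed from the key shown in the example, not taken from the slides.

```java
final class FeedItemKeys {
    /** Builds ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>. */
    static String rowKey(long chapterId, long timestampMillis, String type, long entityId) {
        // Subtracting from Long.MAX_VALUE inverts the order: newer items get
        // smaller numbers and therefore sort first within the group.
        long inverted = Long.MAX_VALUE - timestampMillis;
        return "ch" + chapterId + "-ts" + inverted + "-" + type + "-" + entityId;
    }
}
```

`FeedItemKeys.rowKey(176399, 1248454025000L, "mem", 10044424)` yields `ch176399-ts9223370788400750807-mem-10044424`. For millisecond timestamps the inverted value is always 19 digits long, so comparing the keys as strings agrees with comparing them as numbers, and no extra padding is needed for that component.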
  • How It Works: MemberFeedIndex
    • Steps in displaying member home page feed
      • lookup member record in MemberFeedIndex by ID
      • grab the X most recent columns & values
        • use a time range for paging (older pages start with an earlier start time)
      • get each row from FeedItem (using the column key as the row key)
        • N gets, where N is number of items to display
      • populate some basic info about members and aggregate the results
        • still query MySQL for core entity info (member, group, event)
  • How It Works: Secondary index tables
    • Still need to find rows by column values
      • tried “tableindexed” contrib (0.19 release), high CPU usage & contention on scans
      • decided to update to 0.20 release for other performance improvements
      • built secondary indexing into app layer
    • Separate table per indexed column
      • FeedItem info:actor_member indexed by FeedItem-by_actor_member
      • Index table rows keyed by column value and descending timestamp
        • <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>
      • Zero pad numeric values (or big-endian representation) for correct byte ordering
  • How It Works: Secondary index tables, ex. FeedItem-by_actor_member
    • Index table FeedItem-by_actor_member (families info: and __idx__:)
      • Row key: 0002851766-9223370783553935005-rowkey
        • info: actor_member = 2851766; item_type = new_rsvp; pub_date =
        • __idx__: row = ch1143475-ts9223370783553935005-rsvp-54704795
      • Row key: 0004679998-9223370783650851832-rowkey
        • info: actor_member = 4679998; item_type = new_discussion; pub_date =
        • __idx__: row = ch1261585-ts9223370783650851832-disc-7369603
    • indexes FeedItem (families info: and content:)
      • Row key: ch1143475-ts9223370783553935005-rsvp-54704795
        • info: actor_member = 2851766; item_type = new_rsvp; pub_date =
        • content: comment = “See you there”
      • Row key: ch1261585-ts9223370783650851832-disc-7369603
        • info: actor_member = 4679998; item_type = new_discussion; pub_date =
        • content: title = “Next month”; body = “...”
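A small sketch of the index row key format and of why zero padding matters. The `IndexKeys` class is hypothetical, and the 10-digit width is an assumption read off the example keys, not a documented Beeno setting.

```java
import java.util.Locale;

final class IndexKeys {
    /** Builds <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>, zero padding the value. */
    static String indexKey(long columnValue, long timestampMillis, String origRowKey) {
        // %010d pads to 10 digits so that string (byte) order matches numeric order.
        return String.format(Locale.ROOT, "%010d-%d-%s",
            columnValue, Long.MAX_VALUE - timestampMillis, origRowKey);
    }
}
```

Without padding, the value `10` would sort before `9` because keys are compared byte by byte; with padding, `0000000009` sorts before `0000000010`. This sketch assumes non-negative values that fit in the chosen width.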
  • Interacting with HBase: Meetup.Beeno
    • Java Beans mapped to HBase tables

      package com.meetup.feeds.db;
      ...
      @HEntity(name="FeedItem")
      public class FeedItem implements Externalizable {
          ...
          @HRowKey
          public String getId() { return this.id; }
          public void setId(String id) { this.id = id; }

          @HProperty(family="info", name="actor_member",
                     indexes = { @HIndex(date_col="info:pub_date", date_invert=true,
                                         extra_cols={"info:item_type"}) })
          public Integer getMemberId() { return this.memberId; }
          public void setMemberId(Integer id) { this.memberId = id; }
          ...
      }
  • Interacting with HBase: Services
    • Base service class provides round-tripping based on annotations

      public class EntityService<T> {
          public T get( String rowKey ) throws HBaseException {…}
          public void save( T entity ) throws HBaseException {…}
          public void saveAll( List<T> entities ) throws HBaseException {…}
          public void delete( String rowKey ) throws HBaseException {…}
          public Query<T> query() throws MappingException {…}
      }

    • Easily extended for specific needs
    • Almost all HBase interaction through service instances
  • Interacting with HBase: Queries
    • Simple Query API uses mappings and secondary index tables
    • Find all items related to a discussion

      FeedItemService service = new FeedItemService(DiscussionItem.class);
      Query query = service.query()
          .using( Criteria.eq("threadId", threadId) );
      List items = query.execute();

    • Find all greetings from a given member

      FeedItemService service = new FeedItemService(GreetingItem.class);
      Query query = service.query()
          .using( Criteria.eq("memberId", memberId) )
          .where( Criteria.eq("type", FeedItem.ItemType.CHAPTER_GREETING) );
      List items = query.execute();
  • Interacting with HBase: Member Feed Retrieval
    • Get latest activity from all a member's groups using MemberFeedIndex

      // retrieve the member's index record
      HTable mfiTable = HUtil.getTable("MemberFeedIndex");
      Get get = new Get( Bytes.toBytes(String.valueOf(memberId)) );
      get.addFamily( Bytes.toBytes("item") );
      Result r = mfiTable.get(get);

      FeedItemService service = new FeedItemService();
      Set<IndexKey> sortedKeys = sortKeys(r);
      List<FeedItem> items = new ArrayList<FeedItem>();

      // for each index col get the entity record
      for (IndexKey key : sortedKeys) {
          FeedItem item = service.get(key.getKey());
          if (item != null)
              items.add(item);
      }

      // populate member and chapter info
      …
  • HBase @ Meetup: Issues along the way
    • Performance testing
      • Product targets 3 of our highest-traffic pages; simulating load is hard
      • Started with load scripts
      • Moved to testing with live traffic
        • Use AJAX calls to simulate requests
        • Selective enable for X% of traffic
      • Launched data collection/write traffic first
        • Allowed tweaking configuration before impacting user experience
  • HBase @ Meetup: Issues along the way (continued)
    • High CPU / Concurrency issues
      • Updated to 0.20 release for performance gains across the board
      • Replaced “tableindexed” usage with application level secondary indexing
    • “Hot regions” - profile page hits small table every page load
      • Force split table to distribute across multiple servers
      • “Newest” region still handling high load
        • changed index keying to <value % 100>-<value>-<timestamp> for even distribution
    • Heavy I/O load / MemberFeedIndex table growing
      • Lowered MemberFeedIndex time-to-live to 2 months
      • Enabled LZO compression
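The re-keying to `<value % 100>-<value>-<timestamp>` mentioned above spreads consecutive values across 100 key prefixes, so new writes no longer all land in the "newest" region. A minimal sketch under stated assumptions: the `SaltedKeys` class, the two-digit prefix width and the 10-digit value padding are illustrative choices, and the slides do not say whether the timestamp component is inverted, so it is passed through unchanged.

```java
import java.util.Locale;

final class SaltedKeys {
    /** Builds <value % 100>-<value>-<timestamp> with fixed-width prefix and value. */
    static String saltedKey(long value, long timestampComponent) {
        // value % 100 acts as a bucket prefix; assumes value is non-negative.
        return String.format(Locale.ROOT, "%02d-%010d-%d",
            value % 100, value, timestampComponent);
    }
}
```

Because the prefix is derived from the value itself, a lookup for a known value can recompute it and still go straight to the right key range. The trade-off is that reading all values in order would need one scan per prefix.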
  • HBase @ Meetup: Current Status
    • Live traffic growing
      • Cluster handling ~2.5k – 3k requests/sec
      • 50+% still write traffic
      • ~17% of page views hit HBase (for reads)
      • Expanding to 30% of page views in coming months
    • Meetup.Beeno now open source on GitHub:
      • http://github.com/ghelmling/meetup.beeno
    • Next up
      • Continue tweaking
      • Site analytics