MyLife with HBase or HBase three flavors

Description:

HBase is a NoSQL column store. What does that mean functionally to a software developer?

-A conceptual view of HBase
-How to use HBase
-What features HBase has
-Benefits of HBase

How are we using HBase here at MyLife? I will describe three MyLife projects I was or am involved with that currently use HBase in production.

-Email content storage
-Connection-Identity mappings
-User stream cache backing

Each of these projects uses HBase in a different way.

Published in Technology
  • I could talk about HBase operationally.
  • HBase vs other data stores
  • My personal mantra
  • Said by someone far more quotable
  • Like an RDBMS or a file, it stores data, but in a different way
  • Again from a functional POV
  • That’s it. Remember that. The rest of the terminology just tells you where you are in that nest of maps.
  • Before we get too far: since HBase stores data in maps, let’s take a brief step back and let me describe a map as quickly as I can, since it is fundamental to HBase.
    A map is a way of storing data so it can be retrieved by a key. This is a concept most people are familiar with, like finding a co-worker’s extension in the company directory. Here we have a key (last name, first name, middle initial) which MAPS to the extension. BTW, we are going to talk about keys A LOT!
  • To be precise, it would be a true map of maps only if that other book also listed other books, but you get the idea. BTW, that is a real example; you can look it up. Also, I think you all just passed CS 201.
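    The two ideas above, a map and a map of maps, can be sketched in a few lines of Python. This is only an illustration: the second directory entry is made up, and the book fields are placeholders rather than the real record for that ISBN.

```python
# A map: look up a value by its key, like a company phone directory.
# Key is (last name, first name, middle initial); value is the extension.
directory = {
    ("Example", "Dude", "X"): "x555",
    ("Sample", "Jane", "Q"): "x556",  # hypothetical second entry
}

extension = directory[("Example", "Dude", "X")]

# A map of maps: the value stored under a key is itself a map.
# The ISBN is the one on the slide; the field values are placeholders.
catalog = {
    "0786704810": {
        "author": "<author>",
        "title": "<title>",
        "publisher": "<publisher>",
        "year": "<year>",
    },
}

# Two lookups: first by ISBN, then by field name inside the inner map.
title = catalog["0786704810"]["title"]
```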
  • SOOO… back to HBase.
    This is HBase in a nutshell.
    To quote from the HBase documentation: “All other map returning methods make use of this map internally.”
    So even the documentation says this is pretty much all there is to it functionally. But this structure allows for some very cool things.
    I get some freebies here: quick lookups by key or key prefix (more on that later), flexibility, and versioning. These are things we have used and will be looking at in our implementations later.
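    A rough Python model of that nested map shows where the freebies come from. It is a sketch using plain dicts, not the HBase client API; the function names, row keys, and values are all invented for illustration.

```python
from bisect import bisect_left

# table[row_key][column_family][qualifier][timestamp] = value
table = {}

def put(row_key, family, qualifier, timestamp, value):
    """Store one cell version, creating the intermediate maps as needed."""
    versions = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(qualifier, {})
    versions[timestamp] = value

def get_latest(row_key, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]

def scan_prefix(prefix):
    """Return the row keys that start with prefix, in sorted order."""
    keys = sorted(table)
    start = bisect_left(keys, prefix)  # jump to the first key >= prefix
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing later can match
        matches.append(key)
    return matches

put("smith_anna", "CONTENT", "firstname", 1, "Anna")
put("smith_anna", "CONTENT", "firstname", 2, "Anne")  # newer version of the same cell
put("smith_bob", "CONTENT", "firstname", 1, "Bob")
put("jones_carl", "CONTENT", "firstname", 1, "Carl")
```

    Because the row keys are kept sorted, a prefix lookup only has to find the starting point and read forward until the prefix stops matching.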
  • From this map structure we get flexibility, since we can add or remove items from the map, with one caveat: column families are fixed. “The Map” in HBase terms is the table.
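    The fixed-families caveat can be shown with a toy table class. This is a minimal sketch, not HBase code; the class, the `nickname` qualifier, and the `EXTRA` family are hypothetical.

```python
class Table:
    """Toy table: column families are fixed at creation, qualifiers are not."""

    def __init__(self, name, families):
        self.name = name
        self.families = frozenset(families)  # cannot change after creation
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def delete(self, row_key, family, qualifier):
        self.rows.get(row_key, {}).get(family, {}).pop(qualifier, None)

presentation = Table("PRESENTATION_TABLE", ["CONTENT"])
presentation.put("1", "CONTENT", "firstname", "mike")
presentation.put("1", "CONTENT", "nickname", "mikey")  # new qualifier, no schema change
presentation.delete("1", "CONTENT", "nickname")        # and it can go away again

try:
    presentation.put("1", "EXTRA", "firstname", "mike")  # family was never declared
    rejected = False
except KeyError:
    rejected = True
```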
  • /hbase-identity-secondary-index-migrator/src/test/java/PresentationUnitTest.java
    HBase shell:
    create 'PRESENTATION_TABLE', {NAME => 'CONTENT', REPLICATION_SCOPE => '0', VERSIONS => '1'}
    put 'PRESENTATION_TABLE', '1', 'CONTENT:firstname', 'mike'
    scan 'PRESENTATION_TABLE'
  • We currently have three production solutions implemented using HBase:
    The test case, our first production use of HBase
    The ideal case, which is an almost perfect match for HBase
    And finally the awesome case, where we added HBase to something great to make it awesome
  • This was (to say the least) not ideal for several reasons, including cost and scalability
  • accountId is the mylife.com account id
    providerAccountId is the id we give to the relation between a MyLife account and a provider account, i.e. this user’s Gmail account
    messageId is the unique id each email message is given
    bodyId is a reverse timestamp given to each body (HTML or text)
  • Like our example of a company directory, where you can easily find everyone with the same last name
  • Our first use of HBase was very straightforward, but it works! And it works well.
  • This implementation is faster and cheaper, and it saves precious DB resources for where they are needed most: things that need query and transaction capability
  • What we call an Identity here is really a person. One person probably has many social profiles, like a LinkedIn and a Facebook profile.
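    A sketch of how such a composite key could be built and scanned by prefix. The reverse-timestamp formula (a 64-bit maximum minus the time in milliseconds, zero-padded) is a common convention and an assumption here, not something the slides specify; the ids are made up.

```python
MAX_TS = 2**63 - 1  # assumed ceiling for the reverse timestamp (19 digits)

def reverse_timestamp(millis):
    """Newer times give smaller numbers; zero-padding keeps string order numeric."""
    return str(MAX_TS - millis).zfill(19)

def row_key(account_id, provider_account_id, message_id, body_millis):
    """accountId_providerAccountId_messageId_bodyId, as on the slide."""
    return "_".join([account_id, provider_account_id, message_id,
                     reverse_timestamp(body_millis)])

def scan_prefix(sorted_keys, prefix):
    """Linear filter for brevity; a sorted store can seek straight to the prefix."""
    return [key for key in sorted_keys if key.startswith(prefix)]

keys = sorted([
    row_key("acct1", "prov1", "msg1", 1000),
    row_key("acct1", "prov1", "msg1", 2000),
    row_key("acct2", "prov9", "msg7", 1500),
])

# The trailing underscore stops "acct1" from also matching "acct10".
acct1_keys = scan_prefix(keys, "acct1_")
```

    With a reverse timestamp as the last key part, the newest body for a message sorts first within that message's prefix.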
  • What is sparse data? That is when the record you store is mostly empty fields. The contact page in your phone has a name and phone number but probably not much else, even though there is a place for home address, company name, birthday, anniversary and a bunch of other stuff. That is sparse data.
    Remember I said HBase is flexible? Well, this is how you use that flexibility.
    Social profiles are similarly sparse. There is a lot of potential data in a social profile, but usually only a few items will be there, and the potential fields vary from provider to provider.
    For example, first name is almost always in a social profile but middle name probably is not. HBase is great for this since it only stores the data that is there: no wasting space storing empty cells or time transferring them over the network. It also allows us to store fields for different social providers together, or add new fields as we add providers, without having to update the storage, just the code that needs the data.
    The “only accessed by key” bit is also important, but we have already covered that.
  • Exciting, no? It’s all fitting together. So we know about key-value pairs, but what is the reverse part about?
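    As a small illustration of storing only what is there, the sketch below drops empty fields before a profile would be written. The field names and the helper are hypothetical.

```python
def to_cells(profile):
    """Keep only the fields that actually carry a value."""
    return {field: value for field, value in profile.items()
            if value not in (None, "")}

# A profile as it might arrive from a provider: mostly empty.
raw_profile = {
    "first_name": "Dude",
    "middle_name": None,
    "last_name": "Example",
    "birthday": "",
    "company": None,
    "home_address": None,
}

cells = to_cells(raw_profile)  # only these would be stored and sent over the network
```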
  • Time for another data structures interlude!
    Last time we had this.
  • The reverse index is simply the same data REVERSED!
    So when you get a call from an extension you don’t know, you can look up the name it belongs to.
    This has been another data structures interlude!
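    In Python terms, the reverse index is the same dictionary with keys and values swapped. A minimal sketch, assuming every extension belongs to exactly one person; the second entry is made up.

```python
directory = {
    "Example,Dude,X": "x555",
    "Sample,Jane,Q": "x556",  # hypothetical second entry
}

# Swap keys and values to get the reverse index.
reverse_directory = {extension: name for name, extension in directory.items()}

caller = reverse_directory["x555"]
```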
  • A simplified data flow for social connections.
    In step 2 we are using HBase’s versioning to keep versions of the social profile, so we get a history of changes.
    Step 4 is where we build our reverse index, so we can find the identity of a social profile.
    So how did we implement number 4 and make this part of the ideal HBase use case?
  • A coprocessor is an HBase feature we have not touched on till now (have to save a few surprises).
    In our case we built a coprocessor to update the profile record when we are associating it with its identity.
    This has several advantages:
    The reverse index is built at the same time as the primary index
    The reverse index gets created no matter the source of the put
    Any application can rely on the primary and reverse indexes always existing together
  • I mentioned message streams briefly in our first case; here is another part of that same system that uses HBase. Once we have a user’s provider streams and have homogenized them, we need them available to build the user’s personal aggregated stream.
  • Persistence – in this case we are using another HBase feature, TTLs, so that streams that have not been updated in 4 weeks get removed automatically.
    Speed – read-throughs (times when a stream is gone from the in-memory cache and has to be fetched from HBase) are basically as fast as the network, since we are getting by key.
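    The trigger-like behaviour can be imitated with a hook that runs on every put. This is a conceptual sketch only: real HBase coprocessors are written in Java and run on the cluster, and none of the names below come from the HBase API or from our code.

```python
class Table:
    """Toy table that runs registered hooks after every put."""

    def __init__(self):
        self.rows = {}
        self.post_put_hooks = []

    def put(self, row_key, values):
        self.rows.setdefault(row_key, {}).update(values)
        for hook in self.post_put_hooks:
            hook(row_key, values)  # runs no matter who issued the put

def make_reverse_index_hook(profiles):
    """On each identityId_providerId row, record the identity on the profile row."""
    def hook(row_key, values):
        identity_id, provider_id = row_key.split("_", 1)
        profiles.rows.setdefault(provider_id, {})["identityId"] = identity_id
    return hook

profiles = Table()   # keyed by providerId
mappings = Table()   # keyed by identityId_providerId
mappings.post_put_hooks.append(make_reverse_index_hook(profiles))

profiles.put("linkedin:42", {"firstname": "Dude"})          # hypothetical ids
mappings.put("identity7_linkedin:42", {"linked": "true"})   # hook updates the profile
```

    Because the hook is attached to the table rather than to one application, the primary row and the reverse entry are always written together.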
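    A minimal sketch of a cache with a TTL-backed store behind it. The class names are hypothetical, the store is a plain dict standing in for HBase, and for simplicity every cache write goes straight to the store, whereas the real system writes periodically and before eviction.

```python
class BackingStore:
    """Stand-in for the HBase table: rows expire once older than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.rows = {}  # key -> (written_at, value)

    def put(self, key, value, now):
        self.rows[key] = (now, value)

    def get(self, key, now):
        entry = self.rows.get(key)
        if entry is None:
            return None
        written_at, value = entry
        if now - written_at > self.ttl_seconds:
            del self.rows[key]  # expired, behave as if it was never there
            return None
        return value

class StreamCache:
    """In-memory cache that falls back to the backing store on a miss."""

    def __init__(self, store):
        self.store = store
        self.memory = {}

    def put(self, key, value, now):
        self.memory[key] = value
        self.store.put(key, value, now)  # simplification: write through on every put

    def evict(self, key):
        self.memory.pop(key, None)

    def get(self, key, now):
        if key in self.memory:
            return self.memory[key]
        value = self.store.get(key, now)  # read-through on a cache miss
        if value is not None:
            self.memory[key] = value
        return value

FOUR_WEEKS = 4 * 7 * 24 * 3600  # seconds

cache = StreamCache(BackingStore(FOUR_WEEKS))
cache.put("user1", ["post-a"], now=0)
cache.evict("user1")
after_eviction = cache.get("user1", now=3600)             # served by the store
cache.evict("user1")
after_expiry = cache.get("user1", now=FOUR_WEEKS + 3601)  # TTL has passed
```

    The application only ever talks to the cache; whether a stream came from memory or from the store is invisible to it.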

Transcript

  • 1. MyLife with HBase OR HBase three flavors
  • 2. HBase: In brief I could talk about… Operational HBase
  • 3. HBase: In brief I could talk about… ZooKeeper quorums Source: aazk.org
  • 4. HBase: In brief I could talk about… Compaction Source: www.wasteprousa.com
  • 5. HBase: In brief I could talk about… How HBase is Implemented HDFS Blocks Regions META table Etc…
  • 6. HBase: In brief I could talk about… HBase VS Cassandra Redis MySQL Etc…
  • 7. HBase: In brief However none of those are my primary view as a developer. As a developer I want to talk about what HBase can do for me. How it can make MyLife (pun intended) easier.
  • 8. HBase: In brief “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.”
  • 9. HBase: In brief “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” –Bill Gates
  • 10. HBase: In brief So what does HBase do for me the developer? TL;DR IT STORES DATA!
  • 11. HBase: In brief How does HBase store data?
  • 12. HBase: In brief As a Map
  • 13. HBase: In brief As a Map Of Maps
  • 14. HBase: In brief As a Map Of Maps Of Maps
  • 15. HBase: In brief As a Map Of Maps Of Maps Of Maps
  • 16. A Data Structures Interlude Key == Last Name, First Name, Middle Initial Value == Extension I.e. Example,Dude,X -> x555
  • 17. A Data Structures Interlude So now that we know what a map is, what would a map of maps look like? An HBase-like analogy.
  • 18. A Data Structures Interlude An analogy to HBase (a dated analogy; if someone can think of a current one, please let me know) is an index file in a library by ISBN. You look up a book by ISBN. The ISBN is your key. The value in this case is a book that contains a list of books! Key == ISBN Value == Book that lists other books! 0786704810 Author, Title, Publisher, Year
  • 19. HBase: In brief SortedMap[RowKey, SortedMap[ColumnFamilyName, SortedMap[Qualifier, SortedMap[Timestamp,Value]]]]
  • 20. HBase: In brief Some quick facts: Column families are defined ahead of time and require the table to be disabled to be altered. Only column families are fixed. Everything under that level of maps is flexible. Qualifiers can be added or removed on the fly, along with their versions. “The Map” itself is also defined ahead of time.
  • 21. HBase: In brief What does this look like? DEMO TIME!
  • 22. HBase: Implementations The Test Case The Ideal Case The Awesome Case
  • 23. HBase: The Test Case One of the services we provide to our users is a message stream. This stream can include email, in which case it works like an email client (e.g. Outlook, Mail.app, or the one on your phone), storing your email messages so you can get them quickly. We found ourselves storing hundreds of gigabytes of email content in our Oracle RAC database.
  • 24. HBase: The Test Case Since this data is only accessed by key it made sense to move out of Oracle and into HBase.
  • 25. HBase: The Test Case Key == accountId_providerAccountId_messageId_bodyId
  • 26. HBase: The Test Case Key == accountId_providerAccountId_messageId_bodyId This is a nice key because all the messages for a particular user are together by prefix. Since HBase keeps the keys sorted, we can use a Scan to grab them all quickly at one time.
  • 27. HBase: The Test Case That’s it!
  • 28. HBase: The Test Case Advantages vs Previous solution: Faster Cheaper Less DB load
  • 29. HBase: The ideal case Another service we offer our users is the ability to import their social and email connections so they can have one unified view of all their connections across providers. Allowing users to manage data by person rather than by account.
  • 30. HBase: The ideal case This has two main pieces of data: 1.The social profile information 2.The relationship between that profile and an Identity
  • 31. HBase: The ideal case What makes this ideal for HBase? 1. The profile is sparse data that is only accessed by key!
  • 32. HBase: The ideal case What makes this ideal for HBase? 2. The relationship between a profile and its identity is only a key-value pair and its reverse!
  • 33. A Data Structures Interlude Key == Last Name, First Name, Middle Initial Value == Extension I.e. Example,Dude,X -> x555
  • 34. A Data Structures Interlude Key == Extension Value == Last Name, First Name, Middle Initial I.e. x555 -> Example,Dude,X
  • 35. HBase: The ideal case Dataflow
    1. Get profile from provider
    2. Check if the profile maps to an existing Identity in HBase
       1. If it doesn’t exist, store a version of the profile in HBase with providerId as key and profile information as values
    3. Associate profile with identity
       1. Create row in HBase with identityId_providerId as key
    4. Update profile with the identity it is associated with
  • 36. HBase: The ideal case Coprocessors! What are Coprocessors? Another feature of HBase, which works like triggers. A coprocessor is a piece of logic attached to an HBase put that is executed on the HBase cluster.
  • 37. HBase: The Awesome Case User stream availability
  • 38. HBase: The Awesome Case Originally this system used local caching to store user stream data, but as the streams grew this became impractical. The solution here was a distributed cache. Great!
  • 39. HBase: The Awesome Case A distributed cache allows us to scale, but unless we have a huge grid some user streams will still get evicted from the cache. That means when the user visits again we have to fetch their streams from the source, which is slow…
  • 40. HBase: The Awesome Case Enter HBase: from great to awesome! To fix the latency associated with eviction we added HBase as a backing store to our distributed cache. This means that records in our cache are periodically written to HBase and are written to HBase before being evicted from the cache.
  • 41. HBase: The Awesome Case Distributed cache + HBase == Awesome! Why? Persistence – user streams now live in HBase for as long as we want them to. Speed – read-throughs from HBase are fast. Transparency – as far as the application is concerned, everything is just in the cache.
  • 42. HBase: The Awesome Case Distributed cache + HBase == Awesome! Why? Reliability – HBase has been solid and all the data is stored redundantly.
  • 43. That’s all folks! Questions?