HBase at Mendeley
Dan Harvey
Data Mining Engineer
dan.harvey@mendeley.com

The details behind how and why we use HBase in the data mining team at Mendeley.
Overview
➔ What is Mendeley
➔ Why we chose HBase
➔ How we're using HBase
➔ Challenges
Mendeley helps researchers work smarter
➔ Install Mendeley Desktop
➔ Mendeley extracts research data..
➔ ..and aggregates research data in the cloud
Mendeley in numbers
➔ 600,000+ users
➔ 50+ million user documents
➔ Since January 2009
➔ 30 million unique documents
➔ De-duplicated from user and other imports
➔ 5TB of papers
Data Mining Team
➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication
➔ Statistics
➔ Related and recommended research
➔ Search
Starting off
➔ User data in MySQL
➔ Normalised document tables
➔ Quite a few joins..
➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages
But..
➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affects everything
➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...
What we needed
➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references
➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective
So much choice..
Lots of datastores to pick from, and many more besides...
But they mostly miss out on good scalable processing.
HBase and Hadoop
➔ Scalable storage
➔ Scalable processing
➔ Designed to work with map reduce
➔ Fast scans
➔ Incremental updates
➔ Flexible schema
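To make the last two points concrete, here is a minimal sketch against the HBase client API of the time. The table, family and row key are borrowed from the schema later in these slides, and the cell values are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAndUpdate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "canonical_documents");

    // Incremental update: rewrite a single cell without touching the rest of the row
    Put put = new Put(Bytes.toBytes("sha1-of-some-paper"));
    put.add(Bytes.toBytes("metadata"), Bytes.toBytes("source"), Bytes.toBytes("crawler"));
    table.put(put);

    // Fast scan: stream every row's metadata family back in row-key order
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("metadata"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        byte[] source = row.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("source"));
        // ... process each document here ...
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}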
Where HBase fits in
How we store data
➔ Mostly documents
➔ Column Families for different data
➔ Metadata / raw pdf files
➔ More efficient scans
➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation
Example Schema

Row         Column family   Qualifier
sha1_hash   metadata        document
                            date_added
                            date_modified
                            source
            content         pdf
                            full_text
                            entity_extraction
                            canonical_id
                            version_live

● All data for documents in one table
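A hedged sketch of creating a table with this shape and writing one document row from Java. The row key and cell values below are stand-ins; in reality metadata:document holds the protocol-buffer-serialised document.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateDocumentTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // One table, two column families: small metadata cells in one, large
    // blobs (pdf, full text) in the other, so metadata scans never have
    // to read past the big values.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("canonical_documents");
    desc.addFamily(new HColumnDescriptor("metadata"));
    desc.addFamily(new HColumnDescriptor("content"));
    admin.createTable(desc);

    // Placeholder values; the real document cell would come from the
    // generated protobuf class, e.g. doc.toByteArray().
    byte[] rowKey = Bytes.toBytes("da39a3ee5e6b4b0d3255bfef95601890afd80709");
    byte[] protoDoc = Bytes.toBytes("serialised-protobuf-document");
    byte[] pdfBytes = Bytes.toBytes("raw-pdf-bytes");

    Put put = new Put(rowKey);  // row key is the sha1 hash of the paper
    put.add(Bytes.toBytes("metadata"), Bytes.toBytes("document"), protoDoc);
    put.add(Bytes.toBytes("content"), Bytes.toBytes("pdf"), pdfBytes);

    HTable table = new HTable(conf, "canonical_documents");
    table.put(put);
    table.close();
  }
}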
How we process data
➔ Java Map Reduce
➔ More control over data flows
➔ Allows us to do more complex work (sketched after this list)
➔ Pig
➔ Don't have to think in map reduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than using java map reduce
➔ Quick example...
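Before the Pig example, a minimal sketch of the Java map reduce route: a TableMapper that scans the document table and counts documents per source. The job wiring is illustrative, not the code we actually run.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class DocumentsPerSource {
  // Emit one (source, 1) pair for every document row scanned
  static class SourceMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] source = value.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("source"));
      if (source != null) {
        context.write(new Text(source), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "documents-per-source");
    job.setJarByClass(DocumentsPerSource.class);

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("metadata"));
    scan.setCaching(500);        // fetch rows in batches for faster scanning
    scan.setCacheBlocks(false);  // keep a full scan from churning the block cache

    TableMapReduceUtil.initTableMapperJob(
        "canonical_documents", scan, SourceMapper.class,
        Text.class, IntWritable.class, job);

    job.setReducerClass(IntSumReducer.class);  // sums the 1s per source
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/documents_per_source"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}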
Example
➔ Trending keywords over time
➔ For a given keyword, how many documents per year?
➔ Multiple map/reduce tasks
➔ 100s of lines of java...
Pig Example

-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
    USING HbaseLoader('metadata:document')
    AS (protodoc);

-- De-serialise the protocol buffer
docs = FOREACH rawDocs
    GENERATE DocumentProtobufBytesToTuple(protodoc) AS doc;

-- Get (keyword, year) tuples
tagYear = FOREACH docs
    GENERATE FLATTEN(doc.keywords_bag) AS keyword,
             doc.year AS year;

-- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);

-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag
    GENERATE FLATTEN(group) AS (keyword, year),
             COUNT(tagYear) AS count;

-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;

-- Collect the (year, count) pairs into a bag per keyword
tagYearCounts = FOREACH tagYearCounts
    GENERATE group AS keyword,
             yearTagCount.(year, count) AS years;

STORE tagYearCounts INTO 'tag_year_counts';
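A script like this would then be run from the Pig client (e.g. pig -f tag_year_counts.pig; the file name is illustrative), assuming the jars providing HbaseLoader and the Elephant Bird protobuf UDFs have been REGISTERed first.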
Challenges
➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!
➔ HBase needs memory
➔ Stability issues if you give it too little
Challenges: Hardware
➔ Knowing where to start is hard...
➔ 2x quad-core Intel CPUs
➔ 4x 1TB disks
➔ Memory
➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon
➔ Currently 15 nodes
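On the memory point, the knob that matters is the region server heap, set in conf/hbase-env.sh. A sketch with an illustrative value, not a recommendation:

# conf/hbase-env.sh
# Maximum heap for the HBase JVMs, in MB (the stock default is only 1000).
# Too small a heap is what causes the stability issues above.
export HBASE_HEAPSIZE=8000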
www.mendeley.com
Comments

• I believe HBase has now improved to the point where you can serve from it directly. Though you still need to be careful with load management on your cluster so map reduce tasks don't soak up all the I/O!

• Dan: would you still use Voldemort as an HBase front-end cache now, or something else?

• I got asked this during the presentation and on Twitter: how come you are putting data in #Voldemort for serving #Mendeley data? Why not serve it directly from #HBase?

  This was mostly because, at that point in time, we didn't have as much data to serve as to process, so we could get away with less hardware. We tried it out, and serving from HBase works fine as long as you are not running a lot of map reduce jobs over it, as we do. Over time, as we grow and add more features that use HBase in more interesting ways, I'm sure we'll be using it for serving too; then we'll need a cluster just for serving and use the replication in 0.90 to link them together. Far cleaner than writing your own code to do that..

• Great to see some technical details about the Mendeley back-end.