DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Mendeley

DataScience talk by Kris Jack, Data Mining Team Lead at Mendeley Ltd
February 9th 2012, Graz, Austria

  1. Cloud Elephants and Witches: A Big Data Tale from Mendeley. Kris Jack, PhD, Data Mining Team Lead
  2. Overview
     ➔ What's Mendeley?
     ➔ The curse that comes with success
     ➔ A framework for scaling up (Hadoop + MapReduce)
     ➔ Moving to the cloud (AWS)
     ➔ Conclusions
  3. What's Mendeley?
  4. What is Mendeley? ...a large data technology startup company ...and it's on a mission to change the way that research is done!
  5. Mendeley and Last.fm. Last.fm works like this:
     1) Install "Audioscrobbler"
     2) Listen to music
     3) Last.fm builds your music profile and recommends you music you could also like... and it's the world's biggest open music database
  6. Last.fm          Mendeley
     music libraries  research libraries
     artists          researchers
     songs            papers
     genres           disciplines
  7. Mendeley provides tools to help users... ...organise their research
  8. Mendeley provides tools to help users... ...organise their research ...collaborate with one another
  9-10. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research
  11. The curse that comes with success
  12. In the beginning, there was...
      ➔ MySQL:
        ➔ Normalised tables for storing and serving:
          ➔ User data
          ➔ Article data
        ➔ The system was happy
      ➔ With this, we launched the article catalogue
        ➔ Lots of number crunching
        ➔ Many joins for basic stats
  13. Here's where the curse of success comes
      ➔ More articles came
      ➔ More users came
      ➔ The system became unhappy
      ➔ Keeping data fresh was a burden
        ➔ Algorithms relied on global counts
        ➔ Iterating over tables was slow
        ➔ Needed to shard tables to grow the catalogue
      ➔ In short, our system didn't scale
  14. 1.6 million+ users; the 20 largest userbases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
  15. [Chart comparing unique papers covered] Thomson Reuters' Web of Knowledge: real-time data on 28m unique papers (dating from 1934). Mendeley after 16 months: ~50m unique papers, from >150 million individual articles (>25TB)
  16. We had serious needs
      ➔ Scale up to the millions (billions for some items)
      ➔ Keep data fresh
      ➔ Support newly planned services
        ➔ Search
        ➔ Recommendations
      ➔ Business context
        ➔ Agile development (rapid prototyping)
        ➔ Cost effective
        ➔ Going viral
  17. A framework for scaling up (Hadoop and MapReduce)
  18. What is Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. www.hadoop.apache.org
  19. Hadoop
      ➔ Designed to operate on a cluster of computers
        ➔ 1...thousands
        ➔ Commodity hardware (low cost units)
      ➔ Each node offers local computation and storage
      ➔ Provides a framework for working with petabytes of data
      ➔ When learning about Hadoop, you need to learn about:
        ➔ HDFS
        ➔ MapReduce
  20. HDFS
      ➔ Hadoop Distributed File System
      ➔ Based on the Google File System
      ➔ Replicates data storage (reliability, x3, across racks)
      ➔ Designed to handle very large files (e.g. 64MB blocks)
      ➔ Provides high throughput
      ➔ File access through Java and Thrift APIs, the command line and a webapp (see the sketch below)
      ➔ Name node is a single point of failure (availability issue)
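A minimal sketch of the Java API access mentioned above (not from the talk): it opens a file in HDFS and streams its lines. The path and file layout are hypothetical; cluster settings come from the usual core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        // Picks up cluster settings (e.g. the default file system) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path to a flattened readership log stored in HDFS
        Path path = new Path("/data/readership/part-00000");
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
      }
    }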
  21. MapReduce
      ➔ MapReduce is a programming model
      ➔ Allows distributed processing of large data sets
      ➔ Based on Google's MapReduce
      ➔ Inspired by functional programming
      ➔ Take the program to the data, not the data to the program
  22. MapReduce Example: Article Readers by Country
      HDFS: large file (150M entries), flattened data, stored across nodes:
        doc_id1, reader_id1, usa, 2010, ...
        doc_id2, reader_id2, austria, 2012, ...
        doc_id1, reader_id3, china, 2010, ...
      Map (pivot countries by doc id):
        doc_id1, {usa, china, usa, uk, china, china, ...}
        doc_id2, {austria, austria, china, china, uk, ...}
      Reduce (calculate document stats):
        doc_id1, usa, 0.27
        doc_id1, china, 0.09
        doc_id1, uk, 0.09
        doc_id2, austria, 0.99
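For concreteness, here is a rough sketch of how slide 22's pipeline could be expressed against Hadoop's Java MapReduce API. The input field layout (doc_id, reader_id, country, year) follows the example data above; the class names and job wiring are assumptions, not Mendeley's actual code.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReadersByCountry {

      // Map: pivot each "doc_id, reader_id, country, year, ..." line to (doc_id, country)
      public static class PivotMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split(",\\s*");
          if (fields.length >= 3) {
            context.write(new Text(fields[0]), new Text(fields[2]));
          }
        }
      }

      // Reduce: for each doc, compute the share of readers per country
      public static class CountryShareReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text docId, Iterable<Text> countries, Context context)
            throws IOException, InterruptedException {
          Map<String, Integer> counts = new HashMap<String, Integer>();
          int total = 0;
          for (Text country : countries) {
            String c = country.toString();
            counts.put(c, counts.containsKey(c) ? counts.get(c) + 1 : 1);
            total++;
          }
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double share = (double) e.getValue() / total;
            context.write(docId, new Text(e.getKey() + ", " + String.format("%.2f", share)));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "readers-by-country");
        job.setJarByClass(ReadersByCountry.class);
        job.setMapperClass(PivotMapper.class);
        job.setReducerClass(CountryShareReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }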
  23. Hadoop
      ➔ HDFS for storing data
      ➔ MapReduce for processing data
      ➔ Together, bring the program to the data
  24. Hadoop's Users
  25. We make a lot of use of HDFS and MapReduce
      ➔ Catalogue Stats
      ➔ Recommendations (Mahout)
      ➔ Log Analysis (business analytics)
      ➔ Top Articles
      ➔ ... and more
      ➔ Quick, reliable and scalable
  26. Beware that these benefits have costs
      ➔ Migrating to a new system (data consistency)
      ➔ Setup costs
        ➔ Learn black magic to configure
        ➔ Hardware for cluster
      ➔ Administrative costs
        ➔ High learning curve to administrate Hadoop
        ➔ Still an immature technology
        ➔ You may need to debug the source code
      ➔ Tips
        ➔ Get involved in the community (e.g. meetups, forums)
        ➔ Use good commodity hardware
        ➔ Consider moving to the cloud...
  27. Moving to the cloud (AWS)
  28. What is AWS? Amazon Web Services (AWS) delivers a set of services that together form a reliable, scalable, and inexpensive computing platform "in the cloud". www.aws.amazon.com
  29. Why move to AWS?
      ➔ The cost of running your own cluster can be high
        ➔ Monetary (e.g. hardware)
        ➔ Time (e.g. training, setup, administration)
      ➔ AWS takes on these problems, renting their services to you based on your usage
  30. Article Recommendations
      ➔ Aim: help researchers to find interesting articles
        ➔ Combat information deluge
        ➔ Keep up-to-date with recent developments
      ➔ 1.6M users
      ➔ 50M articles
      ➔ Batch process for generating regular recommendations (using Mahout)
  31. Article Recommendations in EMR
      ➔ Use Amazon's Elastic Map Reduce (EMR)
      ➔ Upload input data (user libraries)
      ➔ Upload Mahout jar
      ➔ Spin up cluster
      ➔ Run the job (see the sketch below)
        ➔ You decide the number of nodes (cost vs time)
        ➔ You decide the spec of the nodes (cost vs quality)
      ➔ Retrieve the output
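These steps can also be scripted. Below is a hedged sketch using the AWS SDK for Java to start an EMR job flow that runs Mahout's distributed RecommenderJob; the S3 paths, credentials file, instance types and counts are all placeholders, and a production run would need more Mahout arguments than shown.

    import com.amazonaws.auth.PropertiesCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class LaunchRecommenderJob {
      public static void main(String[] args) throws Exception {
        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(
            new PropertiesCredentials(
                LaunchRecommenderJob.class.getResourceAsStream("aws.properties")));

        // One step: run the Mahout jar we uploaded to S3 (paths are placeholders)
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
            .withJar("s3://my-bucket/mahout-core-job.jar")
            .withArgs("org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
                      "--input", "s3://my-bucket/user-libraries/",
                      "--output", "s3://my-bucket/recommendations/",
                      "--booleanData", "true");

        StepConfig step = new StepConfig("article-recommendations", jarStep)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

        // You decide the number and spec of the nodes (cost vs time/quality)
        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("article-recommendations")
            .withLogUri("s3://my-bucket/logs/")
            .withSteps(step)
            .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(10)
                .withMasterInstanceType("m1.large")
                .withSlaveInstanceType("m1.large"));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
      }
    }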
  32. Catalogue Search
      ➔ 50 million articles
      ➔ 50GB index in Solr
      ➔ Variable load (over 24 hours)
        ➔ 1AM is quieter (100 q/s), 1PM is busier (150 q/s)
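For flavour, querying such a Solr index from Java via SolrJ looks roughly like this; the endpoint, core name and field names are invented, not Mendeley's actual schema.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class CatalogueSearch {
      public static void main(String[] args) throws Exception {
        // Hypothetical endpoint for the catalogue core
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/catalogue");

        SolrQuery query = new SolrQuery("title:\"machine learning\"");
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
          System.out.println(doc.getFieldValue("title"));
        }
      }
    }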
  33. Catalogue Search in the Context of Variable Load
      [Diagram: queries (100/s at 1AM, 150/s at 1PM) flow through an AWS elastic load balancer to a pool of AWS instances]
      ➔ Amazon's Elastic Load Balancer
      ➔ Only pay for nodes when you need them
        ➔ Spin up when load is high
        ➔ Tear down when load is low
      ➔ Cost effective and scalable
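The spin-up/tear-down loop can be automated against the EC2 and Elastic Load Balancing APIs. A rough sketch under stated assumptions: a pre-baked Solr machine image and a load balancer named "catalogue-search", both hypothetical.

    import com.amazonaws.auth.PropertiesCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;
    import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClient;
    import com.amazonaws.services.elasticloadbalancing.model.Instance;
    import com.amazonaws.services.elasticloadbalancing.model.RegisterInstancesWithLoadBalancerRequest;

    public class ScaleOutSearch {
      public static void main(String[] args) throws Exception {
        PropertiesCredentials credentials = new PropertiesCredentials(
            ScaleOutSearch.class.getResourceAsStream("aws.properties"));
        AmazonEC2Client ec2 = new AmazonEC2Client(credentials);
        AmazonElasticLoadBalancingClient elb =
            new AmazonElasticLoadBalancingClient(credentials);

        // Spin up one more Solr node from a pre-baked image (AMI id is a placeholder)
        String instanceId = ec2.runInstances(new RunInstancesRequest()
                .withImageId("ami-12345678")
                .withInstanceType("m1.large")
                .withMinCount(1)
                .withMaxCount(1))
            .getReservation().getInstances().get(0).getInstanceId();

        // Put it behind the load balancer so it starts taking queries
        elb.registerInstancesWithLoadBalancer(
            new RegisterInstancesWithLoadBalancerRequest()
                .withLoadBalancerName("catalogue-search")
                .withInstances(new Instance(instanceId)));
      }
    }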
  34. Problems we've faced
      ➔ Lack of control can be an issue
        ➔ Trade-off between administration and control
      ➔ Orchestration issues
        ➔ We have many services to coordinate
        ➔ CloudFormation & Elastic Beanstalk
      ➔ Migrating live services is hard work
  35. Conclusions
  36. Conclusions
      ➔ Mendeley has created the world's largest scientific database
      ➔ Storing and processing this data is a large scale challenge
      ➔ Hadoop, through HDFS and MapReduce, provides a framework for large scale data processing
      ➔ Be aware of administration costs when doing this in house
  37. Conclusions
      ➔ AWS can make scaling up efficient and cost effective
      ➔ Tap into the rich big data community out there
      ➔ We plan to make no more substantial hardware purchases, and instead use AWS
      ➔ Scaling up isn't a trivial problem; to save pain, plan for it from the outset
  38. Conclusions
      ➔ Magic elephants that live in clouds can lift the curses of evil witches
  39. www.mendeley.com