
Into the Data Mine



Presented at the 2013 LITA Forum, 7-10 November, Louisville, Kentucky (USA)

Map/Reduce programming has been around for many years, but the advent of the Hadoop framework has made the technology accessible to a much larger community. A growing community of developers, engineers, and scientists now relies on it to process large community data sets. This presentation explores practical applications that leverage the power of Map/Reduce programming in the Hadoop world. A short introduction to Map/Reduce and Hadoop explains how these technologies can be, and have been, used to address practical questions and problems, providing “real world” applications. Challenges experienced, and ideas for addressing them, are also discussed.


  1. Louisville, KY 2013. Into the Data Mine: Practical Applications of Hadoop Map/Reduce and Challenges of Working With Large Community Data Sets. Jeremy Browning, Consulting Software Engineer, OCLC Research; Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC Research, @LynnConnaway
  2. Topics: • Introduction • Big Data • Data Processing with Hadoop • Practical Examples • Pitfalls of Big Data • Recommendations for Using Big Data • Questions ©2013 OCLC. This work is licensed under a Creative Commons Attribution 3.0 Unported License. Suggested attribution: “This work uses content from [presentation title] © OCLC, used under a Creative Commons Attribution license:”
  3. Introduction. Jeremy Browning, Consulting Software Engineer; Lynn Silipigni Connaway, Senior Research Scientist
  4. About Me • OCLC Research • Using Hadoop MapReduce for 7 years • Hive for 5 years • HBase for 2 years • Cloudera certified Hadoop administrator and developer
  5. OCLC Contactdag: Utrecht. Introduction to Big Data
  6. "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…" - Dan Ariely, Duke University. Pitfalls of “Big Data”
  7. What is Big Data? • Collecting data for years • Currently collecting at a larger scale • Collection and storage become more difficult • Processing large data sets is complicated and time-consuming
  8. What is Big Data? • Many formats • Unstructured • Not organized • Hard to fit into traditional databases • Difficult to parse • Examples: • Full text • Tweets • Social media status updates/comments
  9. What is Big Data? • Structured/semi-structured data • Organized by some scheme • Can be parsed • Examples: • Web server logs • Custom application logs
  10. How Big is Big Data? • Google: 2,000,000 search queries per minute • Twitter: 100,000+ tweets per minute • Facebook: 684,478 status updates/comments per minute
  11. Problems with Current Big Data Systems • Data exchanges require synchronization • Limited bandwidth • Difficult to deal with failures • Storage is expensive • Typically uses a SAN • Data copied to compute nodes as needed
  12. Data Becomes the Bottleneck • Copying data to processors becomes the bottleneck • Quick calculation: • Typical disk data transfer rate: 75 MB/sec • Time taken to transfer 100 GB of data: ~22 minutes
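The quick calculation on the slide can be reproduced in a few lines; the only inputs are the 75 MB/sec disk rate and the 100 GB data set size given above:

```python
# Back-of-the-envelope check of the transfer-time figure on the slide.
disk_rate_mb_per_s = 75      # typical disk transfer rate, in MB/sec
data_size_gb = 100           # data set to copy to a compute node

seconds = data_size_gb * 1000 / disk_rate_mb_per_s   # treating 1 GB = 1000 MB
minutes = seconds / 60
print("~%d minutes" % round(minutes))                # ~22 minutes
```

This is why moving computation to the data, rather than data to the computation, is the core idea behind Hadoop.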
  13. Need for a New Approach • Partial failure support • Failure of a node should not bring down the entire system • Data recovery • Failure should not result in data loss • Node recovery • If a node fails and then recovers, rejoining the system should not require a full system restart
  14. Need for a New Approach • Consistency • Node failure during job execution should not affect the output of the job • Scalability • Adding more jobs to the system should result in a graceful decline in performance • The system should not fail under heavy load • An increase in system resources should yield a proportional increase in capacity
  15. Introduction to Hadoop
  16. What is Hadoop? • Based on work done at Yahoo and Facebook • Designed to run on off-the-shelf machinery • High-performance parallel data processing • Reliable data storage
  17. Why Hadoop? • Distributed data • Data stored locally on nodes • Jobs launched on the machines where the data are stored
  18. What is Hadoop? • Distributed file system (HDFS) • Based on the Google File System • Uses blocks to separate data • Replicates blocks across the cluster for high availability
  19. HDFS Increases Reliability • Each data block is stored on three or more nodes • If a node is lost, its data will be replicated again • “Rack awareness”
  20. Map/Reduce • Massive jobs broken down into many smaller jobs • Process data where it is stored • Developers only need to write the Mapper/Reducer • Can be written in any language that understands “standard I/O” • Java • Python • C++ • Excellent failure recovery
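The "any language that understands standard I/O" point refers to Hadoop Streaming, where the framework pipes input lines to the mapper on stdin, sorts the emitted key/value pairs by key, and feeds them to the reducer. A minimal word-count sketch in Python (the local pipeline simulation at the bottom stands in for what Hadoop does across the cluster):

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    Assumes pairs arrive sorted by key, as Hadoop guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of the map -> sort -> reduce pipeline.
    lines = ["big data big cluster", "data node"]
    sorted_pairs = sorted(mapper(lines))
    for word, total in reducer(sorted_pairs):
        print("%s\t%d" % (word, total))
```

In a real Streaming job the mapper and reducer would be two scripts reading stdin and writing tab-separated pairs to stdout; the structure of the two phases is the same.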
  21. Scheduling and Pools • Job scheduling • Fair scheduler • First in, first out • Capacity scheduler • Assigns jobs as capacity opens up • Guaranteed resources based on pool
  22. Examples of Using Big Data
  23. WorldCat Search Autosuggest AT A GLANCE • Top 100 search terms • Nightly job calculates the previous day's top searches and updates the list
  24. The Data • Apache access logs • Nightly job pulls all query strings • Uses regular expressions to parse the data • Calculates totals for each query
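The core of such a job can be sketched in a few lines. The exact log format and the `q` parameter name are assumptions; the slides only say that Apache access logs are parsed with regular expressions and the totals are counted:

```python
import re
from collections import Counter

# Hypothetical pattern for a search request in an Apache access log.
QUERY_RE = re.compile(r'GET /search\?q=([^ &"]+)')

def top_queries(log_lines, n=100):
    """Extract query strings from access-log lines and return the top n
    with their counts, most frequent first."""
    counts = Counter()
    for line in log_lines:
        match = QUERY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)
```

In the Hadoop version, the regex extraction is the map phase and the counting is the reduce phase; `Counter` plays both roles in this single-process sketch.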
  25. WorldCat Collection Analysis AT A GLANCE • Comparison of library holdings • Monthly job compares all institutions in WorldCat
  26. The Data • WorldCat XML records • Monthly job • Parses data into an XML DOM for ease of extracting data • Compares library collections
  27. EasyBib Citations AT A GLANCE • Displays the number of times an item is cited • Links cited items together for recommendations
  28. The Data • Custom log format • Tab-delimited • Nightly process • Parses data and extracts: • OCLC number • Time stamp • Bibliography ID
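Extracting the three named fields from one tab-delimited log line might look like the sketch below; the field order is an assumption, since the slides list the fields but not their positions:

```python
def parse_citation_line(line):
    """Split one tab-delimited log line into the three fields the
    nightly process extracts. Field order is assumed, not documented."""
    oclc_number, timestamp, bibliography_id = line.rstrip("\n").split("\t")[:3]
    return {"oclc_number": oclc_number,
            "timestamp": timestamp,
            "bibliography_id": bibliography_id}
```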
  29. WorldCat Publisher Pages AT A GLANCE • Displays organizational chart • Lists authors, subjects, and languages
  30. The Data • WorldCat XML records • Parses data into an XML DOM for ease of extracting data • Groups data by publisher ISBN string
  31. Pitfalls of Big Data
  32. BIG Data is not a “Magic Bullet”
  33. Pitfalls of Big Data • Data are raw and uninterpreted • Can be very complicated to link data points • Often data are hidden or missing • Easy to fall into “data-driven” decisions instead of “data-enhanced” discussions
  34. “With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… Big data isn’t about bits, it’s about talent.” - Douglas Merrill, contributing author, Forbes. Pitfalls of “Big Data”
  35. Recommendations for Using Big Data • Include “domain experts” • Make connections between business rules and data • Don’t obfuscate useful data with quantity of data
  36. Jeremy Browning. Questions?