Into the Data Mine

Published on

http://www.ala.org/lita/conferences/forum/2013

http://www.oclc.org/research/presentations.html

Presented at the 2013 LITA Forum, 7-10 November, Louisville, Kentucky (USA)

Map/Reduce programming has been around for many years; however, with the recent creation of the Hadoop architecture, this technology has become more accessible to the larger community. A growing community of developers, engineers, and scientists now relies on it to process large community data sets. This presentation explores practical applications that leverage the power of Map/Reduce programming in the Hadoop world. A short introduction to Map/Reduce and Hadoop explains how these technologies can be, and have been, used to address practical questions and problems, grounding them in “real world” applications. Challenges experienced and ideas for addressing them are also discussed.

Published in: Education, Technology

Transcript of "Into the Data Mine"

  1. Louisville, KY, 2013. Into the Data Mine: Practical Applications of Hadoop Map/Reduce and Challenges of Working With Large Community Data Sets. Jeremy Browning, Consulting Software Engineer, OCLC Research, browninj@oclc.org; Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC Research, connawal@oclc.org, @LynnConnaway
  2. Topics: • Introduction • Big Data • Data Processing with Hadoop • Practical Examples • Pitfalls of Big Data • Recommendations for Using Big Data • Questions
  3. Introduction: Jeremy Browning, Consulting Software Engineer; Lynn Silipigni Connaway, Senior Research Scientist
  4. About Me • OCLC Research • Using Hadoop MapReduce for 7 years • Hive for 5 years • HBase for 2 years • Cloudera-certified Hadoop administrator and developer
  5. Introduction to Big Data
  6. 6. "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…" Pitfalls of “Big Data” - Dan Ariely Duke University
  7. What is Big Data? • Collecting data for years • Currently collecting at a larger scale • Collection and storage become more difficult • Processing large datasets is complicated and time-consuming
  8. What is Big Data? • Many formats • Unstructured • Not organized • Hard to fit into traditional databases • Difficult to parse • Examples • Full text • Tweets • Social media status updates/comments
  9. What is Big Data? • Structured/semi-structured data • Organized by some scheme • Can be parsed • Examples • Web server logs • Custom application logs
  10. How Big is Big Data? • Google: 2,000,000 search queries per minute • Twitter: 100,000+ tweets per minute • Facebook: 684,478 status updates/comments per minute
  11. Problems with Current Big Data Systems • Data exchanges require synchronization • Limited bandwidth • Difficult to deal with failures • Storage is expensive • Typically using a SAN • Data copied to compute nodes as needed
  12. Data Becomes the Bottleneck • Copying data to the processors becomes the bottleneck • Quick calculation: at a typical disk data transfer rate of 75 MB/sec, transferring 100 GB of data takes ~22 minutes
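That back-of-envelope figure is easy to verify; a quick sketch, taking the slide's 75 MB/sec as the sustained rate of a single commodity disk:

```python
# Time to move 100 GB off a single disk at a sustained 75 MB/sec.
disk_rate_mb_s = 75
data_mb = 100 * 1024

seconds = data_mb / disk_rate_mb_s
print("%.1f minutes" % (seconds / 60))   # ~22.8 minutes
```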
  13. Need for a New Approach • Partial Failure Support • Failure of a node should not bring down the entire system • Data Recovery • A failure should not result in data loss • Node Recovery • A node that fails and then recovers should be able to rejoin without a full system restart
  14. Need for a New Approach • Consistency • Node failure during job execution should not affect the output of the job • Scalability • Adding more jobs should produce a graceful decline in performance, not failure • The system should not fail under heavy load • An increase in system resources should yield a proportional increase in capacity
  15. Introduction to Hadoop
  16. What is Hadoop? • Based on work done at Yahoo and Facebook • Designed to run on off-the-shelf machinery • High-performance parallel data processing • Reliable data storage
  17. Why Hadoop? • Distributed data • Data stored locally on nodes • Jobs launched on the machines where the data are stored
  18. What is Hadoop? • Distributed file system (HDFS) • Based on the Google File System • Uses blocks to separate data • Replicates blocks across the cluster for high availability
  19. HDFS Increases Reliability • Each data block is stored on three or more nodes • If a node is lost, its data are replicated again • “Rack awareness”
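To make the block-and-replica model concrete: with the era-typical defaults of 64 MB blocks and 3 replicas (both configurable, via dfs.block.size and dfs.replication), a 100 GB file spreads across the cluster like this:

```python
# How HDFS carves up and replicates a 100 GB file under the
# common defaults of the time: 64 MB blocks, 3 replicas.
BLOCK_MB, REPLICAS = 64, 3
file_mb = 100 * 1024

blocks = -(-file_mb // BLOCK_MB)                # ceiling division
print("blocks:", blocks)                        # 1600
print("block copies cluster-wide:", blocks * REPLICAS)               # 4800
print("raw storage: %d GB" % (blocks * REPLICAS * BLOCK_MB // 1024))  # 300
```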
  20. Map/Reduce • Massive jobs are broken down into many smaller tasks • Data are processed where they are stored • Developers only need to write the Mapper and Reducer • These can be written in any language that understands “Standard I/O” • Java • Python • C++ • Excellent failure recovery
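The “standard I/O” route is Hadoop Streaming: Hadoop pipes input records to the mapper's stdin, reads tab-separated key/value pairs from its stdout, sorts them by key, and does the same for the reducer. A minimal word-count sketch in Python (word count is the stock example, not one of the jobs described in this deck):

```python
#!/usr/bin/env python
# mapper.py: emit <word, 1> for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: Hadoop delivers mapper output sorted by key, so totals
# can be accumulated in a single pass with no in-memory table.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```

A job like this is launched through the streaming jar that ships with Hadoop (the jar's exact path varies by distribution), along the lines of: hadoop jar hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py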
  21. Scheduling and Pools • Job scheduling • Fair scheduler • First in, first out • Capacity scheduler • Assigns jobs as capacity opens up • Guaranteed resources based on pool
  22. Examples of Using Big Data
  23. WorldCat Search Autosuggest AT A GLANCE • Top 100 search terms • Nightly job calculates the previous day's top searches and updates the list.
  24. The Data • Apache access logs • Nightly job pulls all query strings • Uses regular expressions to parse the data • Calculates totals for each query
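A sketch of that parse-and-count step. The request-line pattern matches standard Apache combined-format logs, but the q= search parameter is a hypothetical name, since the actual WorldCat URL layout isn't shown here:

```python
# Count search terms pulled out of Apache access logs.
# The "q" parameter name is a placeholder for illustration.
import re
import sys
from collections import Counter
from urllib.parse import unquote_plus

REQUEST = re.compile(r'"(?:GET|POST) (\S+) HTTP/')  # request line in a combined-format log
QUERY = re.compile(r'[?&]q=([^&\s]+)')              # hypothetical search parameter

totals = Counter()
for line in sys.stdin:
    m = REQUEST.search(line)
    q = QUERY.search(m.group(1)) if m else None
    if q:
        totals[unquote_plus(q.group(1)).lower()] += 1

for term, count in totals.most_common(100):         # feeds the "Top 100" list
    print("%s\t%d" % (term, count))
```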
  25. WorldCat Collection Analysis AT A GLANCE • Comparison of library holdings • Monthly job compares all institutions in WorldCat
  26. The Data • WorldCat XML records • Monthly job • Parses data into an XML DOM for ease of extracting data • Compares library collections
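Loading each record into a DOM trades memory for straightforward field access. A minimal sketch; the element and attribute names (record, oclcNumber, holding, institution) are invented for illustration, since the real record schema isn't reproduced here:

```python
# Parse one XML record into a DOM and extract fields for comparison.
# Element and attribute names are hypothetical placeholders.
from xml.dom import minidom

def extract(record_xml):
    dom = minidom.parseString(record_xml)
    oclc = dom.getElementsByTagName("oclcNumber")[0].firstChild.data
    holders = [h.getAttribute("institution")
               for h in dom.getElementsByTagName("holding")]
    return oclc, holders

print(extract('<record><oclcNumber>12345</oclcNumber>'
              '<holding institution="OCL"/><holding institution="UKM"/></record>'))
# ('12345', ['OCL', 'UKM'])
```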
  27. EasyBib Citations AT A GLANCE • Displays number of times cited • Links items cited together for recommendations
  28. The Data • Custom log format • Tab-delimited • Nightly process • Parses data and extracts • OCLC number • Timestamp • Bibliography ID
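With a tab-delimited custom format, extraction is a split rather than a regex. A sketch of a streaming mapper that keys each citation by bibliography ID, so all items cited together arrive at the same reducer and can be linked; the column order is an assumption:

```python
# mapper: emit <bibliography_id, oclc_number:timestamp> so that every
# citation from one bibliography meets at the same reducer, where
# co-cited items can be linked. The column order is hypothetical.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                  # skip malformed lines
    oclc_number, timestamp, bib_id = fields[:3]
    print("%s\t%s:%s" % (bib_id, oclc_number, timestamp))
```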
  29. WorldCat Publisher Pages AT A GLANCE • Displays an organizational chart • Lists authors, subjects, and languages
  30. The Data • WorldCat XML records • Parses data into an XML DOM for ease of extracting data • Groups data by publisher ISBN string
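In MapReduce terms the grouping falls out of the shuffle: emit the publisher segment of the ISBN as the map output key, and each publisher's records collect at one reducer. A toy sketch, assuming the publisher prefix has already been isolated upstream (doing that properly requires the ISBN range tables, which are beyond this example):

```python
# Group records by publisher ISBN prefix; in a real job this grouping
# is the shuffle, with the prefix as the map output key.
from collections import defaultdict

records = [                      # (publisher_prefix, title) - toy data
    ("0-19", "Oxford title A"),
    ("90-04", "Brill title"),
    ("0-19", "Oxford title B"),
]

by_publisher = defaultdict(list)
for prefix, title in records:
    by_publisher[prefix].append(title)

for prefix, titles in sorted(by_publisher.items()):
    print(prefix, titles)
```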
  31. Pitfalls of Big Data
  32. BIG Data is not a “Magic Bullet”
  33. Pitfalls of Big Data • Data are raw and uninterpreted • Linking data points can be very complicated • Often data are hidden or missing • Easy to fall into “data-driven” decisions instead of “data-enhanced” discussions
  34. “With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… Big data isn’t about bits, it’s about talent” - Douglas Merrill, contributing author, Forbes, “Pitfalls of ‘Big Data’”
  35. Recommendations for Using Big Data • Include “domain experts” • Make connections between business rules and data • Don’t let the sheer quantity of data obscure the useful data
  36. Jeremy Browning, browninj@oclc.org. Questions? ©2013 OCLC. This work is licensed under a Creative Commons Attribution 3.0 Unported License. Suggested attribution: “This work uses content from Into the Data Mine © OCLC, used under a Creative Commons Attribution license: http://creativecommons.org/licenses/by/3.0/”