What is Hadoop? 
Hadoop Driven Digital Preservation 
Clemens Neudecker 
KB National Library of the Netherlands 
SCAPE & OPF Hackathon 
Vienna, 2 Dec 2013
Timeline
• Dec 2004: Dean/Ghemawat (Google) MapReduce paper
• 2005: Doug Cutting and Mike Cafarella (Yahoo) 
create Hadoop, at first only to extend Nutch 
(the name is derived from Doug’s son’s toy elephant) 
• 2006: Yahoo runs Hadoop on 5-20 nodes 
This work was partially supported by the SCAPE Project. 
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Timeline 
• March 2008: Cloudera founded
• July 2008: Hadoop wins TeraByte sort benchmark
(1st time a Java program won this competition)
• April 2009: Amazon introduces “Elastic MapReduce”
as a service on S3/EC2
• June 2011: Hortonworks founded
Timeline 
• 27 Dec 2011: Apache Hadoop release 1.0.0
• June 2012: Facebook claims “biggest Hadoop cluster”,
totalling more than 100 PetaBytes in HDFS
• 2013: Yahoo runs Hadoop on 42,000 nodes,
computing about 500,000 MapReduce jobs per day
• 15 Oct 2013: Apache Hadoop release 2.2.0 (YARN)
Contributions 2006 - 2011 
(Cf. http://hortonworks.com/blog/reality-check-contributions-to-apache-hadoop/) 
“Core” Hadoop 
• Hadoop Common (formerly Hadoop Core) 
• Hadoop MapReduce 
• Hadoop YARN (MapReduce 2.0) 
• Hadoop Distributed File System (HDFS) 
The wider Hadoop Ecosystem 
• Ambari, ZooKeeper (managing & monitoring)
• HBase, Cassandra (database) 
• Hive, Pig (data warehouse and query language) 
• Mahout (machine learning) 
• Chukwa, Avro, Oozie, Giraph, and many more 
The wider Hadoop Ecosystem 
(Cf. http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera)
“Hadoop is a hammer”
• “Hadoop is a hammer. Start by figuring out what house
you’re gonna build.”
Alistair Croll
• “If all you have is a hammer, throw away everything
that is not a nail!”
Jimmy Lin
MapReduce in 41 words (including “library”) 
Goal: count the number of books in the library. 
• Map: 
You count up shelf #1, I count up shelf #2. 
(The more people we get, the faster this part goes) 
• Reduce: 
We all get together and add up our individual counts. 
(Cf. http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html) 
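The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop code: the shelf contents and the names `library`, `count_shelf` and `add_counts` are made up for the example, and everything runs in one process rather than on a cluster.

```python
from functools import reduce

# Hypothetical library: each shelf is a list of book titles.
library = {
    "shelf 1": ["Dune", "Emma", "Ulysses"],
    "shelf 2": ["Hamlet", "Walden"],
    "shelf 3": ["Faust"],
}


def count_shelf(books):
    """Map step: one person counts the books on one shelf."""
    return len(books)


def add_counts(left, right):
    """Reduce step: add up two individual counts."""
    return left + right


# Map: every shelf is counted independently (this part parallelises).
shelf_counts = list(map(count_shelf, library.values()))

# Reduce: the individual counts are added up into one total.
total_books = reduce(add_counts, shelf_counts, 0)

print(shelf_counts, total_books)
```

Because each shelf is counted without looking at any other shelf, the map step is the part that gets faster as more people (or nodes) join in.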
MapReduce in a nutshell
[Figure: Task 1, Task 2 and Task 3 each produce an aggregated result; these are combined into one aggregated result as the output data. © Sven Schlarb]
MapReduce “v1” issues 
• JobTracker as a single point of failure
• Deficiencies in scalability, memory consumption,
threading model, reliability and performance
(https://issues.apache.org/jira/browse/MAPREDUCE-278)
• No support for programming paradigms other than
MapReduce (e.g. BSP, Bulk Synchronous Parallel)
MapReduce vs YARN 
(Cf. http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/) 
When to use Hadoop? 
• Generally, whenever “standard tools” no longer work
because of sheer data size
(rule of thumb: if your data fits on a regular hard drive,
you’re better off sticking with Python/SQL/Bash/etc.!)
• Aggregation across large data sets: use the power of
Reducers!
• Large-scale ETL operations (extract, transform, load)
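To make the aggregation case concrete, here is a minimal single-process sketch of the map / shuffle / reduce flow with keys. It assumes a made-up list of file records (name, MIME type, size) and hypothetical helper names `mapper`, `shuffle` and `reducer`; on a real cluster the framework does the grouping between the two phases, and the data would not fit in a Python list.

```python
from collections import defaultdict

# Hypothetical input records: (file name, MIME type, size in bytes).
records = [
    ("a.tif", "image/tiff", 400),
    ("b.pdf", "application/pdf", 120),
    ("c.tif", "image/tiff", 600),
    ("d.xml", "text/xml", 30),
    ("e.pdf", "application/pdf", 80),
]


def mapper(record):
    """Emit one (key, value) pair per record: (MIME type, size)."""
    _name, mime_type, size = record
    yield mime_type, size


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate all values of one key: (number of files, total size)."""
    return key, (len(values), sum(values))


pairs = [pair for record in records for pair in mapper(record)]
totals = dict(reducer(key, values) for key, values in shuffle(pairs).items())

print(totals)
```

Each reducer only ever sees the values for its own key, which is why this kind of aggregation spreads well across many machines.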
Reading 
• Tom White: Hadoop: The Definitive Guide
(get 3rd ed. for extra YARN chapter)
• YARN explained (really quite well):
http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
• Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce:
http://lintool.github.io/MapReduceAlgorithms/ed1n.html
Happy Hadooping! 