INTRODUCTION TO HADOOP 
BreizhJug 
Rennes – 2014-11-06 
David Morin - @davAtBzh
Me 
David Morin 
@davAtBzh 
Solutions Engineer at
3 
What is Hadoop ?
4 
An elephant – This one ?
5 
No, this one !
6 
The father
7 
Let's go !
8 
Let's go !
9 
Timeline
10 
How did the story begin ? 
=> The need to deal with high volumes of data
11 
Big Data – Big Server ?
12 
Big Data – Big Server ?
13 
Big Data – Big Problems ?
14 
Big Data – Big Problems ?
15 
Split is the key
16 
How to find data ?
17 
Define a master
18 
Try again
19 
Not so bad
20 
Hadoop fundamentals 
● Distributed filesystem for high volumes of data 
● Uses commodity servers (to limit costs) 
● Scalable / fault tolerant
21 
HDFS 
HDFS
22 
Hadoop Distributed FileSystem
23 
Hadoop fundamentals 
● Distributed filesystem for high volumes of data 
● Uses commodity servers (to limit costs) 
● Scalable / fault tolerant ??
24 
Hadoop Distributed FileSystem
25 
MapReduce 
HDFS MapReduce
26 
MapReduce
27 
MapReduce : word count 
Map phase, then Reduce phase (a simplified Java sketch follows)
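To make the two phases concrete, here is a simplified word-count sketch using the standard Hadoop Java API (class names are illustrative, not the exact code shown later in the deck):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word of every input line
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts received for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}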
28 
Data Locality Optimization
29 
MapReduce in action
30 
Hadoop v1 : drawbacks 
– One NameNode : SPOF 
– One JobTracker : SPOF and not scalable (limited number of nodes) 
– MapReduce only : the platform needed to be opened up to non-MR applications 
– MapReduce v1 : does not fit well with the iterative algorithms used by Machine Learning
31 
Hadoop v2 
Improvements : 
– HDFS v2 : standby NameNode (the NameNode is no longer a SPOF) 
– YARN (Yet Another Resource Negotiator) 
● JobTracker => ResourceManager + ApplicationMasters (one per application) 
● Can be used by non-MapReduce applications 
– MapReduce v2 : runs on YARN
32 
Hadoop v2
33 
YARN
34 
YARN
35 
YARN
36 
YARN
37 
YARN
38 
YARN
39 
What about monitoring ? 
● Command line : hadoop job, yarn 
● Web UI to monitor cluster status 
● Web UI to check the status of running jobs 
● Access to node activity log files from the web UI
40 
What about monitoring ?
41 
What can we do with Hadoop ? 
(Me) Two projects at Credit Mutuel Arkea : 
– LAB : anti-money laundering 
– Operational reporting for a B2B customer
42 
LAB : Context 
● Tracfin : the French anti-money-laundering unit, supervised by the Ministry of the Economy and Finance
43 
LAB : Context 
● Difficulty producing accurate alerts : the system was complex to maintain and to extend with new features
44 
LAB : Context 
● COBOL batch (z/OS) : ran from 19:00 until 09:00 the next day
45 
LAB : Migration to Hadoop 
● Pig : the Pig dataflow model fits this kind of process well (lots of data manipulation)
46 
LAB : Migration to Hadoop 
● A lot of input data : +1 for Pig
47 
LAB : Migration to Hadoop 
● Many job tasks can be parallelized : +1 for Hadoop
48 
LAB : Migration to Hadoop 
● Time spent on data manipulation reduced by more than 50 %
49 
LAB : Migration to Hadoop 
● The previous job was a batch : MapReduce is OK
50 
Operational Reporting 
Context : 
– Provide a wide variety of reports to a B2B partner 
Why Hadoop : 
– New project 
– A huge number of different data sources as input : Pig helps ! 
– Batch processing is fine
51
52 
Pig – Why a new language ? 
● With Pig, writing MR jobs becomes easy 
● Dataflow model : data is the key ! 
● Language : Pig Latin 
● No limits : User Defined Functions (see the UDF sketch below)
http://pig.apache.org/docs/r0.13.0/ 
https://github.com/linkedin/datafu 
https://github.com/twitter/elephant-bird 
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
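As an illustration of the UDF point above, a minimal sketch of a Pig eval function in Java (the class name UpperCase is hypothetical):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal Pig eval UDF: returns its string argument in upper case
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged in a jar and declared with REGISTER, it can be called from Pig Latin like any built-in function, e.g. FOREACH words GENERATE UpperCase(word);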
53 
Pig “Hello world” 
● Pig word count 
-- Load a file from HDFS 
lines = LOAD '/user/XXX/file.txt' AS (line:chararray); 
-- Iterate over each line : 
-- TOKENIZE splits the line into words and FLATTEN turns the resulting bag into one tuple per word 
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; 
-- Group by word 
grouped = GROUP words BY word; 
-- Count the number of occurrences for each group (word) 
wordcount = FOREACH grouped GENERATE group, COUNT(words); 
-- Display the results on stdout 
DUMP wordcount;
54 
Pig vs MapReduce 
import … 
public class WordCount2 { 
  public static class TokenizerMapper 
      extends Mapper<Object, Text, Text, IntWritable> { 
    static enum CountersEnum { INPUT_WORDS } 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 
    private boolean caseSensitive; 
    private Set<String> patternsToSkip = new HashSet<String>(); 
    private Configuration conf; 
    private BufferedReader fis; 
    ... 
=> 130 lines of code !
55 
Hive 
● SQL-like language : HQL 
● Metastore : data abstraction and data discovery 
● UDFs (see the sketch below)
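For the UDF bullet, a minimal sketch of a Hive UDF using the simple UDF API (the class name MyUpper is illustrative):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal Hive UDF: upper-cases a string column
public final class MyUpper extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toUpperCase());
    }
}

It is then registered from HQL with ADD JAR and CREATE TEMPORARY FUNCTION my_upper AS 'MyUpper'; and used like a built-in function.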
56 
Hive “Hello world” 
● Hive word count 
-- Create a table with its structure (DDL) 
CREATE TABLE docs (line STRING); 
-- Load data into it 
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; 
-- Create a table for the results : 
-- select data from the previous table, split each line into words, 
-- group by word and count the records per group 
CREATE TABLE word_counts AS 
SELECT word, count(1) AS count FROM 
(SELECT explode(split(line, '\\s')) AS word FROM docs) w 
GROUP BY word 
ORDER BY word;
57 
Zookeeper 
Purpose : coordinate the different actors and serve a global 
configuration pushed to the cluster.
58 
Zookeeper 
● Distributed coordination service
59 
Zookeeper 
● Dynamic configuration (see the sketch below) 
● Distributed locking
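A minimal sketch of the dynamic-configuration use case with the ZooKeeper Java client (the znode path /app/config and the connection string are assumptions):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Read a configuration znode and get notified when it changes
public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    public ConfigWatcher(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 3000, this);
    }

    public byte[] readConfig() throws Exception {
        // passing "true" registers this Watcher: we are called back on NodeDataChanged
        return zk.getData("/app/config", true, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            System.out.println("Configuration changed, re-reading " + event.getPath());
        }
    }
}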
60 
Kafka 
● Messaging system with a specific design 
● Publish/subscribe and point-to-point at the same time 
● Suited to high volumes of data (see the producer sketch below) 
https://kafka.apache.org/
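For illustration, a minimal producer sketch (the broker address and topic name are placeholders; this uses the org.apache.kafka.clients Java producer, which is newer than the client available at the time of this talk):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publish one message to a Kafka topic
public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello hadoop"));
        }
    }
}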
61 
Hadoop : Batch but not only..
62 
Tez 
● Interactive processing engine underneath Hive and Pig
63 
HBase 
● Online database (realtime querying) 
● NoSQL : column-oriented database 
● Based on Google's BigTable 
● Stores its data on HDFS (see the client sketch below)
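A minimal client sketch (table, column family and values are placeholders; assumes the Connection/Table API of recent HBase clients):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one cell, then read it back by row key
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}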
64 
Storm 
● Stream processing 
● Integrates well with Apache Kafka 
● Allows data to be processed as it arrives
http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos 
http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign
65 
Cascading 
● Application development platform on Hadoop 
● APIs in Java : standard API, data processing, data 
integration, scheduler API
66 
Scalding 
● Scala API for Cascading
67 
Phoenix 
● Relational DB layer over HBase 
● HBase access exposed through a JDBC client (see the sketch below) 
● Performance : on the order of milliseconds for small 
queries, or seconds for tens of millions of rows
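Because access goes through JDBC, querying Phoenix looks like any other JDBC code (connection string, table and column are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Standard JDBC against an HBase table exposed by Phoenix
public class PhoenixQuery {
    public static void main(String[] args) throws Exception {
        // "zkhost" stands for the ZooKeeper quorum of the HBase cluster
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
             PreparedStatement stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            stmt.setLong(1, 42L);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}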
68 
Spark 
● Big data analytics, in-memory and on disk 
● Complements Hadoop 
● Faster and more flexible (see the sketch below)
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark 
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
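As a point of comparison with the MapReduce and Pig versions, the same word count as a Spark job (sketch assuming the Spark 2.x Java API, which is newer than the release current at the time of this talk):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Word count on an HDFS file, written with the RDD API
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]); // e.g. an HDFS path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);
        }
    }
}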
69 
??
