INTRODUCTION TO HADOOP 
BreizhJug 
Rennes – 2014-11-06 
David Morin - @davAtBzh
Me 
David Morin 
@davAtBzh 
Solutions Engineer at
3 
What is Hadoop ?
4 
An elephant – This one ?
5 
No, this one !
6 
The father
7 
Let's go !
8 
Let's go !
9 
Timeline
10 
How did the story begin ? 
=> The need to deal with high volumes of data
11 
Big Data – Big Server ?
12 
Big Data – Big Server ?
13 
Big Data – Big Problems ?
14 
Big Data – Big Problems ?
15 
Split is the key
16 
How to find data ?
17 
Define a master
18 
Try again
19 
Not so bad
20 
Hadoop fundamentals 
● Distributed filesystem for high volumes of data 
● Uses commodity servers (to limit costs) 
● Scalable / fault tolerant
21 
HDFS 
HDFS
22 
Hadoop Distributed FileSystem
23 
Hadoop fundamentals 
● Distributed filesystem for high volumes of data 
● Uses commodity servers (to limit costs) 
● Scalable / fault tolerant ??
24 
Hadoop Distributed FileSystem
25 
MapReduce 
HDFS MapReduce
26 
MapReduce
27 
MapReduce : word count 
Map phase, then Reduce phase (a simplified Java sketch follows)
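To make the two phases concrete, here is a simplified word-count sketch using the standard Hadoop Java API (class names are illustrative, not the exact code shown later in the deck):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word of every input line
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts received for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}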
28 
Data Locality Optimization
29 
MapReduce in action
30 
Hadoop v1 : drawbacks 
– One NameNode : SPOF 
– One JobTracker : SPOF and not scalable (limited number of nodes) 
– MapReduce only : the platform needed to be opened up to non-MR applications 
– MapReduce v1 : does not fit well with the iterative algorithms used by Machine Learning
31 
Hadoop v2 
Improvements : 
– HDFS v2 : standby NameNode (the NameNode is no longer a SPOF) 
– YARN (Yet Another Resource Negotiator) 
● JobTracker => ResourceManager + ApplicationMasters (one per application) 
● Can be used by non-MapReduce applications 
– MapReduce v2 : runs on YARN
32 
Hadoop v2
33 
YARN
34 
YARN
35 
YARN
36 
YARN
37 
YARN
38 
YARN
39 
What about monitoring ? 
● Command line : hadoop job, yarn 
● Web UI to monitor cluster status 
● Web UI to check the status of running jobs 
● Access to node activity log files from the web UI
40 
What about monitoring ?
41 
What can we do with Hadoop ? 
(Me) Two projects at Credit Mutuel Arkea : 
– LAB : anti-money laundering 
– Operational reporting for a B2B customer
42 
LAB : Context 
● Tracfin : the French anti-money-laundering unit, supervised by the Ministry of the Economy and Finance
43 
LAB : Context 
● Difficulty producing accurate alerts : the system was complex to maintain and to extend with new features
44 
LAB : Context 
● COBOL batch (z/OS) : ran from 19:00 until 09:00 the next day
45 
LAB : Migration to Hadoop 
● Pig : the Pig dataflow model fits this kind of process well (lots of data manipulation)
46 
LAB : Migration to Hadoop 
● A lot of input data : +1 for Pig
47 
LAB : Migration to Hadoop 
● Many job tasks can be parallelized : +1 for Hadoop
48 
LAB : Migration to Hadoop 
● Time spent on data manipulation reduced by more than 50 %
49 
LAB : Migration to Hadoop 
● The previous job was a batch : MapReduce is OK
50 
Operational Reporting 
Context : 
– Provide a wide variety of reports to a B2B partner 
Why Hadoop : 
– New project 
– A huge number of different data sources as input : Pig helps ! 
– Batch processing is fine
51
52 
Pig – Why a new language ? 
● With Pig, writing MR jobs becomes easy 
● Dataflow model : data is the key ! 
● Language : Pig Latin 
● No limits : User Defined Functions (see the UDF sketch below)
http://pig.apache.org/docs/r0.13.0/ 
https://github.com/linkedin/datafu 
https://github.com/twitter/elephant-bird 
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
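As an illustration of the UDF point above, a minimal sketch of a Pig eval function in Java (the class name UpperCase is hypothetical):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal Pig eval UDF: returns its string argument in upper case
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged in a jar and declared with REGISTER, it can be called from Pig Latin like any built-in function, e.g. FOREACH words GENERATE UpperCase(word);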
53 
Pig “Hello world” 
● Pig word count 
-- Load a file from HDFS 
lines = LOAD '/user/XXX/file.txt' AS (line:chararray); 
-- Iterate over each line : 
-- TOKENIZE splits the line into words and FLATTEN turns the resulting bag into one tuple per word 
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; 
-- Group by word 
grouped = GROUP words BY word; 
-- Count the number of occurrences for each group (word) 
wordcount = FOREACH grouped GENERATE group, COUNT(words); 
-- Display the results on stdout 
DUMP wordcount;
54 
Pig vs MapReduce 
import … 
public class WordCount2 { 
  public static class TokenizerMapper 
      extends Mapper<Object, Text, Text, IntWritable> { 
    static enum CountersEnum { INPUT_WORDS } 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 
    private boolean caseSensitive; 
    private Set<String> patternsToSkip = new HashSet<String>(); 
    private Configuration conf; 
    private BufferedReader fis; 
    ... 
=> 130 lines of code !
55 
Hive 
● SQL-like language : HQL 
● Metastore : data abstraction and data discovery 
● UDFs (see the sketch below)
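For the UDF bullet, a minimal sketch of a Hive UDF using the simple UDF API (the class name MyUpper is illustrative):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal Hive UDF: upper-cases a string column
public final class MyUpper extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toUpperCase());
    }
}

It is then registered from HQL with ADD JAR and CREATE TEMPORARY FUNCTION my_upper AS 'MyUpper'; and used like a built-in function.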
56 
Hive “Hello world” 
● Hive word count 
-- Create a table with its structure (DDL) 
CREATE TABLE docs (line STRING); 
-- Load data into it 
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; 
-- Create a table for the results : 
-- select data from the previous table, split each line into words, 
-- group by word and count the records per group 
CREATE TABLE word_counts AS 
SELECT word, count(1) AS count FROM 
(SELECT explode(split(line, '\\s')) AS word FROM docs) w 
GROUP BY word 
ORDER BY word;
57 
Zookeeper 
Purpose : coordinate the different actors and serve a global 
configuration pushed to the cluster.
58 
Zookeeper 
● Distributed coordination service
59 
Zookeeper 
● Dynamic configuration (see the sketch below) 
● Distributed locking
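A minimal sketch of the dynamic-configuration use case with the ZooKeeper Java client (the znode path /app/config and the connection string are assumptions):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Read a configuration znode and get notified when it changes
public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    public ConfigWatcher(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 3000, this);
    }

    public byte[] readConfig() throws Exception {
        // passing "true" registers this Watcher: we are called back on NodeDataChanged
        return zk.getData("/app/config", true, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            System.out.println("Configuration changed, re-reading " + event.getPath());
        }
    }
}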
60 
Kafka 
● Messaging system with a specific design 
● Publish/subscribe and point-to-point at the same time 
● Suited to high volumes of data (see the producer sketch below) 
https://kafka.apache.org/
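For illustration, a minimal producer sketch (the broker address and topic name are placeholders; this uses the org.apache.kafka.clients Java producer, which is newer than the client available at the time of this talk):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publish one message to a Kafka topic
public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello hadoop"));
        }
    }
}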
61 
Hadoop : Batch but not only..
62 
Tez 
● Interactive processing engine underneath Hive and Pig
63 
HBase 
● Online database (realtime querying) 
● NoSQL : column-oriented database 
● Based on Google's BigTable 
● Stores its data on HDFS (see the client sketch below)
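A minimal client sketch (table, column family and values are placeholders; assumes the Connection/Table API of recent HBase clients):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one cell, then read it back by row key
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}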
64 
Storm 
● Stream processing 
● Integrates well with Apache Kafka 
● Allows data to be processed as it arrives
http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos 
http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign
65 
Cascading 
● Application development platform on Hadoop 
● APIs in Java : standard API, data processing, data 
integration, scheduler API
66 
Scalding 
● Scala API for Cascading
67 
Phoenix 
● Relational DB layer over HBase 
● HBase access exposed through a JDBC client (see the sketch below) 
● Performance : on the order of milliseconds for small 
queries, or seconds for tens of millions of rows
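Because access goes through JDBC, querying Phoenix looks like any other JDBC code (connection string, table and column are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Standard JDBC against an HBase table exposed by Phoenix
public class PhoenixQuery {
    public static void main(String[] args) throws Exception {
        // "zkhost" stands for the ZooKeeper quorum of the HBase cluster
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
             PreparedStatement stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            stmt.setLong(1, 42L);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}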
68 
Spark 
● Big data analytics, in-memory and on disk 
● Complements Hadoop 
● Faster and more flexible (see the sketch below)
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark 
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
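As a point of comparison with the MapReduce and Pig versions, the same word count as a Spark job (sketch assuming the Spark 2.x Java API, which is newer than the release current at the time of this talk):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Word count on an HDFS file, written with the RDD API
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]); // e.g. an HDFS path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);
        }
    }
}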
69 
??
