BigData @ Digital Factory
A short story still being written

Olivier Varene
olivier.varene@orange.com

DSIF/DFY
Orange DevDay 2013

Hadoop

Hadoop - Core
MapReduce
HDFS

Hadoop genealogy

Hadoop Time bar
[Timeline of release lines: 0.2[0-2].x, 0.23.x, 1.x, 2.x]

Hadoop Distribution
Packaging
Deployment
Support

Main Distributions

Distribution | Licence                              | Business Model                      | Support
Apache       | Apache 2.0                           | Foundation                          | community only
HortonWorks  | Apache 2.0, HortonWorks (add-on)     | PS + Training + support             | community + Professional
Cloudera     | Apache 2.0, Closed Source (not core) | PS + Licensing + Training + support | community + Professional
MapR         | Apache 2.0, Closed Source (FS)       | PS + Licensing + support            | community + Professional
WanDisco     | Apache 2.0, Closed Source (DConE)    | PS + Licensing + Training + support | community + Professional

PS: Professional Services

Big Name Distributions
Paying & Closed Source
•  IBM InfoSphere BigInsights
•  GreenPlum (EMC)
•  Intel Distribution for Hadoop
•  …

Big Data Suite
Tooling
Code generation
Scheduling
Integration

Tools (1st level)

Tool             | Description               | Licence
Apache Pig       | Scripting Platform        | Apache 2
Apache Hive      | Data Access & Query       | Apache 2
Apache HCatalog  | Metadata Services         | Apache 2
Apache HBase     | NoSQL Database            | Apache 2
Apache ZooKeeper | Cluster Coordination      | Apache 2
Apache Tez       | Query processing          | Apache 2
Apache Oozie     | Workflow Scheduler        | Apache 2
Apache Sqoop     | Data Integration Services | Apache 2

Tools (add-ons)

Tool                | Description                    | Licence
Teradata connector  | Connector                      | Teradata + Distribution
Hive ODBC           | ODBC                           | Distribution
Mahout              | Data Mining                    | Apache 2
Cascading           | Fault Tolerant API / Framework | Apache 2
Cassandra Connector | Connector to Cassandra NoSQL   | Apache 2
MongoDB Connector   | Connector to MongoDB           | Apache 2
…

Landscape

@ Digital Factory
DSIF / Digital Factory

Back in Time
3 years ago

•  PageRank computation on billions of nodes and tens of billions of edges
•  regularly failed! (hardware ...)
•  4 to 8 weeks per computation
•  unscalable
•  failure rate around 80%
•  One person full time to supervise

Answer?

Internal Development
+ full control
- long term
- €€

OpenSource
+ €€
+ short term
- support
- evolution

Success
In PRODUCTION since 2010

How does it work?

Hadoop Axioms
•  The system shall manage and heal itself
•  Performance shall scale linearly
•  Compute shall move to data
•  Modular and extensible

HDFS (Simple)
Self-healing High-Bandwidth Clustered Storage

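To make "self-healing" concrete, here is a toy sketch in plain Python. It is not HDFS code: the function name `heal`, the node names, and the "pick the first spare live node" policy are assumptions for illustration only. The replication target of 3 mirrors the usual HDFS default.

```python
def heal(block_replicas, live_nodes, replication=3):
    """Toy re-replication: drop replicas on dead nodes, then copy each
    under-replicated block to spare live nodes until the target is met."""
    healed = {}
    for block, nodes in block_replicas.items():
        alive = [n for n in nodes if n in live_nodes]
        if not alive:
            raise RuntimeError(f"block {block} lost: no live replica")
        # Illustrative policy only: take spare nodes in alphabetical order.
        candidates = [n for n in sorted(live_nodes) if n not in alive]
        missing = max(0, replication - len(alive))
        healed[block] = alive + candidates[:missing]
    return healed


if __name__ == "__main__":
    # Node n2 has died; block b1 gets a new replica on another live node.
    print(heal({"b1": ["n1", "n2", "n3"]}, {"n1", "n3", "n4", "n5"}))
```

The point of the sketch is only that the storage layer, not the operator, notices the missing replica and restores the replication level.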
MapReduce V1 (Simple)
cat <data> | <Mapper> | sort | <Reducer>

Framework:     cat <data> | … | sort | …
Your program:  <Mapper> and <Reducer>

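The shell analogy above can be sketched in plain Python, with no Hadoop involved. The word-count task and the names `mapper`, `reducer` and `run_pipeline` are illustrative assumptions, not part of the original slides; `sort` and `groupby` stand in for the framework.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Your program, part 1: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Your program, part 2: sum the counts collected for one key."""
    return key, sum(values)


def run_pipeline(lines):
    """Framework part: cat <data> | mapper | sort | reducer."""
    mapped = [pair for line in lines for pair in mapper(line)]  # cat | mapper
    mapped.sort(key=itemgetter(0))                              # sort
    return [reducer(key, [v for _, v in group])                 # reducer
            for key, group in groupby(mapped, key=itemgetter(0))]


if __name__ == "__main__":
    print(run_pipeline(["Hadoop moves compute to data", "data on HDFS"]))
```

Only `mapper` and `reducer` change from one job to the next; everything in `run_pipeline` is what the framework provides, at cluster scale.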
YARN
Allows plugging in new paradigms

MapReduce V1
[Diagram: data on HDFS -> several Map() tasks (Map phase) -> Sort / Partition -> Reduce() tasks (Reduce phase) -> partXX files]

Before map()
[Diagram: data on HDFS -> slicing / partitioning into blocks of data -> each block feeds a Map() as (Kin,Vin) pairs -> Map() emits (Kout,Vout)]

JobTracker calculates locality for job assignment and input split data

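A toy illustration of "locality for job assignment", in plain Python. This is not the JobTracker's actual algorithm: the greedy policy, the function name `assign_map_tasks`, and the node and block names are all assumptions made for the sketch.

```python
def assign_map_tasks(block_locations, free_slots):
    """Greedy toy scheduler: give each block to a node that holds a replica
    if that node still has a free map slot, else to any node with a slot."""
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block, hosts in block_locations.items():
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            node, is_local = local[0], True       # compute moves to the data
        else:
            remote = [n for n, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free map slot left")
            node, is_local = remote[0], False     # data must cross the network
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
    print(assign_map_tasks(blocks, {"n1": 1, "n2": 1, "n3": 1}))
```

The fraction of tasks that end up data-local is what the scheduler tries to maximise; every non-local task costs a network read of its block.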
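Java (API)
Mapper
// Kin, Vin, Kout, Vout stand for your input/output key and value types
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

public class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
    // optional: protected void setup(Context context)
    // optional: protected void cleanup(Context context)
    @Override
    protected void map(Kin key, Vin value, Context context)
            throws IOException, InterruptedException {
        // ... your program ...
    }
}
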
Before reduce()
[Diagram: map-side output path]
•  Map() emits (Kout,Vout) pairs
•  RAM sorting, then disk write: temporary intermediate files, sorted within each file
•  OPTIONAL: Combine() on (Kout,Vout), 1 or more times
•  Partition(): key namespace partitioning into parts
•  JobTracker: distribution of the parts to reducers

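The combine and partition steps can be sketched in plain Python. This is an illustration under assumptions, not Hadoop's implementation: the names `combine`, `partition` and `shuffle` are made up here, counts are used as values, and `zlib.crc32` stands in for the key's hash code (Hadoop's default partitioner also reduces a hash of the key modulo the number of reducers).

```python
import zlib
from collections import defaultdict


def combine(pairs):
    """Optional combiner: pre-aggregate one mapper's (key, count) output
    locally so that fewer records are written to disk and shipped."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())  # intermediate output is sorted by key


def partition(key, num_reducers):
    """Map a key to a reducer index. Every occurrence of a key must land
    on the same reducer. crc32 is a stand-in for the key's hash code."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Combine each mapper's output, then split it into one part per reducer."""
    parts = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in combine(pairs):
            parts[partition(key, num_reducers)].append((key, value))
    return parts


if __name__ == "__main__":
    outputs = [[("a", 1), ("a", 1)], [("a", 1), ("b", 1)]]
    print(shuffle(outputs, num_reducers=3))
```

A combiner is only safe when the reduce operation is associative and commutative (sum, max, count), since the framework may apply it zero, one or several times.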
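Java (API)
Reducer
// Kin, Vin, Kout, Vout stand for your input/output key and value types
import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;

public class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
    // optional: protected void setup(Context context)
    // optional: protected void cleanup(Context context)
    @Override
    protected void reduce(Kin key, Iterable<Vin> values, Context context)
            throws IOException, InterruptedException {
        // ... your program ...
    }
}
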
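Why reduce() receives one key together with all of its values can be sketched in plain Python: the reducer side merges the sorted parts it fetched from the mappers and groups consecutive records by key. The function name `reduce_side` and the sample data are assumptions for illustration, not Hadoop code.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side(sorted_parts, reduce_fn):
    """Merge already-sorted parts from several mappers, group the records
    by key, and call reduce_fn(key, values) once per distinct key."""
    merged = heapq.merge(*sorted_parts, key=itemgetter(0))
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]


if __name__ == "__main__":
    parts = [[("a", 1), ("c", 2)], [("a", 3), ("b", 1)]]
    print(reduce_side(parts, lambda key, values: (key, sum(values))))
```

Because each part is already sorted, the merge is a streaming operation and never needs all the records in memory at once.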
Optimization tips
•  JVM
•  Algorithm in MapReduce paradigm
•  Combiner
•  Sort algorithm
•  Partitioning

Streaming
… | <mapper> | … | <reducer> | …
•  STDIN
•  STDOUT
•  Text as input and output by default
•  '\t' (tab) as default separator
•  Use your language: perl, python, shell, ruby, …
•  (interpreter needed on all nodes)

hadoop jar $streamingJar -input <inputDir> -output <outputDir> -mapper <mapProg> -reducer <reduceProg> -file <files>

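A minimal sketch of a streaming mapper and reducer in Python, following the conventions listed above (text lines on STDIN/STDOUT, tab between key and value). The word-count task, the script name `wordcount.py` and its `map` / `reduce` argument are assumptions for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Streaming mapper: read raw text lines, emit 'key<TAB>value' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Streaming reducer: input arrives sorted by key, one 'key<TAB>value'
    per line, so consecutive lines with the same key can be summed."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical convention: `wordcount.py map` or `wordcount.py reduce`.
    step = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    sys.stdout.writelines(out + "\n" for out in step(sys.stdin))
```

Under these assumptions the script could be tested locally with `cat data | python wordcount.py map | sort | python wordcount.py reduce`, then passed to the streaming jar as `-mapper "python wordcount.py map" -reducer "python wordcount.py reduce" -file wordcount.py`.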
Pipes – C++
… | <mapper> | … | <reducer> | …
•  Socket communication
•  Bytes as input and output
•  C++ API

class MyMap: public HadoopPipes::Mapper { … }

class MyReducer: public HadoopPipes::Reducer { … }

hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir> -program <path/binfile> [-conf <confFile>]

Too difficult
Fortunately, there are tools that can generate code for you or let you run SQL queries!

Tools
Algo / Libs

PIG
Scripting Language:
•  Simple
•  Parallel execution
•  Data oriented
•  Extensible via UDF
•  Automatic performance enhancement via compiler

set job.name calculateGraphDegres

%default nbpigreducers 10
set default_parallel $nbpigreducers

-- out-degrees
A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';

-- distribution of out-degrees
C = foreach A generate out_deg,1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';

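For readers new to Pig, the two computations in the script above can be mirrored on a small in-memory list in plain Python. The function names and the sample rows are assumptions; Pig runs the same logic as parallel MapReduce jobs over HDFS files.

```python
from collections import Counter


def ordered_out_degrees(rows):
    """Like A2 and B: keep rows with out_deg > 1, highest out-degree first."""
    kept = [(url, deg) for url, deg in rows if deg > 1]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def out_degree_distribution(rows):
    """Like C to F: count how many URLs have each out-degree, ascending."""
    return sorted(Counter(deg for _, deg in rows).items())


if __name__ == "__main__":
    sample = [("u1", 3), ("u2", 1), ("u3", 3), ("u4", 5)]  # made-up data
    print(ordered_out_degrees(sample))
    print(out_degree_distribution(sample))
```

The `group ... by` plus `SUM` pair in Pig plays the role of `Counter` here: it is a reduce step keyed on the out-degree.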
Hive
Querying Language:
•  HiveQL (SQL-like)
•  ETL Tool
•  HDFS, HBase, Thrift …
•  MapReduce interface (with streaming to python …)
•  Extensible via UDF

CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

INSERT INTO TABLE b_packet_out
select count(*) as overall,
sum( if(protocol rlike '^ip:tcp',1,0) ) as tcp,
sum( if(protocol rlike '^ip:udp',1,0) ) as udp,
sum( if(protocol rlike '^ip:icmp',1,0) ) as icmp
from b_packet;

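What the query above computes can be mirrored on a few pipe-delimited lines in plain Python. The function name `protocol_counts` and the sample records (including the shape of the protocol strings) are assumptions for illustration; Hive compiles the same aggregation into MapReduce jobs.

```python
def protocol_counts(lines):
    """Count all packets, and those whose protocol field starts with
    ip:tcp, ip:udp or ip:icmp, from 'timestamp|length|protocol' lines."""
    counts = {"overall": 0, "tcp": 0, "udp": 0, "icmp": 0}
    for line in lines:
        _timestamp, _packet_length, protocol = line.rstrip("\n").split("|")
        counts["overall"] += 1
        for name in ("tcp", "udp", "icmp"):
            if protocol.startswith("ip:" + name):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [                      # made-up records
        "t1|60|ip:tcp:http",
        "t2|80|ip:udp:dns",
        "t3|40|ip:tcp",
        "t4|28|arp",
    ]
    print(protocol_counts(sample))
```

Each `sum(if(...))` column in the query corresponds to one counter in the dictionary.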
R
RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
•  rmr: functions providing mapreduce in R
•  rhdfs: functions providing HDFS operations in R
•  rhbase: functions providing HBase operations in R

library(rmr2)
library(rhdfs)
gdp <- read.csv("GDP_converted.csv")
hdfs.init()
gdp.values <- to.dfs(gdp)
gdp.map.fn <- function(k,v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1) }
count.reduce.fn <- function(k,v) { keyval(k, length(v)) }

count <- mapreduce(input=gdp.values, map = gdp.map.fn,
                   reduce = count.reduce.fn)
from.dfs(count)

GUI Tools
Time saver
Prototyping
Visualize complex processes
Fast changes
PoC

But you still need to know the internals to optimize

SQL
Prod / Beta & Alpha products
[Diagram: SQL-on-Hadoop products and their interfaces. Labels: ODBC/JDBC, JDBC, HiveQL, SQL, HQL, ISO, PSQL; products: Impala, Phoenix, Presto, Tajo; underlying layers: Hive, HBase, HDFS]

Sqoop
Transfer between HDFS and structured storage, in both directions, via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …

[Diagram: RDBMS / NoSQL -> Sqoop import -> Hadoop process -> Sqoop export]

Oozie

Nowadays @ Digital Factory?

In Production
•  Since 2010
•  Growth driven by internal project needs
•  Recycling servers (€€ savings)
•  We learned as we went:
   * tar -> cdh3 -> cdh4 …
   * optimizations
   * Run processes …

Production « PFS »
•  Shared among different teams
•  xx nodes on COTS
•  xxx TBytes
•  >xxx jobs per day
•  Monitoring: Xymon
•  Graphing via NetStat (SNMP / RRD: x's OIDs/second)
•  Automatic Configuration

Architecture
[Architecture diagram. Components shown: App Services, HIVE Server, HIVE, PIG, Web Service, Oozie, Real Time Query Engine, R, Flume, Sqoop, HCatalog, Cassandra, HBase, MapReduce, Mahout, Khiops, ZooKeeper, Cascading, HDFS. Legend: some components are marked "in POC".]

Benefits
•  Infrastructure cost
•  Development cost
•  Robustness
•  Scalability
•  New development areas (Graph Mining, Logs statistics …)

€: -70% LOC, -50% dev time, -75% run cost

A few of our use cases

Scoring - Search Engine
Graph algorithms for http://www.lemoteur.fr/
xRank

xxx TB in RAM
xx TB compressed
xx billion nodes
>xxx billion edges

Profiling
Customers' statistical behavior, ad display optimization, …
Customer profile

xxx GB daily
+
xxx GB monthly (customer DB)

Log Analysis
OJD certified Measurements: Internet and Mobile, Customers' journey analysis, …
KPIs

xx billion events daily

with NoSQL
Hadoop over Cassandra (next session)

Benefits & Drawbacks

Benefits:
Scalable
Stable
RUN Cost
Development Cost
Performance
Very fast evolution
New Dev areas

Drawbacks:
Learning curve
Debug
Algorithms
Complex
Very fast evolution

Future
•  Enhance security and robustness
•  Create Services & Functional Catalog
•  Continue building our expertise: Fast Data, Cascading, MR2, …
•  A thousand-node cluster!
•  Help other teams go to production

CONTACT US: olivier.varene@orange.com

Thank you

My Thanks to
•  Apache http://www.apache.org/
•  http://hadooper.blogspot.com/
•  Cloudera http://www.cloudera.com/
•  HortonWorks http://www.hortonworks.com/
•  And all the community!

Big Data @ Orange - Dev Day 2013 - part 2