Big Data @ Orange - Dev Day 2013 - part 2

BigData @ Digital Factory
A short story still being written

Olivier Varene
olivier.varene@orange.com

DSIF/DFY
Orange DevDay 2013

Hadoop

Hadoop - Core
MapReduce
HDFS

Hadoop genealogy

Hadoop Time bar
2.x
1.x
0.2[0-2].x
0.23.x

Hadoop Distribution
Packaging
Deployment
Support

Main Distributions

Distribution | Licence                              | Business Model                      | Support
Apache       | Apache 2.0                           | Foundation                          | community only
HortonWorks  | Apache 2.0, HortonWorks (add-on)     | PS + Training + support             | community + Professional
Cloudera     | Apache 2.0, Closed Source (not core) | PS + Licensing + Training + support | community + Professional
MapR         | Apache 2.0, Closed Source (FS)       | PS + Licensing + support            | community + Professional
WanDisco     | Apache 2.0, Closed Source (DConE)    | PS + Licensing + Training + support | community + Professional

PS: Professional Services

Big Name Distributions
Paying & Closed Source

•  IBM InfoSphere BigInsights
•  GreenPlum (EMC)
•  Intel Distribution for Hadoop
•  …


Big Data Suite
Tooling
Code generation
Scheduling
Integration

Tools (1st level)

Tool             | Description               | Licence
Apache Pig       | Scripting Platform        | Apache 2
Apache Hive      | Data Access & Query       | Apache 2
Apache HCatalog  | Metadata Services         | Apache 2
Apache HBase     | NoSQL Database            | Apache 2
Apache ZooKeeper | Cluster Coordination      | Apache 2
Apache Tez       | Query processing          | Apache 2
Apache Oozie     | Workflow Scheduler        | Apache 2
Apache Sqoop     | Data Integration Services | Apache 2


Tools (add-ons)

Tool                | Description                    | Licence
Teradata connector  | Connector                      | Teradata + Distribution
Hive ODBC           | ODBC                           | Distribution
Mahout              | Data Mining                    | Apache 2
Cascading           | Fault Tolerant API / Framework | Apache 2
Cassandra Connector | Connector to Cassandra NoSQL   | Apache 2
MongoDB Connector   | Connector to MongoDB           | Apache 2
…

Landscape


@ Digital Factory
DSIF / Digital Factory


Back in Time
- 3 years

•  PageRank computation on billions of nodes and tens of billions of edges
•  regularly failed (hardware ...)
•  4 to 8 weeks per computation
•  not scalable
•  failure rate around 80%
•  one person full time to supervise

Answer?

Internal Development
+ full control
- long term
- €€

OpenSource
+ €€
+ short term
- support
- evolution

Success

In PRODUCTION since 2010


How does it work?


Hadoop Axioms
•  System shall manage and heal itself
•  Performance shall scale linearly
•  Compute shall move to data
•  Modular and extensible

HDFS (Simple)
Self-healing High-Bandwidth Clustered Storage


MapReduce V1 (Simple)
cat <data> | <Mapper> | sort | <Reducer>

Framework:     cat <data> | ........ | sort | .........
Your program:  .......... <Mapper> ........ <Reducer>
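
The `cat <data> | <Mapper> | sort | <Reducer>` analogy can be sketched in a few lines of plain Python. This is only an illustration, assuming a word-count job: `run_job`, `word_mapper` and `sum_reducer` are hypothetical names, and nothing here uses the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def word_mapper(line):
    """'Your program': emit one (word, 1) pair per word."""
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    """'Your program': sum all values seen for one key."""
    yield key, sum(values)


def run_job(lines, mapper, reducer):
    """'Framework': cat <data> | mapper | sort | reducer."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # the `sort` stage groups equal keys together
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


if __name__ == "__main__":
    data = ["hadoop pig hive", "pig hadoop", "hadoop"]
    print(run_job(data, word_mapper, sum_reducer))
```

Only the two small functions change from one job to another; the sorting and grouping stay in the framework part.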

YARN
Allows plugging in new paradigms


MapReduce V1
[Diagram: data on HDFS is read by several parallel Map() tasks, whose output feeds the Reduce() tasks. Stages shown: Map (partXX) -> Sort / Partition -> Reduce (partXX)]

Before map()
[Diagram: data on HDFS -> blocks of data -> slicing / partitioning -> (Kin,Vin) -> Map() -> (Kout,Vout)]
JobTracker calculates locality for job assignment and input split data

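Java (API)
Mapper
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Kin, Vin, Kout, Vout stand for your input/output key and value types
public class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
    // optional: protected void setup(Context context)
    // optional: protected void cleanup(Context context)
    @Override
    protected void map(Kin key, Vin value, Context context)
            throws IOException, InterruptedException {
        // ... your program ...
    }
}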

Before reduce()
[Diagram: Map() emits (Kout,Vout) -> RAM sorting, then disk write -> temporary intermediate files, sorted within each file -> Combine() (OPTIONAL, 1 or more times) -> (Kout,Vout) temporary intermediate files -> Partition() (key namespace partitioning) -> parts -> JobTracker distribution to reducers]
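
The combine and partition steps before reduce() can be illustrated with a small Python sketch. It assumes word-count style (key, count) pairs; `combine`, `partition` and `shuffle` are hypothetical helper names, and CRC32 is only a stand-in for whatever hash the real partitioner applies to the key.

```python
import zlib
from collections import defaultdict


def combine(pairs):
    """Optional combiner: pre-aggregate (key, count) pairs on the map side."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Map a key to a reducer index with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Build one 'part' per reducer from the combined map output."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in combine(pairs):
        parts[partition(key, num_reducers)].append((key, value))
    return parts


if __name__ == "__main__":
    map_output = [("pig", 1), ("hive", 1), ("pig", 1), ("hadoop", 1)]
    for index, part in enumerate(shuffle(map_output, 2)):
        print(index, part)
```

Because the partition depends only on the key, every value for a given key ends up at the same reducer, and the combiner reduces the volume of data that has to be moved there.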

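Java (API)
Reducer
import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;

// Kin, Vin, Kout, Vout stand for your input/output key and value types
public class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
    // optional: protected void setup(Context context)
    // optional: protected void cleanup(Context context)
    @Override
    protected void reduce(Kin key, Iterable<Vin> values, Context context)
            throws IOException, InterruptedException {
        // ... your program ...
    }
}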

Optimization tips
•  JVM
•  Algorithm in MapReduce paradigm
•  Combiner
•  Sort algorithm
•  Partitioning

Streaming
… | <mapper> | … | <reducer> | …
•  STDIN
•  STDOUT
•  Text as input and output by default
•  '\t' (tab) as default separator
•  Use your language: perl, python, shell, ruby, …
•  (interpreter needed on all nodes)

hadoop jar $streamingJar -input <inputDir> -output <outputDir> \
    -mapper <mapProg> -reducer <reduceProg> -file <files>
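
As an illustration of a streaming program, here is a minimal word-count sketch in Python that reads text lines and writes tab-separated `key<TAB>value` lines. The file name `wordcount.py` and the `map` / `reduce` command-line argument are assumptions made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word read."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: `wordcount.py map` or `wordcount.py reduce`.
    steps = {"map": map_lines, "reduce": reduce_lines}
    if len(sys.argv) == 2 and sys.argv[1] in steps:
        for out in steps[sys.argv[1]](sys.stdin):
            print(out)
```

It can be tried locally with `cat <data> | python wordcount.py map | sort | python wordcount.py reduce`. On a cluster, the same two commands would presumably be passed as `<mapProg>` and `<reduceProg>`, with the script shipped through `-file`.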

Pipes - C++
… | <mapper> | … | <reducer> | …
•  Socket communication
•  Bytes as input and output
•  C++ API

class MyMap: public HadoopPipes::Mapper { … };
class MyReducer: public HadoopPipes::Reducer { … };

hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir> \
    -program <path/binfile> [-conf <confFile>]


Too difficult
Fortunately, there are tools that can generate code for you or let you run SQL queries!

Tools

Algo / Libs

PIG
Scripting Language:
•  Simple
•  Parallel execution
•  Data oriented
•  Extensible via UDF
•  Automatic performance enhancement via compiler

set job.name calculateGraphDegres

%default nbpigreducers 10
set default_parallel $nbpigreducers

-- out-degrees
A = load '$degout' using PigStorage() as (url:chararray, out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';

-- distribution of out-degrees
C = foreach A generate out_deg, 1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg, SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';
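
To make the data flow of the script easier to follow, here is a rough single-machine equivalent in Python. It assumes the input is already loaded as (url, out_deg) tuples; `ordered_out_degrees` and `out_degree_distribution` are hypothetical names, and the sketch says nothing about how Pig actually executes the script.

```python
from collections import Counter


def ordered_out_degrees(rows):
    """Keep rows with out_deg > 1, sorted by out_deg descending (relation B)."""
    kept = [(url, deg) for url, deg in rows if deg > 1]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def out_degree_distribution(rows):
    """Count how many urls share each out_deg, ascending (relation F)."""
    counts = Counter(deg for _, deg in rows)
    return sorted(counts.items())


if __name__ == "__main__":
    rows = [("a", 3), ("b", 1), ("c", 3), ("d", 2)]
    print(ordered_out_degrees(rows))
    print(out_degree_distribution(rows))
```

As in the script, the distribution is computed on all rows (relation A), not only on the filtered ones.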

Hive
Querying Language:
•  HiveQL (SQL-like)
•  ETL Tool
•  HDFS, HBase, Thrift …
•  MapReduce interface (with streaming to python …)
•  Extensible via UDF

CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

CREATE EXTERNAL TABLE b_packet_out (protocol string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

INSERT INTO TABLE b_packet_out
select count(*) as overall,
  sum(if(protocol like 'ip:tcp', 1, 0)) as tcp,
  sum(if(protocol like 'ip:udp', 1, 0)) as udp,
  sum(if(protocol like 'ip:icmp', 1, 0)) as icmp
from b_packet;
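
For readers less familiar with SQL, the aggregation in the SELECT can be sketched in plain Python. This assumes "|"-separated input lines with three fields and protocol values of the form `ip:tcp`, `ip:udp`, `ip:icmp` (an assumption about the data); `protocol_counts` is a hypothetical name.

```python
def protocol_counts(lines, sep="|"):
    """Count rows overall and per protocol, in the spirit of the SELECT."""
    counts = {"overall": 0, "tcp": 0, "udp": 0, "icmp": 0}
    for line in lines:
        # Fields follow the b_packet table: timestamp, packet_length, protocol.
        _timestamp, _packet_length, protocol = line.rstrip("\n").split(sep)
        counts["overall"] += 1
        for name in ("tcp", "udp", "icmp"):
            # LIKE without wildcards behaves like an equality test.
            if protocol == f"ip:{name}":
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = ["t1|60|ip:tcp", "t2|80|ip:udp", "t3|60|ip:tcp", "t4|40|eth:arp"]
    print(protocol_counts(sample))
```

The difference with Hive is where the loop runs: Hive compiles the query into MapReduce jobs spread over the cluster, while this sketch scans the lines on one machine.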

R
RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
•  rmr: functions providing mapreduce in R
•  rhdfs: functions providing hdfs operations in R
•  rhbase: functions providing hbase operations in R

library(rmr2)
library(rhdfs)
gdp <- read.csv("GDP_converted.csv")
hdfs.init()
gdp.values <- to.dfs(gdp)
gdp.map.fn <- function(k, v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}
count.reduce.fn <- function(k, v) { keyval(k, length(v)) }
count <- mapreduce(input = gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
from.dfs(count)

GUI Tools
Time saver
Prototyping
Visualize complex processes
Fast changes
PoC

But you need to know the internals for optimization

SQL
Prod / Beta & Alpha products
[Diagram: SQL access to Hadoop. Interfaces: ODBC/JDBC, JDBC. Query languages: HiveQL, SQL, HQL, ISO, PSQL. Products shown: Impala, Phoenix, Presto, Tajo, on top of Hive, HBase and HDFS]

Sqoop
Transfers data between HDFS and structured storage, in both directions, via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …

[Diagram: RDBMS / NoSQL -> Sqoop import -> Hadoop process -> Sqoop export]

Oozie


Nowadays @ Digital Factory?


In Production
•  Since 2010
•  Growth driven by the needs of internal projects
•  Recycling servers (€€ savings)
•  We learned as we walked:
   * tar -> cdh3 -> cdh4 …
   * optimizations
   * run processes …

Production « PFS »
•  Shared among different teams
•  xx nodes on COTS
•  xxx TBytes
•  >xxx jobs per day
•  Monitoring: Xymon
•  Graphing via NetStat (SNMP / RRD: x's oids/second)
•  Automatic configuration


Architecture
[Diagram: App Services on top of the stack. Components shown: HIVE Server, HIVE, PIG, Web Service, Oozie, Real Time Query Engine, R, Flume, Sqoop, HCatalog, Cassandra, HBase, MapReduce, Mahout, Khiops, ZooKeeper, Cascading, HDFS. Legend: some components are marked "in POC"]

Benefits
•  Infrastructure cost (€)
•  Development cost
•  Robustness
•  Scalability
•  New development areas (Graph Mining, Logs statistics …)

-70% loc
-50% dev time
-75% run cost


A few of our use cases


Scoring - Search Engine
Graph algorithms for http://www.lemoteur.fr/
xRank
xxx TB in RAM
xx TB compressed
xx billion nodes
>xxx billion edges


Profiling
Customers' statistical behaviors, ads display optimization, …
Customer profile
xxx GB daily + xxx GB monthly (customer DB)


Log Analysis
OJD certified measurements: Internet and Mobile, customers' journey analysis, …
KPIs
xx billion events daily


with NoSQL
Hadoop over Cassandra (next session)


Benefits & Drawbacks

Benefits:
Scalable
Stable
RUN Cost
Development Cost
Performance
Very fast evolution
New Dev areas

Drawbacks:
Learning curve
Debug
Algorithms
Complex
Very fast evolution

Future
•  Enhance security and robustness
•  Create Services & Functional Catalog
•  Continue building our expertise: Fast Data, Cascading, MR2, …
•  A thousand-node cluster!
•  Help other teams go to production

CONTACT US: olivier.varene@orange.com

Thank you
Merci

Olivier Varene

DSIF/DFY
Orange DevDay 2013

My Thanks to
•  Apache http://www.apache.org/
•  http://hadooper.blogspot.com/
•  Cloudera http://www.cloudera.com/
•  HortonWorks http://www.hortonworks.com/
•  And all the community!

