Big Data @ Orange - Dev Day 2013 - part 2

Transcript of "Big Data @ Orange - Dev Day 2013 - part 2"

  1. BigData @ Digital Factory: a short story still being written. Olivier Varene, olivier.varene@orange.com, DSIF/DFY. Orange DevDay 2013.
  2. Hadoop
  3. Hadoop - Core: MapReduce, HDFS
  4. Hadoop genealogy
  5. Hadoop timeline. Release lines shown: 0.2[0-2].x, 0.23.x, 1.x, 2.x
  6. Hadoop Distribution: Packaging, Deployment, Support
  7. Main Distributions (columns: Distribution | Licence | Business Model | Support)
       Apache | Apache 2.0 | Foundation | community only
       HortonWorks | Apache 2.0, HortonWorks (add-on) | PS + Training + support | community + Professional
       Cloudera | Apache 2.0, Closed Source (not core) | PS + Licencing + Training + support | community + Professional
       MapR | Apache 2.0, Closed Source (FS) | PS + Licencing + support | community + Professional
       WanDisco | Apache 2.0, Closed Source (DConE) | PS + Licencing + Training + support | community + Professional
       PS: Professional Services
  8. Big Name Distributions (Paying & Closed Source): IBM InfoSphere BigInsights; GreenPlum (EMC); Intel Distribution for Hadoop; …
  9. Big Data Suite: Tooling, Code generation, Scheduling, Integration
  10. Tools (1st level) (columns: Tool | Description | Licence)
       Apache Pig | Scripting Platform | Apache 2
       Apache Hive | Data Access & Query | Apache 2
       Apache HCatalog | Metadata Services | Apache 2
       Apache HBase | NoSQL Database | Apache 2
       Apache ZooKeeper | Cluster Coordination | Apache 2
       Apache Tez | Query processing | Apache 2
       Apache Oozie | Workflow Scheduler | Apache 2
       Apache Sqoop | Data Integration Services | Apache 2
  11. Tools (add-ons) (columns: Tool | Description | Licence)
       Teradata connector | Connector | Teradata + Distribution
       Hive ODBC | ODBC | Distribution
       Mahout | Data Mining | Apache 2
       Cascading | Fault Tolerant API / Framework | Apache 2
       Cassandra Connector | Connector to Cassandra NoSQL | Apache 2
       MongoDB Connector | Connector to MongoDB | Apache 2
       …
  12. Landscape
  13. @ Digital Factory: DSIF / Digital Factory
  14. Back in time, 3 years ago: PageRank computation on billions of nodes and tens of billions of edges; regularly failed (hardware ...); each computation took 4 to 8 weeks; not scalable; failure rate around 80%; one person full time to supervise.
  15. Answer? Internal development: + full control, - long term, - €€. Open source: + €€, + short term, - support, - evolution.
  16. Success: in PRODUCTION since 2010
  17. How does it work?
  18. Hadoop Axioms: the system shall manage and heal itself; performance shall scale linearly; compute shall move to data; modular and extensible.
  19. HDFS (Simple): Self-healing High-Bandwidth Clustered Storage
  20. (no text on this slide)
  21. MapReduce V1 (Simple): cat <data> | <Mapper> | sort | <Reducer>
  22. MapReduce V1 (Simple): cat <data> | … | sort | …  (the parts provided by the framework)
  23. MapReduce V1 (Simple): … <Mapper> … <Reducer>  (the parts provided by your program)
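
The shell pipeline on slides 21 to 23 can be mimicked in a single process. The sketch below is only an illustration, not Hadoop code: the function names and the word-count task are assumptions chosen for the example. The mapper and reducer stand for "your program", while the `sorted` call and the wiring stand for what the framework does.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(sorted_pairs):
    """Sum the values of each run of identical keys in key-sorted input."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_pipeline(lines):
    """Equivalent of: cat <data> | <Mapper> | sort | <Reducer>."""
    return list(reducer(sorted(mapper(lines))))


data = ["hadoop hdfs", "hadoop mapreduce hadoop"]
print(run_pipeline(data))
```

The reducer relies on its input being sorted by key, which is why the sort step sits between the two user-supplied stages.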
  24. (no text on this slide)
  25. (no text on this slide)
  26. YARN: allows plugging in new paradigms
  27. MapReduce V1. [Diagram: Data on HDFS -> several Map() tasks -> Reduce() tasks -> partXX output files. Stage labels: Map, Sort, Partition, Reduce.]
  28. Before map(). [Diagram: Data on HDFS -> slicing / partitioning into blocks of data -> each block is read as (Kin,Vin) records by a Map(), which emits (Kout,Vout).] The JobTracker calculates locality for job assignment and input split data.
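
As a rough illustration of the slicing and locality ideas on slide 28, the sketch below cuts a file into block-sized splits and prefers a free node that already holds a replica of the block. It is a simplification under stated assumptions: `input_splits` and `pick_node` are hypothetical helpers, and the real JobTracker scheduling logic is more involved.

```python
def input_splits(file_size, block_size):
    """Return (offset, length) pairs covering the file, one per block."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


def pick_node(replica_hosts, free_nodes):
    """Prefer a free node that already stores the block, else any free node."""
    for host in replica_hosts:
        if host in free_nodes:
            return host
    return free_nodes[0]


MB = 1024 * 1024
splits = input_splits(200 * MB, 64 * MB)
print(len(splits), splits[-1])
print(pick_node(["node2", "node3"], ["node1", "node3"]))
```

Each split would be handed to one Map() task, so "compute moves to data" whenever `pick_node` finds a replica holder that is free.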
  29. Java (API), Mapper:
        class YourMapper extends Mapper<Kin,Vin,Kout,Vout> {
          [void setup(Context context);]
          [void cleanup(Context context);]
          void map(Kin key, Vin value, Context context) { … your program … }
        }
  30. Before reduce(). [Diagram: Map() emits (Kout,Vout); the output is sorted in RAM and written to disk as temporary intermediate files, sorted within each file, 1 or more times; an OPTIONAL Combine() step takes (Kout,Vout) and emits (Kout,Vout) into temporary intermediate files; Partition() splits the key namespace into parts; the JobTracker distributes the parts to reducers.]
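
The combine and partition steps on slide 30 can be sketched as follows. This is an in-memory illustration, not Hadoop's implementation: the function names are hypothetical, and `zlib.crc32` is used here as a stand-in stable hash (Hadoop's default partitioner hashes the key object instead).

```python
import zlib
from collections import defaultdict


def combine(pairs):
    """Pre-aggregate one map task's output locally to shrink the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Map a key to a reducer index with a stable hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Combine each map output, then route every pair to its partition."""
    parts = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in combine(pairs):
            parts[partition(key, num_reducers)].append((key, value))
    return [sorted(part) for part in parts]


map_outputs = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
for index, part in enumerate(shuffle(map_outputs, 3)):
    print(index, part)
```

Because the partition depends only on the key, every value for a given key ends up at the same reducer, which is what lets the reducer compute a complete result per key.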
  31. Java (API), Reducer:
        class YourReducer extends Reducer<Kin,Vin,Kout,Vout> {
          [void setup(Context context);]
          [void cleanup(Context context);]
          void reduce(Kin key, Iterable<Vin> values, Context context) { … your program … }
        }
  32. Optimization tips: JVM; Algorithm in MapReduce paradigm; Combiner; Sort algorithm; Partitioning
  33. Streaming: … | <mapper> | … | <reducer> | …
        STDIN / STDOUT
        Text as input and output by default
        '\t' as default separator
        Use your language: perl, python, shell, ruby, … (interpreter needed on all nodes)
        hadoop jar $streamingJar -input <inputDir> -output <outputDir> -mapper <mapProg> -reducer <reduceProg> -file <files>
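
A minimal sketch of what a streaming mapper and reducer might look like in Python, assuming a word-count task and the default tab separator mentioned on slide 33. The function names are hypothetical; in a real streaming job each function would be wrapped in its own script that iterates over `sys.stdin` and prints each result line to standard output.

```python
def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum counts over runs of identical keys in sorted input."""
    current, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"


# Local dry run: sorted() plays the role of the framework's sort step.
mapped = sorted(map_lines(["hadoop hdfs", "hadoop"]))
print(list(reduce_lines(mapped)))
```

Unlike the Java API, a streaming reducer receives a flat, key-sorted stream of lines, so it has to detect key changes itself.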
  34. Pipes (C++): … | <mapper> | … | <reducer> | …
        Socket communication
        Bytes as input and output
        C++ API:
          class MyMap: public HadoopPipes::Mapper { … }
          class MyReducer: public HadoopPipes::Reducer { … }
        hadoop fs -put <binFile> <toHDFS…>
        hadoop pipes -input <inputDir> -output <outputDir> -program <path/binfile> [-conf <confFile>]
  35. Too difficult? Fortunately, there are tools that can generate code for you or let you run SQL queries: Tools; Algo / Libs.
  36. PIG. Scripting language: simple; parallel execution; data oriented; extensible via UDF; automatic performance enhancement via compiler.
        set job.name calculateGraphDegres
        %default nbpigreducers 10
        set default_parallel $nbpigreducers
        -- out-degree
        A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
        -- keep entries where out_deg > 1
        A2 = filter A by (out_deg > 1);
        B = order A2 by out_deg DESC;
        store B into '$degoutOrdered';
        -- out-degree distribution
        C = foreach A generate out_deg,1 as deg_occ;
        D = group C by out_deg;
        E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
        F = order E by out_deg ASC;
        store F into '$degoutDistrib';
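
To make the data flow of the Pig script on slide 36 easier to follow, here is a plain-Python sketch of the same two computations on a tiny in-memory sample. The sample URLs and function names are invented for illustration; Pig would run the equivalent steps as distributed jobs.

```python
from collections import Counter


def filter_and_order(rows):
    """Keep rows with out_deg > 1, ordered by out_deg descending (alias B)."""
    kept = [(url, deg) for url, deg in rows if deg > 1]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def degree_distribution(rows):
    """Count how many URLs share each out-degree, ascending (alias F)."""
    counts = Counter(deg for _, deg in rows)
    return sorted(counts.items())


# Hypothetical sample standing in for the '$degout' input.
rows = [("a.example", 3), ("b.example", 1), ("c.example", 3), ("d.example", 2)]
print(filter_and_order(rows))
print(degree_distribution(rows))
```

The `Counter` plays the role of the `group ... by out_deg` followed by `SUM(C.deg_occ)` in the script.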
  37. Hive. Querying language: HiveQL (SQL-like); ETL tool; HDFS, HBase, Thrift …; MapReduce interface (with streaming to python …); extensible via UDF.
        CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
        LOCATION 'b-file/input/';

        CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION 'b-file/output/1/';

        INSERT INTO TABLE b_packet_out
        select count(*) as overall,
          sum(if(protocol rlike '^ip:tcp',1,0)) as tcp,
          sum(if(protocol rlike '^ip:udp',1,0)) as udp,
          sum(if(protocol rlike '^ip:icmp',1,0)) as icmp
        from b_packet;
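
For readers less familiar with HiveQL, the aggregation on slide 37 amounts to the following plain-Python sketch over pipe-delimited lines. The sample records and helper names are assumptions made up for the example; the prefix test mirrors the intent of matching protocols that start with `ip:tcp`, `ip:udp` or `ip:icmp`.

```python
def parse(line):
    """Split one 'timestamp|packet_length|protocol' record."""
    timestamp, length, protocol = line.rstrip("\n").split("|")
    return timestamp, int(length), protocol


def protocol_counts(rows):
    """Count all rows, plus rows whose protocol starts with each prefix."""
    counts = {"overall": 0, "tcp": 0, "udp": 0, "icmp": 0}
    for _timestamp, _length, protocol in rows:
        counts["overall"] += 1
        for name in ("tcp", "udp", "icmp"):
            if protocol.startswith("ip:" + name):
                counts[name] += 1
    return counts


# Hypothetical sample records in the b_packet layout.
lines = ["t1|60|ip:tcp:http", "t2|48|ip:udp:dns", "t3|60|ip:tcp", "t4|28|arp"]
print(protocol_counts(parse(line) for line in lines))
```

Hive expresses the same thing declaratively and compiles it to MapReduce jobs, so no loop has to be written by hand.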
  38. R. RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
        rmr: functions providing mapreduce in R
        rhdfs: functions providing hdfs operations in R
        rhbase: functions providing hbase operations in R
        library(rmr2)
        library(rhdfs)
        gdp <- read.csv("GDP_converted.csv")
        hdfs.init()
        gdp.values <- to.dfs(gdp)
        gdp.map.fn <- function(k,v) {
          key <- ifelse(v[4] < aaplRevenue, "less", "greater")
          keyval(key, 1)
        }
        count.reduce.fn <- function(k,v) {
          keyval(k, length(v))
        }
        count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
        from.dfs(count)
  39. GUI tools: time saver; prototyping; visualize complex processes; fast changes; PoC. But you need to know the internals for optimization.
  40. SQL. Prod / Beta & Alpha products. [Diagram labels: ODBC/JDBC, HiveQL, Impala, JDBC, SQL, HQL, ISO, PSQL, Phoenix, Presto, Hive, HBase, HDFS, Tajo]
  41. Sqoop: transfer from/to HDFS to/from structured storage via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, … [Diagram: RDBMS / NoSQL -> Sqoop import -> Hadoop process -> Sqoop export]
  42. Oozie
  43. Nowadays @ Digital Factory?
  44. In Production: since 2010; growth driven by internal project needs; recycling servers (€€ savings); we learned as we walked: tar -> cdh3 -> cdh4 …, optimizations, run processes …
  45. Production « PFS »: shared among different teams; xx nodes on COTS; xxx TBytes; >xxx jobs per day; monitoring: Xymon; graphing via NetStat (SNMP / RRD: x’s oids/second); automatic configuration.
  46. Architecture. [Diagram labels: App Services, HIVE Server, HIVE, PIG, Web Service, Oozie, Real Time Query Engine, R, Flume, Sqoop, HCatalog, Cassandra, HBase, MapReduce, Mahout, Khiops, ZooKeeper, Cascading, HDFS. Legend: in POC]
  47. Benefits: infrastructure cost; development cost; robustness; scalability; new development areas (Graph Mining, Logs statistics …). Figures (€): -70% LOC, -50% dev time, -75% run cost.
  48. A few of our use cases
  49. Scoring - Search Engine: graph algorithms for http://www.lemoteur.fr/; xRank; xxx TB in RAM; xx TB compressed; xx billion nodes; >xxx billion edges.
  50. Profiling: customers’ statistical behaviors, ads display optimization, … Customer profile: xxx GB daily + xxx GB monthly (customer DB).
  51. Log Analysis: OJD certified measurements (Internet and Mobile), customers’ journey analysis, … KPIs: xx billion events daily.
  52. With NoSQL: Hadoop over Cassandra (next session)
  53. Benefits & Drawbacks. Benefits: scalable; stable; RUN cost; development cost; performance; very fast evolution; new dev areas. Drawbacks: learning curve; debug; algorithms; complex; very fast evolution.
  54. Future: enhance security and robustness; create Services & Functional Catalog; continue building our expertise: Fast Data, Cascading, MR2, …; a thousand-node cluster!; help other teams go to production. CONTACT US: olivier.varene@orange.com
  55. Thank you! Merci! Olivier Varene, olivier.varene@orange.com, DSIF/DFY. Orange DevDay 2013.
  56. My thanks to: Apache http://www.apache.org/; http://hadooper.blogspot.com/; Cloudera http://www.cloudera.com/; HortonWorks http://www.hortonworks.com/; and all the community!