Big Data @ Orange - Dev Day 2013 - part 2

Transcript

  • 1. BigData @ Digital Factory: a short story still being written. Olivier Varene, olivier.varene@orange.com, DSIF/DFY, Orange DevDay 2013
  • 2. Hadoop
  • 3. Hadoop - Core: MapReduce, HDFS
  • 4. Hadoop genealogy
  • 5. Hadoop time bar: 0.2[0-2].x, 0.23.x, 1.x, 2.x
  • 6. Hadoop Distribution: Packaging, Deployment, Support
  • 7. Main Distributions (Licence; Business Model; Support)
      •  Apache: Apache 2.0; Foundation; community only
      •  HortonWorks: Apache 2.0, HortonWorks (add-on); PS + Training + support; community + professional
      •  Cloudera: Apache 2.0, closed source (not core); PS + Licensing + Training + support; community + professional
      •  MapR: Apache 2.0, closed source (FS); PS + Licensing + support; community + professional
      •  WanDisco: Apache 2.0, closed source (DConE); PS + Licensing + Training + support; community + professional
    PS: Professional Services
  • 8. Big Name Distributions (paying and closed source)
      •  IBM InfoSphere BigInsights
      •  GreenPlum (EMC)
      •  Intel Distribution for Hadoop
      •  …
  • 9. Big Data Suite: Tooling, Code generation, Scheduling, Integration
  • 10. Tools (1st level) (Tool: Description; Licence)
      •  Apache Pig: Scripting Platform; Apache 2
      •  Apache Hive: Data Access & Query; Apache 2
      •  Apache HCatalog: Metadata Services; Apache 2
      •  Apache HBase: NoSQL Database; Apache 2
      •  Apache ZooKeeper: Cluster Coordination; Apache 2
      •  Apache Tez: Query processing; Apache 2
      •  Apache Oozie: Workflow Scheduler; Apache 2
      •  Apache Sqoop: Data Integration Services; Apache 2
  • 11. Tools (add-ons) (Tool: Description; Licence)
      •  Teradata connector: Connector; Teradata + Distribution
      •  Hive ODBC: ODBC; Distribution
      •  Mahout: Data Mining; Apache 2
      •  Cascading: Fault Tolerant API / Framework; Apache 2
      •  Cassandra Connector: Connector to Cassandra NoSQL; Apache 2
      •  MongoDB Connector: Connector to MongoDB; Apache 2
      •  …
  • 12. Landscape
  • 13. @ Digital Factory: DSIF / Digital Factory
  • 14. Back in Time (3 years ago)
      •  PageRank computation on billions of nodes and tens of billions of edges
      •  Regularly failed (hardware ...)
      •  4 to 8 weeks per computation
      •  Not scalable
      •  Failure rate around 80%
      •  One person full time to supervise
  • 15. Answer?
      •  Internal development: + full control, - long term, - €€
      •  Open source: + €€, + short term, - support, - evolution
  • 16. Success: in PRODUCTION since 2010
  • 17. How does it work?
  • 18. Hadoop Axioms
      •  The system shall manage and heal itself
      •  Performance shall scale linearly
      •  Compute shall move to data
      •  Modular and extensible
  • 19. HDFS (Simple): self-healing, high-bandwidth clustered storage
  • 21. MapReduce V1 (Simple): cat <data> | <Mapper> | sort | <Reducer>
  • 22. MapReduce V1 (Simple): cat <data> | … | sort | … is the framework
  • 23. MapReduce V1 (Simple): … <Mapper> … <Reducer> is your program
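The pipeline on slides 21 to 23 can be imitated in a single process. The following is only a sketch under stated assumptions: the word-count job, the sample data and the names `mapper`, `reducer` and `run_job` are illustrative and do not come from the talk. It shows which part a developer writes (the two functions) and which part the framework supplies (feeding records, sorting by key, grouping).

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Your program, part 1: emit one (word, 1) pair per word."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Your program, part 2: sum all the counts seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Stand-in for the framework: cat | mapper | sort | reducer."""
    # "cat <data> | <Mapper>": apply the mapper to every input record.
    mapped = [pair for line in lines for pair in mapper(line)]
    # "| sort |": pairs are sorted by key between the two phases.
    mapped.sort(key=itemgetter(0))
    # "| <Reducer>": each key is reduced once, with all of its values.
    return [reducer(key, (value for _, value in group))
            for key, group in groupby(mapped, key=itemgetter(0))]


if __name__ == "__main__":
    for word, count in run_job(["big data at orange", "big cluster big jobs"]):
        print(word, count)
```

On a real cluster the sort and grouping are distributed across machines, which this single-process sketch does not attempt to model.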
  • 26. YARN: allows plugging in new paradigms
  • 27. MapReduce V1 (diagram): Data on HDFS -> Map() tasks -> Sort, Partition -> Reduce() tasks -> partXX output files
  • 28. Before map() (diagram): Data on HDFS -> slicing / partitioning into blocks of data -> each Map() reads (Kin,Vin) and emits (Kout,Vout). The JobTracker calculates locality for job assignment and input split data.
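  • 29. Java (API): Mapper
        import org.apache.hadoop.mapreduce.Mapper;

        class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
            // optional: protected void setup(Context context)
            // optional: protected void cleanup(Context context)
            @Override
            protected void map(Kin key, Vin value, Context context) {
                // ... your program ...
            }
        }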
  • 30. Before reduce() (diagram): Map() emits (Kout,Vout) -> RAM sorting -> disk write to temporary intermediate files, sorted in each file -> OPTIONAL Combine(), 1 or more times, writing (Kout,Vout) to temporary intermediate files -> Partition() (key namespace partitioning) into parts -> JobTracker distribution to reducers
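The combine and partition steps on slide 30 can be sketched in a few lines. This is an illustration under assumptions, not Hadoop's implementation: the function names are hypothetical, `zlib.crc32` stands in for the key's hash code, and the sort-and-spill to disk is reduced to an in-memory sort.

```python
import zlib
from collections import defaultdict


def combine(pairs):
    """Pre-aggregate one map task's output: one (key, partial_sum) per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Choose the reducer for a key; crc32 is a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route the combined pairs of every map task to its reducer's part."""
    parts = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in combine(pairs):
            parts[partition(key, num_reducers)].append((key, value))
    return parts
```

Because the partition depends only on the key, every value for a given key reaches the same reducer, while the combiner cuts the number of pairs that have to travel over the network.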
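  • 31. Java (API): Reducer
        import org.apache.hadoop.mapreduce.Reducer;

        class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
            // optional: protected void setup(Context context)
            // optional: protected void cleanup(Context context)
            @Override
            protected void reduce(Kin key, Iterable<Vin> values, Context context) {
                // ... your program ...
            }
        }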
  • 32. Optimization tips
      •  JVM
      •  Algorithm in MapReduce paradigm
      •  Combiner
      •  Sort algorithm
      •  Partitioning
  • 33. Streaming: … | <mapper> | … | <reducer> | …
      •  STDIN / STDOUT
      •  Text as input and output by default
      •  '\t' as default separator
      •  Use your language: perl, python, shell, ruby, … (interpreter needed on all nodes)
        hadoop jar $streamingJar -input <inputDir> -output <outputDir> \
            -mapper <mapProg> -reducer <reduceProg> -file <files>
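A streaming mapper and reducer only need to read lines on STDIN and write tab-separated lines on STDOUT, as the slide above states. The script below is a minimal word-count sketch; the file name `wordcount.py` and the `map` / `reduce` command-line argument are assumptions made for this example, not part of the talk.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word read from the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


STEPS = {"map": map_lines, "reduce": reduce_lines}

# Run as "python wordcount.py map" or "python wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STEPS:
    for out in STEPS[sys.argv[1]](sys.stdin):
        print(out)
```

With these assumptions, the job would be submitted with `-mapper "python wordcount.py map" -reducer "python wordcount.py reduce" -file wordcount.py`, and it can be checked locally first with `cat <data> | python wordcount.py map | sort | python wordcount.py reduce`.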
  • 34. Pipes (C++): … | <mapper> | … | <reducer> | …
      •  Socket communication
      •  Bytes as input and output
      •  C++ API
        class MyMap: public HadoopPipes::Mapper { … }
        class MyReducer: public HadoopPipes::Reducer { … }

        hadoop fs -put <binFile> <toHDFS…>
        hadoop pipes -input <inputDir> -output <outputDir> \
            -program <path/binfile> [-conf <confFile>]
  • 35. Too difficult? Fortunately, there are tools that can generate code for you or let you run SQL queries: Tools, Algo / Libs
  • 36. Pig: scripting language
      •  Simple
      •  Parallel execution
      •  Data oriented
      •  Extensible via UDF
      •  Automatic performance enhancement via compiler
        set job.name calculateGraphDegres

        %default nbpigreducers 10
        set default_parallel $nbpigreducers

        -- out-degrees
        A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
        -- keep entries where out_deg > 1
        A2 = filter A by (out_deg > 1);
        B = order A2 by out_deg DESC;
        store B into '$degoutOrdered';

        -- distribution of out-degrees
        C = foreach A generate out_deg,1 as deg_occ;
        D = group C by out_deg;
        E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
        F = order E by out_deg ASC;
        store F into '$degoutDistrib';
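For readers unfamiliar with Pig Latin, the two outputs of the script on slide 36 can be restated in plain Python. This is only a single-machine sketch of what the relations compute; the function names and the `(url, out_deg)` tuples are assumptions for illustration, and nothing here reflects how Pig compiles the script into MapReduce jobs.

```python
from collections import Counter


def ordered_out_degrees(rows):
    """Like A2 and B: keep rows with out_deg > 1, highest out-degree first."""
    kept = [(url, deg) for url, deg in rows if deg > 1]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def out_degree_distribution(rows):
    """Like C to F: count how many URLs share each out-degree, ascending."""
    return sorted(Counter(deg for _, deg in rows).items())


if __name__ == "__main__":
    # Hypothetical sample of (url, out_deg) records.
    sample = [("a", 3), ("b", 1), ("c", 3), ("d", 2)]
    print(ordered_out_degrees(sample))
    print(out_degree_distribution(sample))
```

Note that, as in the Pig script, the distribution is computed from all rows, not only from those that pass the `out_deg > 1` filter.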
  • 37. Hive: querying language
      •  HiveQL (SQL-like)
      •  ETL tool
      •  HDFS, HBase, Thrift …
      •  MapReduce interface (with streaming to python …)
      •  Extensible via UDF
        CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
        LOCATION 'b-file/input/';

        CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION 'b-file/output/1/';

        INSERT INTO TABLE b_packet_out
        select count(*) as overall,
               sum(if(protocol rlike '^ip:tcp', 1, 0)) as tcp,
               sum(if(protocol rlike '^ip:udp', 1, 0)) as udp,
               sum(if(protocol rlike '^ip:icmp', 1, 0)) as icmp
        from b_packet;
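The aggregate in the query on slide 37 amounts to one overall count plus one conditional count per protocol prefix. The sketch below restates that logic in Python for illustration only; the function name and the sample protocol strings are assumptions, and `str.startswith` plays the role of the `^ip:…` prefix match.

```python
def protocol_counts(protocols):
    """Return (overall, tcp, udp, icmp) for a list of protocol strings."""
    prefixes = ("ip:tcp", "ip:udp", "ip:icmp")
    per_prefix = [sum(1 for p in protocols if p.startswith(prefix))
                  for prefix in prefixes]
    return (len(protocols), *per_prefix)


if __name__ == "__main__":
    # Hypothetical values of the "protocol" column.
    print(protocol_counts(["ip:tcp:http", "ip:udp:dns", "ip:tcp", "arp"]))
```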
  • 38. R: RHadoop, https://github.com/RevolutionAnalytics/RHadoop/wiki
      •  rmr: functions providing MapReduce in R
      •  rhdfs: functions providing HDFS operations in R
      •  rhbase: functions providing HBase operations in R
        library(rmr2)
        library(rhdfs)

        gdp <- read.csv("GDP_converted.csv")
        hdfs.init()
        gdp.values <- to.dfs(gdp)

        gdp.map.fn <- function(k,v) {
          key <- ifelse(v[4] < aaplRevenue, "less", "greater")
          keyval(key, 1)
        }
        count.reduce.fn <- function(k,v) {
          keyval(k, length(v))
        }

        count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
        from.dfs(count)
  • 39. GUI Tools
      •  Time saver
      •  Prototyping
      •  Visualize complex processes
      •  Fast changes
      •  PoC
    But you need to know the internals for optimization
  • 40. SQL (diagram): production, beta and alpha products. Engines: Impala, Phoenix, Presto, Tajo, Hive, on top of HBase and HDFS. Access and query interfaces shown: ODBC/JDBC, JDBC, HiveQL, HQL, SQL, ISO, PSQL
  • 41. Sqoop: transfers data between HDFS and structured storage, in both directions, via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, … Flow: RDBMS / NoSQL -> Sqoop import -> Hadoop process -> Sqoop export
  • 42. Oozie
  • 43. Nowadays @ Digital Factory?
  • 44. In Production
      •  Since 2010
      •  Growth driven by internal project needs
      •  Recycling servers (€€ savings)
      •  We learned as we went:
          * tar -> cdh3 -> cdh4 …
          * optimizations
          * run processes …
  • 45. Production "PFS"
      •  Shared among different teams
      •  xx nodes on COTS
      •  xxx TBytes
      •  >xxx jobs per day
      •  Monitoring: Xymon
      •  Graphing via NetStat (SNMP / RRD: x's OIDs/second)
      •  Automatic configuration
  • 46. Architecture (diagram): App Services, Web Service, Hive Server, Hive, Pig, Oozie, Real Time Query Engine, R, Flume, Sqoop, HCatalog, Cassandra, HBase, MapReduce, Mahout, Khiops, ZooKeeper, Cascading, HDFS. Some components are marked "in POC" on the slide.
  • 47. Benefits
      •  Infrastructure cost
      •  Development cost
      •  Robustness
      •  Scalability
      •  New development areas (Graph Mining, Logs statistics …)
    Figures shown on the slide (€): -70% LOC, -50% dev time, -75% run cost
  • 48. A few of our use cases
  • 49. Scoring - Search Engine: graph algorithms for http://www.lemoteur.fr/ (xRank). xxx TB in RAM, xx TB compressed, xx billion nodes, >xxx billion edges
  • 50. Profiling: customers' statistical behaviors, ads display optimization, … Customer profile: xxx GB daily + xxx GB monthly (customer DB)
  • 51. Log Analysis: OJD certified measurements (Internet and Mobile), customers' journey analysis, … KPIs: xx billion events daily
  • 52. With NoSQL: Hadoop over Cassandra (next session)
  • 53. Benefits & Drawbacks
      •  Benefits: Scalable, Stable, Run cost, Development cost, Performance, Very fast evolution, New dev areas
      •  Drawbacks: Learning curve, Debug, Algorithms, Complex, Very fast evolution
  • 54. Future
      •  Enhance security and robustness
      •  Create Services & Functional Catalog
      •  Continue building our expertise: Fast Data, Cascading, MR2, …
      •  A thousand-node cluster!
      •  Help other teams go into production
    CONTACT US: olivier.varene@orange.com
  • 55. Thank you! Olivier Varene, DSIF/DFY, Orange DevDay 2013
  • 56. My thanks to
      •  Apache http://www.apache.org/
      •  http://hadooper.blogspot.com/
      •  Cloudera http://www.cloudera.com/
      •  HortonWorks http://www.hortonworks.com/
      •  And all the community!