Hadoop at Orange - SophiaConf2012

Presentation of Hadoop at Orange, given during SophiaConf2012.


  1. Distributed Calculus with Hadoop MapReduce inside Orange Search Engine (SophiaConf2012, olivier.varene@orange.com, Tuesday 10 July 2012)
  2. What is Big Data?
  3. $5 billion (2012) to $50 billion (by 2017). Source: Forbes
  4. «Big Data is the new definitive source of competitive advantage across all industries» (Jeff Kelly)
  5. Product success: «The days are over when you build a product once and it just works. You have to take ideas, test them, iterate them, use data and analytics to understand what works and what doesn’t in order to be successful» (LivingSocial @PCWorld 2012)
  6. Big Data
  7. What about Hadoop?
  8. Beliefs: «We believe that by 2015, more than half the world’s data will be processed by Apache Hadoop» (Hortonworks @HadoopSummit 2012)
  9. Actors
  10. Hadoop ecosystem
  11. Context at Orange (more than 2 years ago)
  12. Orange Search Engine: http://www.orange.fr, http://www.lemoteur.fr, http://www.voila.fr
  13. Search Engine Architecture (diagram labels)
      • Internet
      • Around-the-clock web collection (crawl)
      • Indexing
      • Search infrastructure
      • Pre-computation of the PageRank score
      • 5 billion documents
      • 750 million French documents
  14. Main issue
      • PageRank computation on billions of nodes and tens of billions of edges
      • Regularly failed (hardware ...)
      • 4 to 8 weeks of computation
      • Unscalable
      • Failure rate around 80%
      • One person full time to supervise
  15. Answer
  16. PageRank portable to Hadoop / MapReduce?
      • Simple programming model:
        Map(in_k, in_v) => list(out_k, intermed_v)
        Reduce(out_k, list(intermed_v)) => list(out_v)
      • Scalable
      • Batch processing
      • YES!
  17. Hadoop Axioms
      • System shall manage and heal itself
      • Performance shall scale linearly
      • Compute shall move to data
      • Modular and extensible
  18. Our install?
  19. Our install (stack diagram): Oozie, HIVE, PIG, ZooKeeper, MapReduce, Mahout, Khiops, HDFS
  20. HDFS?
  21. Hadoop - HDFS (diagram: a client, a master, and five data nodes; the client performs read/write operations, and the data nodes perform block operations and replication)
      • COTS hardware
      • Replication
      • Big blocks maximize throughput
      • Metadata in RAM
  22. MapReduce?
  23. MapReduce: cat | «your map» | sort | «your reduce» (programming paradigm)
  24. MapReduce: cat | «your map» | sort | «your reduce» (framework)
  25. MapReduce: cat | «your map» | sort | «your reduce» (your job)
  26. MapReduce Insides
  27. Interfaces
      • Java API
      • Pipes
      • Streaming (python, perl, C/C++, ...)
  28. PIG
      • High-level data analysis script language
      • Extensible via UDF
      • Structure of a Pig script:
        • load
        • filter
        • foreach | group by | join | your functions
        • order
        • store
  29. HIVE
      • High-level SQL-like query and analysis language
      • Extensible via UDF
      • Structure of a Hive script:
        • create table
        • load data
        • select ... from ...
        • insert | group by | join
  30. Application domains?
  31. Projects
      • Scoring
      • User profiling
      • Log analysis and statistics
      • ... and many others to come
  32. ROI?
  33. ROI
      • Lines of code: 10X gain
      • Development time: 2X gain
      • IT cost: 4X gain
      • Fewer bugs, automatic, scalable ...
  34. Perfect World?
      ★ YES
        • Run cost
        • Development cost
        • Scalable
        • Stable
        • Heterogeneity
      ★ NO
        • SPOF (almost solved)
        • Tedious debugging
        • Locally non-optimal
        • Single-site
  35. Thank you: Olivier Varene, Orange, olivier.varene@orange.com, @VareneO
  36. Thanks to
      • Hadoop - Apache (http://hadoop.apache.org/)
      • Khiops - Orange (http://www.khiops.com)
      • Shaun Connolly - Hortonworks (http://youtu.be/yPfysFAGv8s)
      • Forbes - article (http://www.forbes.com/sites/siliconangle/2012/02/17/big-data-is-big-market-big-business/)
      • LivingSocial (quotation)
      • Teradata (data volume graph)
      • http://www.wordle.net/ (word cloud)
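
Slide 16 gives the Map and Reduce signatures and answers YES to porting PageRank, and slides 23 to 25 describe a job as a map step, a sort, and a reduce step. The slides do not show Orange's implementation, so the following is only a minimal single-machine sketch of how one PageRank iteration could fit that model. The function names, the damping factor of 0.85, the three-page toy graph, and the record layout `(rank, out_links)` are all assumptions made for illustration; pages with no outgoing links are not handled specially here.

```python
from itertools import groupby
from operator import itemgetter

DAMPING = 0.85  # assumed damping factor; the slides do not give one


def pagerank_map(node, value):
    """Map(in_k, in_v) => list(out_k, intermed_v)."""
    rank, out_links = value
    # Re-emit the adjacency list so the graph survives into the next iteration.
    yield node, ("links", out_links)
    # Share this page's rank equally among the pages it links to.
    for target in out_links:
        yield target, ("rank", rank / len(out_links))


def pagerank_reduce(node, values, node_count):
    """Reduce(out_k, list(intermed_v)) => updated (rank, out_links) record."""
    out_links = []
    received = 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            received += payload
    rank = (1.0 - DAMPING) / node_count + DAMPING * received
    return node, (rank, out_links)


def run_iteration(graph):
    """One pass: map every record, sort by key, then reduce each key group."""
    mapped = [pair
              for node, value in graph.items()
              for pair in pagerank_map(node, value)]
    # The sort stands in for the shuffle/sort phase the framework performs.
    mapped.sort(key=itemgetter(0))
    reduced = (pagerank_reduce(node, [v for _, v in group], len(graph))
               for node, group in groupby(mapped, key=itemgetter(0)))
    return dict(reduced)


if __name__ == "__main__":
    # Hypothetical toy graph: a links to b and c, b links to c, c links to a.
    graph = {
        "a": (1 / 3, ["b", "c"]),
        "b": (1 / 3, ["c"]),
        "c": (1 / 3, ["a"]),
    }
    for _ in range(30):
        graph = run_iteration(graph)
    for node, (rank, _) in sorted(graph.items()):
        print(node, round(rank, 4))
```

Each call to `run_iteration` corresponds to one batch job; on a cluster, the output records of one job would be written to HDFS and read back as the input of the next, repeating until the ranks stop changing.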