Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Introduction to Spark and examples of new business use cases that real-time Big Data can address, by Cédric Carbone
- Spark vs Hadoop MapReduce (& Hadoop v2 vs Hadoop v1)
- Spark Streaming vs Storm
- Machine learning with Spark
- Business use case: NextProductToBuy

  1. Spark Meetup at Viadeo, Wednesday 4 February 2015
     • 19:00-19:45: Introduction to Spark and examples of new business use cases that real-time Big Data can address. Cédric Carbone, co-founder of Influans (cedric@influans.com)
       - Spark vs Hadoop MapReduce
       - Spark Streaming vs Storm
       - Machine learning with Spark
       - Business use case: NextProductToBuy
     • 19:45-20:00: Extending Spark (Tachyon / Spark JobServer). Jonathan Lamiel, Talend Labs (jlamiel@talend.com)
       - Spark's shared memory with Tachyon
       - Making Spark interactive with Spark JobServer
     • 20:00-21:00: Big Data analytics with Spark & Cassandra. DuyHai DOAN, Technical Advocate at DataStax (duy_hai.doan@datastax.com)
       Apache Spark is a general data processing framework that lets you perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. Combining Spark's flexible API with Cassandra's performance gives an interesting combination for both real-time and batch processing.
  2. Map Reduce
     ➜ Map(): parses the input and generates 0 to n <key, value> pairs
     ➜ Reduce(): sums all values of the same key and generates one <key, value> pair
     WordCount example
     ➜ Each mapper takes a line as input and breaks it into words
       • It emits a key/value pair of the word and 1
     ➜ Each reducer sums the counts for each word
       • It emits a key/value pair of the word and the sum
  3. Map Reduce
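The WordCount flow described on the Map Reduce slides can be sketched with plain Scala collections, without Hadoop or Spark. This is only an illustration of the model: the object name `WordCountSketch` and the explicit `shuffle` step are assumptions made for the example, not code from the talk.

```scala
object WordCountSketch {
  // Map phase: one input line in, zero or more (word, 1) pairs out.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle: group the emitted pairs by key, as a framework would do between the two phases.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum all values of one key and emit a single (word, sum) pair.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Full pipeline: map every line, shuffle by word, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (word, counts) => reducePhase(word, counts) }
}
```

For example, `WordCountSketch.wordCount(Seq("to be or", "not to be"))` counts "to" and "be" twice and "or" and "not" once.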
  4. Hadoop MapReduce v1
  5. Hadoop MapReduce v1
  6. Hadoop MapReduce v1
  7. MapReduce v1: not good for low-latency jobs on small datasets
  8. MapReduce v1: good for offline batch jobs on massive data
  9. Hadoop 1
     ➜ Batch ONLY
       • High-latency jobs
     Stack (diagram, bottom to top):
       HDFS (redundant, reliable storage)
       MapReduce1 (cluster resource management + data processing)
       BATCH: Hive (query), Pig (scripting), Cascading (accelerate dev.)
  10. Hadoop 2: Big Data Operating System
      ➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
        • Simultaneously and with predictable levels of service
        • Data analysts and real-time applications
      Stack (diagram, bottom to top):
        HDFS (redundant, reliable storage)
        YARN (cluster resource management)
        BATCH: MapReduce1 (data processing), other data processing …
  11. Hadoop 2: Big Data Operating System
      ➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
        • Simultaneously and with predictable levels of service
        • Data analysts and real-time applications
      Stack (diagram, bottom to top):
        HDFS (redundant, reliable storage)
        YARN (cluster resource management)
        BATCH, INTERACTIVE, STREAMING, GRAPH, ML, IN-MEMORY, ONLINE, SEARCH
  12. Hadoop 2: Big Data Operating System
      ➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
        • Simultaneously and with predictable levels of service
        • Data analysts and real-time applications
      Stack (diagram, bottom to top):
        HDFS (redundant, reliable storage)
        YARN (cluster resource management)
        BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Samza, Spark Streaming), GRAPH (Giraph, GraphX), MACHINE LEARNING (MLlib), IN-MEMORY (Spark), ONLINE (HBase, HOYA), OTHER (Elasticsearch)
  13. https://spark.apache.org
      Apache Spark™ is a fast and general engine for large-scale data processing.
  14. The most active project
      [Two bar charts comparing MapReduce, Storm, YARN and Spark: number of patches (axis 0 to 250) and lines added (axis 0 to 45,000)]
  15. Spark won the Daytona GraySort contest!
      Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
      Sorted 100 TB of data on disk 3x faster than Hadoop MapReduce, using 10x fewer machines.
  16. RDDs & Operations
      Resilient Distributed Datasets (RDDs)
      Operations
      ➜ Transformations (e.g. map, filter, groupBy)
      ➜ Actions (e.g. count, collect, save)
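In Spark, transformations are lazy: they only describe a new dataset, and nothing is computed until an action asks for a result. The sketch below imitates that split with a plain Scala collection view rather than an RDD, so it runs without Spark. The object name `LazyPipelineSketch` and the call counter are assumptions added purely to make the laziness observable.

```scala
object LazyPipelineSketch {
  // Returns (predicate calls made before the action, result of the action).
  def countMatching(lines: Seq[String], keyword: String): (Int, Int) = {
    var predicateCalls = 0
    // "Transformation": describes the filter on a lazy view; no line is inspected yet.
    val matching = lines.view.filter { line =>
      predicateCalls += 1
      line.contains(keyword)
    }
    val callsBeforeAction = predicateCalls
    // "Action": forces the pipeline to run and returns a value to the caller.
    val result = matching.size
    (callsBeforeAction, result)
  }
}
```

Calling `LazyPipelineSketch.countMatching(lines, "Spark")` reports zero predicate calls before the action, which is the same behavior `filter` followed by `count()` has on an RDD.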
  17. Spark
      scala> val textFile = sc.textFile("README.md")
      ➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
      scala> textFile.count()
      ➜ res0: Long = 126
      scala> textFile.first()
      ➜ res1: String = # Apache Spark
      scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
      ➜ linesWithSpark: spark.RDD[String]=spark.FilteredRDD@7dd4
      scala> textFile.filter(line=>line.contains("Spark")).count()
      ➜ res3: Long = 15
  18. Streaming
  19. Storm
  20. Storm
  21. Storm vs Spark
                               Spark Streaming    Storm               Storm Trident
      Processing model         Micro-batches      Record-at-a-time    Micro-batches
      Throughput               ++++               ++                  ++++
      Latency                  Seconds            Sub-second          Seconds
      Reliability model        Exactly once       At least once       Exactly once
      Embedded Hadoop distro   HDP, CDH, MapR     HDP                 HDP
      Support                  Databricks         N/A                 N/A
      Community                ++++               ++                  ++

                               Spark                              Storm
      Scope                    Batch, Streaming, Graph, ML, SQL   Streaming only
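The "processing model" row of the comparison is the key difference: a record-at-a-time engine hands each event to the handler as it arrives, while a micro-batch engine groups events by a fixed interval and processes each group as a small batch. The following is a minimal sketch of the two models on an in-memory sequence. It uses neither the Storm nor the Spark Streaming API, and the `Event` type, the function names and the millisecond timestamps are all hypothetical.

```scala
object ProcessingModelSketch {
  // Hypothetical event type: a timestamp in milliseconds and a payload.
  final case class Event(timestampMs: Long, payload: String)

  // Record-at-a-time: the handler is invoked once per event.
  def recordAtATime[A](events: Seq[Event])(handle: Event => A): Seq[A] =
    events.map(handle)

  // Micro-batching: events are grouped by batch interval,
  // and the handler sees one whole batch at a time.
  def microBatches[A](events: Seq[Event], batchIntervalMs: Long)(handle: Seq[Event] => A): Seq[(Long, A)] =
    events
      .groupBy(event => event.timestampMs / batchIntervalMs) // batch index of each event
      .toSeq
      .sortBy(_._1)                                          // process batches in time order
      .map { case (batchIndex, batch) => (batchIndex, handle(batch)) }
}
```

With a 1000 ms interval, events at 100 ms and 900 ms land in batch 0 and an event at 1200 ms lands in batch 1. No result for a batch is available before its interval closes, which is consistent with the second-level latency listed for the micro-batch engines in the table.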
  22. Machine Learning Library (MLlib)
  23. Collaborative Filtering
  24. Collaborative Filtering (learning)
  25. Collaborative Filtering (learning)
  26. Collaborative Filtering (learning)
  27. Collaborative Filtering: let's use the model
  28. Collaborative Filtering: similar behaviors
  29. Collaborative Filtering: prediction
  30. Netflix Prize (2009)
      Netflix is a provider of on-demand Internet streaming media
  31. Input Data
      UserID::MovieID::Rating::Timestamp
      1::1193::5::978300760
      1::661::3::978302109
      1::914::3::978301968
      Etc…
      2::1357::5::978298709
      2::3068::4::978299000
      2::1537::4::978299620
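Before any model can be trained, each input line has to be split on the `::` separator and converted to typed fields. A minimal parsing sketch follows. The `Rating` case class here is a hypothetical local type defined for the example (not MLlib's class of the same name), and returning `None` for malformed lines is a design choice of this sketch.

```scala
import scala.util.Try

object RatingParserSketch {
  // Hypothetical record type mirroring the UserID::MovieID::Rating::Timestamp layout.
  final case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // Parses one input line; returns None when the line does not have
  // exactly four fields or a field is not numeric.
  def parse(line: String): Option[Rating] =
    line.trim.split("::") match {
      case Array(user, movie, rating, ts) =>
        Try(Rating(user.toInt, movie.toInt, rating.toDouble, ts.toLong)).toOption
      case _ => None
    }
}
```

For instance, `RatingParserSketch.parse("1::1193::5::978300760")` yields a rating of 5.0 by user 1 for movie 1193, and with `lines.flatMap(RatingParserSketch.parse)` any unparsable line is simply dropped.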
  32. The result
      1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
      1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
      1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
      1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
      1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
      2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
      2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
      2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
      2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
      2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
      3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
      3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
      3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19
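The rows above appear to list, for each user, the movies with the highest predicted rating in descending order. Assuming the model has already produced a predicted rating for every (user, movie) pair, selecting the top N per user could look like the sketch below. The `Prediction` type and the `topN` function are hypothetical names for this example, and the slides do not show how the speaker actually produced the list.

```scala
object TopNSketch {
  // Hypothetical output of a trained model: one predicted rating per (user, movie) pair.
  final case class Prediction(userId: Int, movieId: Int, predictedRating: Double)

  // Keeps the n highest-rated predictions for each user, best first.
  def topN(predictions: Seq[Prediction], n: Int): Map[Int, Seq[Prediction]] =
    predictions
      .groupBy(_.userId)
      .map { case (userId, userPredictions) =>
        (userId, userPredictions.sortBy(p => -p.predictedRating).take(n))
      }
}
```

This in-memory version is only meant to show the selection logic; on a real dataset the same grouping and sorting would have to be done on distributed data.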
  33. Real-Time Big Data Use Case: Next Product To Buy
      ➜ Right Person
      ➜ Right Product
      ➜ Right Price
      ➜ Right Time
      ➜ Right Channel
  34. Questions?
      Cédric Carbone, cedric@influans.com, @carbone
      www.hugfrance.fr, hug-france-orga@googlegroups.com, @hugfrance
