
Spark - Alexis Seigneurin (English)


Spark presentation in English


  1. Alexis Seigneurin @aseigneurin @ippontech
  2. Spark
     ● Processing of large volumes of data
     ● Distributed processing on commodity hardware
     ● Written in Scala, with Java and Python bindings
  3. History
     ● 2009: AMPLab, UC Berkeley
     ● June 2013: entered the Apache Incubator
     ● February 2014: "top-level project" of the Apache Foundation
     ● May 2014: version 1.0.0
     ● Currently: version 1.2.0
  4. Use cases
     ● Log analysis
     ● Processing of text files
     ● Analytics
     ● Distributed search (as Google once did)
     ● Fraud detection
     ● Product recommendation
  5. Proximity with Hadoop
     ● Same use cases
     ● Same development model: MapReduce
     ● Integration with the ecosystem
  6. Simpler than Hadoop
     ● API simpler to learn
     ● “Relaxed” MapReduce
     ● Spark Shell: interactive processing
  7. Faster than Hadoop
     Spark officially set a new record in large-scale sorting (5 November 2014)
     ● Sorting 100 TB of data
     ● Hadoop MR: 72 minutes
       ○ With 2,100 nodes (50,400 cores)
     ● Spark: 23 minutes
       ○ With 206 nodes (6,592 cores)
  8. Spark ecosystem
     ● Spark
     ● Spark Shell
     ● Spark Streaming
     ● Spark SQL
     ● Spark ML
     ● GraphX
  9. Integration
     ● YARN, ZooKeeper, Mesos
     ● HDFS
     ● Cassandra
     ● Elasticsearch
     ● MongoDB
  10. Spark: operating principle
  11. RDD
     ● Resilient Distributed Dataset
     ● Abstraction of a collection processed in parallel
     ● Fault tolerant
     ● Can work with tuples:
       ○ Key - Value
       ○ Tuples must be independent from each other
  12. Sources
     ● Files on HDFS
     ● Local files
     ● Collection in memory
     ● Amazon S3
     ● NoSQL database
     ● ...
     ● Or a custom implementation of InputFormat
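
As a quick sketch of the first of these sources with the Java API (the file path and class name are hypothetical), an RDD can be built from a text file or from an in-memory collection:

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SourcesDemo {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "sources-demo");

            // RDD backed by a text file, one String per line
            // (an hdfs:// or s3n:// URL works the same way as a local path)
            JavaRDD<String> lines = sc.textFile("data/some-file.txt");

            // RDD backed by an in-memory collection
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            System.out.println(lines.count() + " lines, " + numbers.count() + " numbers");

            sc.stop();
        }
    }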
  13. Transformations
     ● Processes an RDD, returns another RDD
     ● Lazy!
     ● Examples:
       ○ map(): one value → another value
       ○ mapToPair(): one value → a tuple
       ○ filter(): filters values/tuples given a condition
       ○ groupByKey(): groups values by key
       ○ reduceByKey(): aggregates values by key
       ○ join(), cogroup()...: joins two RDDs
  14. Actions
     ● Do not return an RDD
     ● Examples:
       ○ count(): counts values/tuples
       ○ saveAsHadoopFile(): saves results in Hadoop’s format
       ○ foreach(): applies a function to each item
       ○ collect(): retrieves values into a list (List<T>)
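
The lazy/eager split between transformations and actions is worth seeing once in code. In this minimal sketch (the values are made up for illustration), map() and filter() only record the lineage; nothing is computed until count() is called:

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LazyDemo {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "lazy-demo");

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

            // Transformations: Spark only records the lineage, nothing runs yet
            JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
            JavaRDD<Integer> large = doubled.filter(x -> x > 6);

            // Action: triggers the actual computation of the whole chain
            System.out.println(large.count()); // prints 3 (for 8, 10 and 12)

            sc.stop();
        }
    }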
  15. Example
  16. Spark - Example
     ● Trees of Paris: CSV file, Open Data
     ● Count of trees by species

     geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
     48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
     48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
     48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
     48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
     ...
  17. Spark - Example

     JavaSparkContext sc = new JavaSparkContext("local", "arbres");
     sc.textFile("data/arbresalignementparis2010.csv")
         .filter(line -> !line.startsWith("geom"))
         .map(line -> line.split(";"))
         .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))
         .reduceByKey((x, y) -> x + y)
         .sortByKey()
         .foreach(t -> System.out.println(t._1 + " : " + t._2));

     [Diagram: the pipeline textFile → filter → map → mapToPair → reduceByKey →
     sortByKey → foreach, showing the header line being filtered out and the
     per-species counts being aggregated step by step]
  18. Spark - Example

     Acacia dealbata : 2
     Acer acerifolius : 39
     Acer buergerianum : 14
     Acer campestre : 452
     ...
  19. Spark clusters
  20. Topology & Terminology
     ● One master / several workers
       ○ (+ one standby master)
     ● Submit an application to the cluster
     ● Execution managed by a driver
  21. Spark in a cluster
     Several options:
     ● YARN
     ● Mesos
     ● Standalone
       ○ Workers started manually
       ○ Workers started by the master
  22. Storage & Processing
     MapReduce
     ● Spark (API)
     ● Distributed processing
     ● Fault tolerant
     Storage
     ● HDFS, NoSQL databases...
     ● Distributed storage
     ● Fault tolerant
  23. Data locality
     ● Process the data where it is stored
     ● Avoid network I/O
  24. Data locality
     [Diagram: three nodes, each running a Spark Worker alongside an HDFS
     Datanode; a Spark Master alongside the HDFS Namenode, each with a
     standby instance]
  25. Demo: Spark in a cluster
  26. Demo

     $ $SPARK_HOME/sbin/start-master.sh
     $ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
         spark://MBP-de-Alexis:7077 --cores 2 --memory 2G
     $ mvn clean package
     $ $SPARK_HOME/bin/spark-submit \
         --master spark://MBP-de-Alexis:7077 \
         --class com.seigneurin.spark.WikipediaMapReduceByKey \
         --deploy-mode cluster \
         target/pres-spark-0.0.1-SNAPSHOT.jar
  27. Spark SQL
  28. Spark SQL
     ● Querying an RDD with SQL
     ● SQL engine: converts SQL instructions into low-level instructions
  29. Spark SQL
     Prerequisites:
     ● Use tabular data
     ● Describe the schema → SchemaRDD
     Describing the schema:
     ● Programmatic description of the data
     ● Schema inference through reflection (POJO)
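
The next slides use the programmatic approach. For the reflection-based alternative, here is a minimal sketch against the Spark 1.2-era Java API (the Tree bean and the map() chain are illustrative): the column names and types are inferred from the bean's getters.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.api.java.JavaSQLContext;
    import org.apache.spark.sql.api.java.JavaSchemaRDD;

    public class ReflectionSchemaDemo {
        // JavaBean whose getters define the columns (hypothetical helper class)
        public static class Tree implements java.io.Serializable {
            private float hauteurenm;
            private String espece;
            public Tree(float hauteurenm, String espece) {
                this.hauteurenm = hauteurenm;
                this.espece = espece;
            }
            public float getHauteurenm() { return hauteurenm; }
            public String getEspece() { return espece; }
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "arbres");
            JavaSQLContext sqlContext = new JavaSQLContext(sc);

            JavaRDD<Tree> trees = sc.textFile("data/arbresalignementparis2010.csv")
                    .filter(line -> !line.startsWith("geom"))
                    .map(line -> line.split(";"))
                    .map(fields -> new Tree(Float.parseFloat(fields[3]), fields[4]));

            // Schema inferred by reflection on the bean
            JavaSchemaRDD schemaRDD = sqlContext.applySchema(trees, Tree.class);
            schemaRDD.registerTempTable("tree");
        }
    }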
  30. Spark SQL - Example
     ● Creating tabular data (type Row)

     JavaRDD<Row> rdd = trees.map(fields -> Row.create(
         Float.parseFloat(fields[3]), fields[4]));

     ------------------------------------
     | 10.0 | Aesculus hippocastanum    |
     | 15.0 | Tilia platyphyllos        |
     | 0.0  | Platanus x hispanica      |
     | 10.0 | Paulownia tomentosa       |
     | ...  | ...                       |
     ------------------------------------
  31. Spark SQL - Example
     ● Describing the schema

     List<StructField> fields = new ArrayList<StructField>();
     fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
     fields.add(DataType.createStructField("espece", DataType.StringType, false));
     StructType schema = DataType.createStructType(fields);

     JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
     schemaRDD.registerTempTable("tree");

     -----------------------------------------
     | hauteurenm | espece                   |
     -----------------------------------------
     | 10.0       | Aesculus hippocastanum   |
     | 15.0       | Tilia platyphyllos       |
     | 0.0        | Platanus x hispanica     |
     | 10.0       | Paulownia tomentosa      |
     | ...        | ...                      |
     -----------------------------------------
  32. Spark SQL - Example
     ● Counting trees by species

     sqlContext.sql("SELECT espece, COUNT(*) FROM tree " +
             "WHERE espece <> '' GROUP BY espece ORDER BY espece")
         .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));

     Acacia dealbata : 2
     Acer acerifolius : 39
     Acer buergerianum : 14
     Acer campestre : 452
     ...
  33. Spark Streaming
  34. Micro-batches
     ● Slices a continuous flow of data into batches
     ● Same API as for RDDs
     ● ≠ Apache Storm
  35. DStream
     ● Discretized Stream
     ● Sequence of RDDs
     ● Initialized with a Duration
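
As a minimal sketch (the host and port are made up), a DStream of text lines can be created from a socket; the Duration passed to the streaming context is the batch interval, i.e. one RDD per 5-second slice:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class DStreamDemo {
        public static void main(String[] args) {
            // local[2]: one core for the receiver, one for the processing
            JavaSparkContext sc = new JavaSparkContext("local[2]", "dstream-demo");

            // one RDD produced every 5 seconds
            JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000));

            // DStream of text lines read from a socket (hypothetical host/port)
            JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
            lines.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }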
  36. Window operations
     ● Sliding window
     ● Reuses data from other windows
     ● Initialized with a window length and a slide interval
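
Continuing the previous sketch (these lines go before ssc.start(), with java.util.Arrays, scala.Tuple2 and JavaPairDStream added to the imports), this counts words over a 30-second window recomputed every 10 seconds; both durations must be multiples of the 5-second batch interval:

    JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
    JavaPairDStream<String, Integer> counts = words
            .mapToPair(word -> new Tuple2<String, Integer>(word, 1))
            // window length: 30 s of data; slide interval: recomputed every 10 s
            .reduceByKeyAndWindow((x, y) -> x + y, new Duration(30000), new Duration(10000));
    counts.print();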
  37. Sources
     ● Socket
     ● Kafka
     ● Flume
     ● HDFS
     ● MQ (ZeroMQ...)
     ● Twitter
     ● ...
     ● Or a custom implementation of Receiver
  38. Demo: Spark Streaming
  39. Spark Streaming Demo
     ● Receive Tweets with hashtag #Android
       ○ Twitter4J
     ● Detection of the language of the Tweet
       ○ Language Detection
     ● Indexing with Elasticsearch
     ● Reporting with Kibana 4
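
A sketch of the receiving side only (assuming the spark-streaming-twitter module, with Twitter4J credentials supplied through the twitter4j.oauth.* system properties; the language detection and Elasticsearch indexing steps are omitted):

    import twitter4j.Status;

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.twitter.TwitterUtils;

    public class TweetsDemo {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "tweets-demo");
            JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

            // Stream of tweets matching the #Android filter
            JavaDStream<Status> tweets =
                    TwitterUtils.createStream(ssc, new String[] { "#Android" });

            tweets.map(status -> status.getUser().getScreenName() + " : " + status.getText())
                  .print();

            ssc.start();
            ssc.awaitTermination();
        }
    }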
  40. Demo
     ● Launch Elasticsearch:

     $ curl -X DELETE localhost:9200/spark
     $ curl -X PUT localhost:9200/spark
     $ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
         "tweets": {
           "properties": {
             "user": {"type": "string", "index": "not_analyzed"},
             "text": {"type": "string"},
             "createdAt": {"type": "date", "format": "date_time"},
             "language": {"type": "string", "index": "not_analyzed"}
           }
         }
       }'

     ● Launch Kibana -> http://localhost:5601
     ● Launch the Spark Streaming process
  41. @aseigneurin - aseigneurin.github.io
      @ippontech - blog.ippon.fr
