Apache Spark

Intro-level slides on Apache Spark

  1. Apache Spark (Arnon Rotem-Gal-Oz)
  2. Demo
     import spark.implicits._
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql._

     // Typed schema for the input CSV, derived from a case class
     case class Data(InvoiceNo: String, StockCode: String, Description: String,
                     Quantity: Long, InvoiceDate: String, UnitPrice: Double,
                     CustomerID: String, Country: String)
     val schema = Encoders.product[Data].schema
     val df = spark.read.option("header", true).schema(schema).csv("./data.csv")

     // Drop rows with no customer and remove duplicates
     val clean = df.na.drop(Seq("CustomerID")).dropDuplicates()

     // Split line amounts into totals, discounts ("D") and postage ("P"),
     // and flag cancelled invoices (InvoiceNo prefixed with "C")
     val data = clean
       .withColumn("total", when($"StockCode" =!= "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Discount", when($"StockCode" === "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Postage", when($"StockCode" === "P", 1).otherwise(0))
       .withColumn("Invoice", regexp_replace($"InvoiceNo", "^C", ""))
       .withColumn("Cancelled", when(substring($"InvoiceNo", 0, 1) === "C", 1).otherwise(0))

     // Roll line items up to invoices, then invoices up to customers
     val aggregated = data.groupBy($"Invoice", $"Country", $"CustomerID")
       .agg(sum($"Discount").as("Discount"), sum($"total").as("Total"), max($"Cancelled").as("Cancelled"))
     val customers = aggregated.groupBy($"CustomerID")
       .agg(sum($"Total").as("Total"), sum($"Discount").as("Discount"),
            sum($"Cancelled").as("Cancelled"), count($"Invoice").as("Invoices"))

     // Assemble the per-customer features and cluster them with k-means
     import org.apache.spark.ml.feature.VectorAssembler
     val assembler = new VectorAssembler()
       .setInputCols(Array("Total", "Discount", "Cancelled", "Invoices"))
       .setOutputCol("features")
     val features = assembler.transform(customers)

     import org.apache.spark.ml.clustering.KMeans
     import org.apache.spark.ml.evaluation.ClusteringEvaluator
     val Array(test, train) = features.randomSplit(Array(0.3, 0.7))
     val kmeans = new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
     val model = kmeans.fit(train)
     model.clusterCenters.foreach(println)
     val predictions = model.transform(test)
     predictions.groupBy($"prediction").count().show()
  3. The same computation in three languages:
     Clojure: (reduce + (map #(+ % 2) (range 0 10)))
     Scala:   (0 until 10).map(_ + 2).reduce(_ + _)
     C#:      Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
  4. 2004: "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean and Sanjay Ghemawat, Google, Inc. https://research.google.com/archive/mapreduce-osdi04-slides/index.html
  5. • Re-execute on failure • Skip bad records • Redundant execution (copies of tasks) • Data locality optimization • Combiners (map-side reduce; a Spark sketch of the same idea follows below) • Compression of data
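MapReduce's combiner idea carries over to Spark: reduceByKey merges values on each partition before the shuffle, unlike groupByKey, which ships every record across the network. A minimal word-count sketch, assuming an existing SparkContext named sc and an illustrative input path:

     // Map-side combine: reduceByKey pre-aggregates within each partition
     // before shuffling, playing the same role as a MapReduce combiner.
     val counts = sc.textFile("input.txt")
       .flatMap(_.split("\\s+"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
     counts.take(10).foreach(println)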
  6. Sort is shuffle (in MapReduce, sorting the map output is the mechanism that implements the shuffle)
  7. Microsoft DryadLINQ / LINQ to HPC (2009-2011) • DAG • Compiled, not run directly https://www.microsoft.com/en-us/research/project/dryadlinq/
  8. AMPLab Spark • Born as a way to test Mesos • Open sourced in 2010
  9. Spark Components
  10. Resilient Distributed Dataset (RDD)
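A minimal sketch of the RDD model, assuming a live SparkSession named spark: transformations are lazy and only record lineage, and that lineage is what makes the dataset resilient, since lost partitions can be recomputed from it.

     val sc = spark.sparkContext
     val nums = sc.parallelize(1 to 1000, 8)      // an RDD split across 8 partitions
     val evens = nums.filter(_ % 2 == 0)          // lazy: only lineage is recorded
     val total = evens.map(_ * 2).reduce(_ + _)   // action: triggers actual execution
     println(total)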
  11. DataFrame and Dataset • Higher abstraction • More like a database table than an array • Adds optimizers (a sketch of the difference follows below)
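Reusing the Data case class and schema from the demo slide: a DataFrame is a collection of untyped Rows with named columns, while a Dataset carries a compile-time type; both run through the same optimizer.

     // DataFrame: columns referenced by name, checked only at runtime
     val df = spark.read.option("header", true).schema(schema).csv("./data.csv")
     df.select($"Country", ($"UnitPrice" * $"Quantity").as("amount")).show()

     // Dataset: the same rows with a compile-time type, checked by the compiler
     val ds: Dataset[Data] = df.as[Data]
     ds.filter(d => d.Quantity > 10).show()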
  12. Parsed plan → logical plan → optimized plan → physical plan
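explain(true) prints all four stages of this pipeline for any DataFrame, for example the aggregation built in the demo slide:

     // Prints the parsed, analyzed (logical), optimized, and physical plans
     aggregated.explain(true)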
  13. Spark UI
  14. With batch, all the data is already there https://www2.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing
  15. Streaming – event by event
  16. Streaming challenges: watermarks describe event-time progress, and events earlier than the watermark are ignored (a watermark that advances too slowly adds delay; one that advances too fast drops more late events). A sketch follows below.
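A minimal Structured Streaming sketch of a watermark; the rate source (which emits timestamp/value pairs) and the window and lateness thresholds are illustrative choices, not the deck's demo:

     import org.apache.spark.sql.functions.window

     // Count events per 5-minute window; events arriving more than
     // 10 minutes behind the watermark are dropped as too late.
     val events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
     val counts = events
       .withWatermark("timestamp", "10 minutes")
       .groupBy(window($"timestamp", "5 minutes"))
       .count()
     counts.writeStream.outputMode("update").format("console").start()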
  17. • Spark Streaming • Spark Structured Streaming (unified code for batch & streaming) • Demo - Dimensio
  18. Caveat emptor • Also bugs (Spark 1.6)
  19. Bugs... • https://issues.apache.org/jira/browse/SPARK-8406
  20. Debugging out-of-memory problems (common configuration knobs are sketched below)
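Some common starting knobs when chasing executor out-of-memory errors; the values below are illustrative, and the right settings depend entirely on the job and the cluster:

     import org.apache.spark.sql.SparkSession

     val spark = SparkSession.builder()
       .appName("oom-debugging")
       .config("spark.executor.memory", "4g")          // JVM heap per executor
       .config("spark.executor.memoryOverhead", "1g")  // off-heap headroom (YARN/Kubernetes)
       .config("spark.sql.shuffle.partitions", "400")  // more, smaller shuffle partitions
       .getOrCreate()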
  21. Long DAGs (a checkpointing sketch follows below)
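One common way to tame a very long DAG is checkpointing, which materializes an intermediate result and truncates the lineage. A sketch; longPipeline and the checkpoint directory are hypothetical names:

     spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
     // checkpoint() writes the data out and cuts the plan/lineage at this point
     val trimmed = longPipeline.checkpoint()
     trimmed.count()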
  22. Data skew (a salting sketch follows below)
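A standard mitigation is key salting: split each hot key into N synthetic sub-keys, aggregate, then merge the partials. A sketch assuming a DataFrame named sales with a skewed CustomerID column (both names are illustrative):

     import org.apache.spark.sql.functions._

     // Spread each key over 16 random sub-keys so no single task owns a hot key
     val salted = sales.withColumn("salt", (rand() * 16).cast("int"))
     val partial = salted.groupBy($"CustomerID", $"salt")
       .agg(sum($"Total").as("partialTotal"))
     // Cheap second pass merges at most 16 partial rows per key
     val result = partial.groupBy($"CustomerID")
       .agg(sum($"partialTotal").as("Total"))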
  23. Spark • Lots of things out of the box • Batch (RDDs, DataFrames, Datasets) • Streaming • Structured Streaming (unifies batch and streaming) • Graph • ("Classic") ML • Runs on Hadoop, Mesos, Kubernetes
  24. Lots of extensions • Spark NLP (John Snow Labs) • Spark deep learning: Databricks, Intel (BigDL), Deeplearning4j, H2O • Connectors to any DB that respects itself • (Hades is WIP :))
  25. Multiple languages • Scala, Java, R, Python, .NET (just released) • Currently Scala is the favorite • Python is taking center stage
