Apache Spark

  1. Apache Spark (Arnon Rotem-Gal-Oz)
  2. Demo
     import spark.implicits._
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql._

     case class Data(InvoiceNo: String, StockCode: String, Description: String, Quantity: Long,
                     InvoiceDate: String, UnitPrice: Double, CustomerID: String, Country: String)

     // Read the CSV with a schema derived from the case class, drop rows without a customer, deduplicate
     val schema = Encoders.product[Data].schema
     val df = spark.read.option("header", true).schema(schema).csv("./data.csv")
     val clean = df.na.drop(Seq("CustomerID")).dropDuplicates()

     // Derive per-line columns: total, discount, postage, normalized invoice number, cancellation flag
     val data = clean
       .withColumn("total", when($"StockCode" =!= "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Discount", when($"StockCode" === "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Postage", when($"StockCode" === "P", 1).otherwise(0))
       .withColumn("Invoice", regexp_replace($"InvoiceNo", "^C", ""))
       .withColumn("Cancelled", when(substring($"InvoiceNo", 0, 1) === "C", 1).otherwise(0))

     // Aggregate to invoice level, then to customer level
     val aggregated = data.groupBy($"Invoice", $"Country", $"CustomerID")
       .agg(sum($"Discount").as("Discount"), sum($"total").as("Total"), max($"Cancelled").as("Cancelled"))
     val customers = aggregated.groupBy($"CustomerID")
       .agg(sum($"Total").as("Total"), sum($"Discount").as("Discount"),
            sum($"Cancelled").as("Cancelled"), count($"Invoice").as("Invoices"))

     // Assemble a feature vector and cluster customers with k-means
     import org.apache.spark.ml.feature.VectorAssembler
     val assembler = new VectorAssembler()
       .setInputCols(Array("Total", "Discount", "Cancelled", "Invoices")).setOutputCol("features")
     val features = assembler.transform(customers)

     import org.apache.spark.ml.clustering.KMeans
     import org.apache.spark.ml.evaluation.ClusteringEvaluator
     val Array(test, train) = features.randomSplit(Array(0.3, 0.7))
     val kmeans = new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
     val model = kmeans.fit(train)
     model.clusterCenters.foreach(println)
     val predictions = model.transform(test)
     predictions.groupBy($"prediction").count().show()
  3. The same computation in three languages:
     Clojure:  (reduce + (map #(+ % 2) (range 0 10)))
     Scala:    (0 until 10).map(_ + 2).reduce(_ + _)
     C# LINQ:  Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
  4. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean and Sanjay Ghemawat, Google, Inc., 2004
     https://research.google.com/archive/mapreduce-osdi04-slides/index.html
  5. • Re-execute on failure • Skip bad records • Redundant execution (copies of tasks) • Data locality optimization • Combiners (map-side reduce) • Compression of data
  6. Sort is shuffle
  7. Microsoft DryadLINQ / LINQ to HPC (2009-2011) • DAG • Compiled, not run directly • https://www.microsoft.com/en-us/research/project/dryadlinq/
  8. AMPLab's Spark • Born as a way to test Mesos • Open sourced in 2010
  9. Spark Components
  10. Resilient Distributed Dataset (RDD)
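     A minimal RDD sketch, assuming a spark-shell session where sc is the SparkContext (the values are illustrative, not from the deck):

       // An RDD is an immutable, partitioned collection; transformations are lazy, actions trigger a job.
       val numbers = sc.parallelize(0 to 9, 4)   // distribute a local range over 4 partitions
       val shifted = numbers.map(_ + 2)          // transformation: recorded, not yet executed
       val total   = shifted.reduce(_ + _)       // action: runs the job and returns 65
       println(total)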
  11. DataFrame and Dataset • Higher abstraction • More like a database table than an array • Adds optimizers
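     A small sketch of the difference, assuming a SparkSession named spark; the Person case class and the data are illustrative:

       import spark.implicits._
       case class Person(name: String, age: Long)

       // DataFrame: untyped rows with named columns, checked at runtime
       val df = Seq(("Ann", 34L), ("Bob", 28L)).toDF("name", "age")
       df.filter($"age" > 30).show()

       // Dataset: the same data with a compile-time type, going through the same optimizer
       val ds = Seq(Person("Ann", 34L), Person("Bob", 28L)).toDS()
       ds.filter(_.age > 30).show()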
  12. Parsed plan → logical plan → optimized plan → physical plan
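     These stages can be inspected directly; a small sketch (the DataFrame is illustrative) using explain(true), which prints the parsed, analyzed, optimized, and physical plans:

       import spark.implicits._

       // explain(true) prints every Catalyst stage for the query below
       val sales = spark.range(1000).withColumn("amount", $"id" * 2)
       sales.filter($"amount" > 100).groupBy($"id" % 10).count().explain(true)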
  13. Spark UI
  14. With batch, all the data is already there https://www2.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing
  15. Streaming – event by event
  16. Streaming challenges • Watermarks describe event-time progress • Events earlier than the watermark are ignored (a watermark that advances too slowly adds delay, one that advances too fast drops more late events)
  17. • Spark Streaming • Spark Structured Streaming (unified code for batch & streaming) • Demo - Dimensio
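     A minimal Structured Streaming sketch with a watermark; the socket source, the event-time column, and the 10-minute threshold are illustrative assumptions, not the demo referenced in the slide:

       import spark.implicits._
       import org.apache.spark.sql.functions._

       // Read a stream and attach an event-time column (normally parsed from the event itself).
       val events = spark.readStream
         .format("socket").option("host", "localhost").option("port", "9999").load()
         .withColumn("eventTime", current_timestamp())

       // The watermark tracks event-time progress: rows more than 10 minutes behind the
       // latest eventTime seen are treated as too late and dropped from the aggregation.
       val counts = events
         .withWatermark("eventTime", "10 minutes")
         .groupBy(window($"eventTime", "5 minutes"))
         .count()

       counts.writeStream.outputMode("update").format("console").start().awaitTermination()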
  18. Caveat emptor • Also bugs (Spark 1.6)
  19. Bugs.. • https://issues.apache.org/jira/browse/SPARK-8406
  20. Debugging Out Of Memory problems
  21. Long DAGs
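     One common way to keep a long lineage manageable, shown as a sketch (the iterative loop is illustrative): checkpoint periodically so Spark can truncate the DAG instead of replanning the whole history.

       import spark.implicits._

       spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
       var df = spark.range(0, 1000000).toDF("id")
       for (i <- 1 to 50) {
         df = df.withColumn(s"step_$i", $"id" + i)
         if (i % 10 == 0) df = df.checkpoint()   // materializes the data and cuts the plan/lineage
       }
       df.count()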
  22. Data Skew
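     Skewed keys pile most of a join's work onto a few tasks; a common mitigation, shown here as a sketch with made-up tables and column names, is to salt the hot keys so they spread across partitions:

       import spark.implicits._
       import org.apache.spark.sql.functions._

       // Illustrative inputs: a fact table where the key "hot" dominates, and a small lookup table.
       val largeDF = spark.range(0, 1000000)
         .withColumn("key", when($"id" % 10 < 7, lit("hot"))
           .otherwise(concat(lit("k"), ($"id" % 10).cast("string"))))
       val smallDF = Seq(("hot", 1), ("k7", 2), ("k8", 3), ("k9", 4)).toDF("key", "value")

       // Salt the skewed side: each key becomes one of N (key, salt) sub-keys,
       // so the hot key no longer lands in a single task.
       val N = 16
       val saltedLarge = largeDF.withColumn("salt", (rand() * N).cast("int"))
       // Replicate the small side once per salt value so every (key, salt) pair still matches.
       val saltedSmall = smallDF.withColumn("salt", explode(array((0 until N).map(lit): _*)))

       val joined = saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")
       joined.groupBy("key").count().show()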
  23. Spark • Lots of things out of the box • Batch (RDDs, DataFrames, Datasets) • Streaming • Structured Streaming (unifies batch and streaming) • Graph • ("Classic") ML • Runs on Hadoop, Mesos, Kubernetes
  24. Lots of extensions • Spark NLP - John Snow Labs • Spark deep learning - Databricks, Intel (BigDL), DeepLearning4j, H2O • Connectors to any DB that respects itself • (Hades is WIP)
  25. Multiple languages • Scala, Java, R, Python, .NET (just released) • Currently Scala is the favorite • Python is taking center stage
