Apache Spark

Arnon Rotem-Gal-Oz
Distinguished Engineer at Monday.com
Demo
// Paste into spark-shell with :paste; `spark` is the shell's SparkSession
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
// Derive the CSV schema from a case class
case class Data(InvoiceNo: String, StockCode: String, Description: String, Quantity: Long,
  InvoiceDate: String, UnitPrice: Double, CustomerID: String, Country: String)
val schema = Encoders.product[Data].schema
val df = spark.read.option("header", true).schema(schema).csv("./data.csv")
// Drop rows with no customer, then exact duplicates
val clean = df.na.drop(Seq("CustomerID")).dropDuplicates()
// Per-line columns: StockCode "D" is treated as a discount line, "P" as postage,
// and an InvoiceNo starting with "C" as a cancellation
val data = clean
  .withColumn("total", when($"StockCode" =!= "D", $"UnitPrice" * $"Quantity").otherwise(0))
  .withColumn("Discount", when($"StockCode" === "D", $"UnitPrice" * $"Quantity").otherwise(0))
  .withColumn("Postage", when($"StockCode" === "P", 1).otherwise(0))
  .withColumn("Invoice", regexp_replace($"InvoiceNo", "^C", ""))
  .withColumn("Cancelled", when(substring($"InvoiceNo", 1, 1) === "C", 1).otherwise(0)) // positions are 1-based
// Roll invoice lines up to one row per invoice
val aggregated = data.groupBy($"Invoice", $"Country", $"CustomerID")
  .agg(sum($"Discount").as("Discount"), sum($"total").as("Total"), max($"Cancelled").as("Cancelled"))
// Roll invoices up to one row per customer
val customers = aggregated.groupBy($"CustomerID")
  .agg(sum($"Total").as("Total"), sum($"Discount").as("Discount"),
    sum($"Cancelled").as("Cancelled"), count($"Invoice").as("Invoices"))
// Pack the numeric columns into a single feature vector
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(Array("Total", "Discount", "Cancelled", "Invoices"))
  .setOutputCol("features")
val features = assembler.transform(customers)
// Fit k-means on 70% of the customers and assign the remaining 30% to clusters
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val Array(test, train) = features.randomSplit(Array(0.3, 0.7))
val kmeans = new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(train)
model.clusterCenters.foreach(println)
// Number of test customers per cluster
val predictions = model.transform(test)
predictions.groupBy($"prediction").count().show()
(reduce + (map #(+ % 2) (range 0 10)))
(0 until 10).map(_ + 2).reduce(_ + _)
Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
2004
MapReduce:
Simplified Data Processing on Large Clusters
Jeff Dean, Sanjay Ghemawat
Google, Inc.
https://research.google.com/archive/mapreduce-osdi04-slides/index.html
• Re-execute on failure
• Skip bad records
• Redundant execution (copies of tasks)
• Data locality optimization
• Combiners (map-side reduce)
• Compression of data
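A minimal sketch of what a combiner (map-side reduce) does, written with plain Scala collections rather than any MapReduce or Spark API. The object name CombinerSketch and its functions are hypothetical, chosen only for illustration: each partition pre-aggregates its own (word, 1) pairs, so fewer pairs would need to cross the network before the final reduce.

```scala
object CombinerSketch {
  // Map phase for one partition: emit a (word, 1) pair per word.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w, 1))

  // Combiner: sum the pairs per word inside a single partition,
  // before anything is sent to the reducers.
  def combine(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // Reduce phase: merge the partial counts coming from all partitions.
  def reducePhase(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // Word count over partitions: map, combine locally, then reduce globally.
  def wordCount(partitions: Seq[Seq[String]]): Map[String, Int] =
    reducePhase(partitions.map(p => combine(mapPhase(p))))
}
```

For a partition holding the line "a b a", the map phase emits three pairs but the combiner forwards only two (one per distinct word), which is the saving the bullet refers to.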
Sort is shuffle
Microsoft DryadLINQ / LINQ to HPC (2009-2011)
• DAG
• Compiled, not run directly
https://www.microsoft.com/en-us/research/project/dryadlinq/
AMPLab Spark
• Born as a way to test Mesos
• Open sourced in 2010
Spark Components
Resilient Distributed Dataset
DataFrame and Dataset
• Higher abstraction
• More like a database table than an array
• Adds optimizers
Parsed plan
Logical plan
Optimized plan
Physical plan
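One way to look at these plan stages is Dataset.explain(true), which prints the parsed, analyzed and optimized logical plans followed by the physical plan. The sketch below assumes Spark is on the classpath and runs in local mode; the object name PlanSketch, the sample rows and the column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PlanSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("plan-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
    val query = df.filter($"value" > 1).groupBy($"key").count()

    // extended = true prints the logical plans as well as the physical plan.
    query.explain(true)

    spark.stop()
  }
}
```

Nothing is computed by explain; it only shows how the query would be planned.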
Spark UI
With batch, all the data is there
https://www2.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing
Streaming – event by event
Streaming challenges
Watermarks describe event-time progress
Events earlier than the watermark are ignored
(watermark too slow: extra delay; too fast: more late events)
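The watermark idea can be sketched in plain Scala. This is a simplified per-event model, not how Spark computes watermarks internally; the names WatermarkSketch, Event and allowedLateness are assumptions for illustration. The watermark trails the largest event time seen so far by allowedLateness, and any event older than the watermark is dropped.

```scala
object WatermarkSketch {
  final case class Event(id: String, eventTime: Long)

  // Processes events in arrival order and returns (accepted, dropped).
  def process(arrivals: Seq[Event], allowedLateness: Long): (Seq[Event], Seq[Event]) = {
    var maxEventTime = Long.MinValue
    val accepted = Seq.newBuilder[Event]
    val dropped = Seq.newBuilder[Event]
    for (e <- arrivals) {
      // Before the first event there is no watermark, so nothing can be late yet.
      val watermark =
        if (maxEventTime == Long.MinValue) Long.MinValue
        else maxEventTime - allowedLateness
      if (e.eventTime < watermark) dropped += e else accepted += e
      maxEventTime = math.max(maxEventTime, e.eventTime)
    }
    (accepted.result(), dropped.result())
  }
}
```

With arrivals at event times 10, 20, 12, 5 and an allowedLateness of 10, only the event at time 5 is dropped. Shrinking allowedLateness to 5 makes the watermark advance faster, and the event at time 12 is dropped as well, which is the trade-off between delay and late events.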
• Spark Streaming
• Spark Structured Streaming (unified code for batch and streaming)
• Demo - Dimensio
Caveat emptor
• Also bugs (Spark 1.6)
Bugs..
• https://issues.apache.org/jira/browse/SPARK-8406
Debugging
Out of memory problems
Long DAGs
Data Skew
Spark
• Lots of things out of the box
• Batch (RDD, DataFrames, Datasets)
• Streaming
• Structured Streaming (unifies batch and streaming)
• Graph
• (“Classic”) ML
• Runs on Hadoop, Mesos, Kubernetes
Lots of extensions
• Spark NLP - John Snow Labs
• Spark Deep Learning - Databricks, Intel (BigDL), Deeplearning4j, H2O
• Connectors to any self-respecting DB
• (Hades is WIP)
Multiple languages
• Scala, Java, R, Python, .NET (just released)
• Currently Scala is the favorite
• Python is taking center stage