Hadoop and Spark for the SAS Developer
Richard Williamson | @superhadooper
10 June 2015
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
AGENDA
• My Background
• Overview: SAS vs. Spark
• Spark DataFrame vs. SAS Dataset
• Spark SQL vs. SAS Proc SQL
• Spark MLlib vs. SAS Stats
• Spark Streaming
• Questions?
OVERVIEW: SAS vs. Spark
SAS
• SAS is the largest market-share holder in "advanced
analytics" with 36.2% of the market as of 2012.*
Spark
• Launched in U.C. Berkeley’s AMPLab in 2009, Apache Spark
has begun to catch on like wildfire during the last year and a
half. Spark had more than 465 contributors in 2014, making it
the most active project in the Apache Software Foundation
and among big data open source projects globally.**
* http://en.wikipedia.org/wiki/SAS_%28software%29
** http://techcrunch.com/2015/03/19/on-the-growth-of-apache-spark
OVERVIEW: SAS vs. Spark
SAS
• The basic programming model consists of the SAS DATA step
and SAS procedures
• SAS datasets move data between processing steps
Spark
• Native language is Scala, which allows generic data types
and a flexible programming model (Java and Python
are also supported)
• RDDs (and now DataFrames) move distributed datasets
between processing steps
OVERVIEW: SAS vs. Spark
SAS Code Snippet
http://support.sas.com/kb/24/595.html
data old;
input state $ accttot;
datalines;
ca 7000
ca 6500
ca 5800
nc 4800
nc 3640
sc 3520
va 4490
va 8700
va 2850
va 1111
;
run;
Spark Code Snippet
import sqlContext.implicits._
case class OLD(state: String, accttot: Int)
val oldList = List(
OLD("va",1111),
OLD("ca",7000),
OLD("ca",6500),
OLD("ca",5800),
OLD("nc",4800),
OLD("nc",3640),
OLD("sc",3520),
OLD("va",4490),
OLD("va",8700),
OLD("va",2850)
)
OVERVIEW: SAS vs. Spark
SAS Code Snippet
http://support.sas.com/kb/24/595.html
proc sort data=old;
by state;
run;
data new;
set old (drop= accttot);
by state;
if first.state then count=0;
count+1;
if last.state then output;
run;
/* Alternative: PROC FREQ gives the same per-state counts */
proc freq data=old;
tables state / out=new(drop=percent);
run;
Spark Code Snippet
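val oldRDD = sc.parallelize(oldList)
var oldDataFrame = oldRDD.toDF()
oldDataFrame = oldDataFrame.orderBy("state")
// RDD approach: key each record by state, then count records per key
oldRDD.map(old => (old.state, 1))
.aggregateByKey(0)((buffer, value) => buffer + value, (b1, b2) => b1 + b2)
.collect().foreach(println)
// DataFrame approach: the same per-state count
val newDataFrame =
oldDataFrame.groupBy("state").count()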
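For readers coming from the DATA step, the first.state / last.state counting logic can be sketched in plain Scala with no Spark dependency. This is an illustration only: the object name ByGroupCount and the helper countByState are hypothetical, and the rows are the same ten records used in the snippets above.

```scala
// Plain-Scala sketch (no Spark needed) of BY-group counting.
// ByGroupCount and countByState are illustrative names, not Spark or SAS APIs.
object ByGroupCount {
  final case class Old(state: String, accttot: Int)

  val oldList: List[Old] = List(
    Old("va", 1111), Old("ca", 7000), Old("ca", 6500), Old("ca", 5800),
    Old("nc", 4800), Old("nc", 3640), Old("sc", 3520),
    Old("va", 4490), Old("va", 8700), Old("va", 2850)
  )

  // Sort by the BY variable, then walk the rows: start a new counter when
  // the state changes (first.state) and increment it otherwise (count+1).
  def countByState(rows: List[Old]): List[(String, Int)] =
    rows.sortBy(_.state).foldLeft(List.empty[(String, Int)]) {
      case ((state, n) :: rest, row) if state == row.state => (state, n + 1) :: rest
      case (acc, row) => (row.state, 1) :: acc
    }.reverse

  def main(args: Array[String]): Unit =
    countByState(oldList).foreach(println)
}
```

The Spark groupBy("state").count() call expresses the same result declaratively and leaves the sorting and partitioning to the engine.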
Spark DataFrame vs. SAS Dataset
Spark DataFrame — a distributed collection of data organized
into named columns.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#rename-of-schemardd-to-dataframe
SAS Dataset — a SAS file stored in a SAS library organized as a
table of observations (rows) and variables (columns) that can
be processed by SAS software.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001005709.htm
Spark DataFrame vs. SAS Dataset
How does a Spark DataFrame differ from a SAS Dataset?
• A DataFrame is built from the ground up to be distributed and can
be processed in parallel by multiple machines, whereas a SAS Dataset
has non-distributed roots
• A DataFrame is a logical entity that is not necessarily paired with
a serialized on-disk version, whereas a SAS Dataset has an on-disk
manifestation
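The "logical entity" point can be loosely illustrated with a plain Scala collection view, which records a transformation without running it until a result is requested. This is only an analogy under stated assumptions: Scala views are not Spark, and the names LazyPlanSketch and demo are hypothetical.

```scala
// Analogy only: a Scala view defers work, roughly as a DataFrame defers
// its transformations until an action runs. Not Spark code.
object LazyPlanSketch {
  val rows: Vector[(String, Int)] =
    Vector(("ca", 7000), ("nc", 4800), ("va", 8700))

  // Returns (rows inspected before forcing, forced result, rows inspected after).
  def demo(): (Int, Vector[(String, Int)], Int) = {
    var inspected = 0
    // Building the view only describes the filter; no row is touched yet.
    val plan = rows.view.filter { case (_, accttot) =>
      inspected += 1
      accttot > 5000
    }
    val before = inspected
    // Forcing the view is what triggers the work.
    val result = plan.toVector
    (before, result, inspected)
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

A SAS Dataset, by contrast, exists as a file in a library as soon as the step that creates it has run.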
Spark SQL vs. SAS Proc SQL
• My first reaction to Spark SQL was, “This looks like Proc SQL”
• SAS Proc SQL Simple Example:
libname Example 'c:\SASPROJECTS';
proc sql;
create table newtable
as select a.*, b.unique_consumer_id
from Example.transactions as a, Example.consumer as b
where a.ref_id=b.ref_id;
quit;
Spark SQL vs. SAS Proc SQL
• Spark SQL Simple Example:
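val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Assumes DataFrames were already registered as temp tables
// named "transactions" and "consumer" (registerTempTable)
val newtable = sqlContext.sql("""
select a.*, b.unique_consumer_id
from transactions as a, consumer as b
where a.ref_id=b.ref_id""")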
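What the query computes is an inner join on ref_id. A minimal plain-Scala sketch of that join follows, without Spark. The case classes, the amount field, and the sample values in main are hypothetical stand-ins for the transactions and consumer tables.

```scala
// Plain-Scala sketch of the inner join expressed by the SQL above.
// Transaction, Consumer and their fields are illustrative assumptions.
object JoinSketch {
  final case class Transaction(refId: Int, amount: Double)
  final case class Consumer(refId: Int, uniqueConsumerId: String)

  // Pair every transaction with the consumer id of each matching consumer.
  // Transactions without a matching refId are dropped, as in an inner join.
  def join(ts: Seq[Transaction], cs: Seq[Consumer]): Seq[(Transaction, String)] = {
    val byRef = cs.groupBy(_.refId)
    for {
      t <- ts
      c <- byRef.getOrElse(t.refId, Seq.empty)
    } yield (t, c.uniqueConsumerId)
  }

  def main(args: Array[String]): Unit = {
    val ts = Seq(Transaction(1, 9.99), Transaction(2, 5.0), Transaction(3, 1.0))
    val cs = Seq(Consumer(1, "c-001"), Consumer(2, "c-002"))
    join(ts, cs).foreach(println)
  }
}
```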
Spark MLlib vs. SAS Stats
• Spark MLlib
• https://spark.apache.org/docs/latest/mllib-guide.html
• Spark’s scalable machine learning library, consisting of
common learning algorithms and utilities, including
classification, regression, clustering, collaborative filtering,
and dimensionality reduction
• SAS Stats
• Traditional add-on package to SAS for statistics
Spark MLlib Example Data Prep
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
// meetup4 and meetup7 are defined in steps not shown on the slide
case class Meetup(mdatehr: String, mdate: String, mhour: String)
// toDF() needs import sqlContext.implicits._ (Spark 1.3+)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3)).toDF()
meetup5.registerTempTable("meetup5")
val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")
val trainingData = meetup7.map { row =>
val features = Array[Double](row(24).toString().toDouble,row(0).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
Spark MLlib Example Regression Model
val trainingData = meetup7.map { row =>
val features = Array[Double](1.0,row(0).toString().toDouble,row(1).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
val model = new RidgeRegressionWithSGD().run(trainingData)
val scores = meetup7.map { row =>
val features = Vectors.dense(Array[Double](1.0,row(0).toString().toDouble, …
row(23).toString().toDouble))
(row(25),row(26),row(27), model.predict(features))}
scores.foreach(println)
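To make the idea behind the ridge regression call concrete, here is a small plain-Scala sketch of least-squares fitting with an L2 penalty. It is a conceptual illustration under stated assumptions, not MLlib's implementation: it uses full-batch gradient descent rather than stochastic mini-batches, and the names RidgeSketch, Point, train, and predict, as well as the toy data, are hypothetical. As on the slide, the leading 1.0 feature plays the role of the intercept.

```scala
// Conceptual sketch of ridge regression via full-batch gradient descent.
// Not MLlib code; all names and the toy data are illustrative assumptions.
object RidgeSketch {
  // A label plus a feature vector whose first entry is the 1.0 intercept term.
  final case class Point(label: Double, features: Array[Double])

  def predict(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Minimises mean squared error / 2 plus (regParam / 2) * ||w||^2.
  def train(data: Seq[Point], stepSize: Double, regParam: Double, iterations: Int): Array[Double] = {
    val dim = data.head.features.length
    var w = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      val grad = Array.fill(dim)(0.0)
      for (p <- data) {
        val err = predict(w, p.features) - p.label
        for (j <- 0 until dim) grad(j) += err * p.features(j) / data.size
      }
      // Gradient step; the regParam * w(j) term is the L2 (ridge) shrinkage.
      w = Array.tabulate(dim)(j => w(j) - stepSize * (grad(j) + regParam * w(j)))
    }
    w
  }

  // Toy data generated from y = 1 + 2x for x = 0..4.
  val data: Seq[Point] =
    (0 to 4).map(x => Point(1.0 + 2.0 * x, Array(1.0, x.toDouble)))

  def main(args: Array[String]): Unit =
    println(train(data, stepSize = 0.1, regParam = 0.001, iterations = 2000).mkString(", "))
}
```

With a small regParam the fitted weights land close to the generating intercept and slope; larger values shrink them toward zero.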
Spark Streaming
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
lines.foreachRDD(rdd => {
val lines2 = sqlContext.jsonRDD(rdd)
lines2.registerTempTable("lines2")
val lines3 = sqlContext.sql("""select event.event_id,event.event_name,event.event_url, event.time,guests,
member.member_id, member.member_name, member.other_services.facebook.identifier as facebook_identifier,
member.other_services.linkedin.identifier as linkedin_identifier, member.other_services.twitter.identifier as
twitter_identifier,member.photo, mtime,response,
rsvp_id,venue.lat,venue.lon,venue.venue_id,venue.venue_name,visibility from lines2""")
//PERFORM LOGIC HERE LIKE STREAMING REGRESSION
})
ssc.start()
// Keep the driver alive so batches continue to be processed
ssc.awaitTermination()
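The Seconds(10) argument sets the batch interval: the stream is handled as a sequence of small batches, one per 10-second window. A plain-Scala sketch of that micro-batching idea follows, without Spark. It is a simplification under stated assumptions: the event timestamps here stand in for arrival time, and the names MicroBatchSketch, Rsvp, batches, and yesCountPerBatch are hypothetical.

```scala
// Plain-Scala sketch of micro-batching; not Spark Streaming code.
// All names and sample events are illustrative assumptions.
object MicroBatchSketch {
  final case class Rsvp(timestampMs: Long, response: String)

  // Assign each event to the batch whose interval contains its timestamp.
  def batches(events: Seq[Rsvp], batchIntervalMs: Long): Map[Long, Seq[Rsvp]] =
    events.groupBy(_.timestampMs / batchIntervalMs)

  // Per-batch logic, analogous to the body of foreachRDD: count "yes" RSVPs.
  def yesCountPerBatch(events: Seq[Rsvp], batchIntervalMs: Long): Map[Long, Int] =
    batches(events, batchIntervalMs).map { case (batch, es) =>
      batch -> es.count(_.response == "yes")
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(Rsvp(1000L, "yes"), Rsvp(9000L, "no"), Rsvp(12000L, "yes"), Rsvp(25000L, "yes"))
    yesCountPerBatch(events, 10000L).toSeq.sortBy(_._1).foreach(println)
  }
}
```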
Key Takeaways
• If you work with large data or compute-intensive advanced
analytics and want a platform built from the ground up to
run faster on distributed servers, try out Spark
• If you would like more control over your code than
an added macro language provides, try out Spark
• If you want to better leverage data stored in Hadoop, try
out Spark
• If you prefer the open source licensing model over a
subscription model, try out Spark
Richard Williamson
richard@svds.com
@superhadooper
Yes, we’re hiring!
info@svds.com

More Related Content

What's hot

Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
Databricks
 

What's hot (20)

KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with Snowflake
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company Presentation
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with Snowflake
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
 
Siligong.Data - May 2021 - Transforming your analytics workflow with dbt
Siligong.Data - May 2021 - Transforming your analytics workflow with dbtSiligong.Data - May 2021 - Transforming your analytics workflow with dbt
Siligong.Data - May 2021 - Transforming your analytics workflow with dbt
 
Junior 產業:華通(2313-TW)
Junior 產業:華通(2313-TW)Junior 產業:華通(2313-TW)
Junior 產業:華通(2313-TW)
 
Data mapping tutorial
Data mapping tutorialData mapping tutorial
Data mapping tutorial
 
因應COVID-19疫情診所應變線上說明會
因應COVID-19疫情診所應變線上說明會因應COVID-19疫情診所應變線上說明會
因應COVID-19疫情診所應變線上說明會
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
By The Numbers: CPaaS, UCaaS, CCaaS Landscapes and Market Sizing
By The Numbers: CPaaS, UCaaS, CCaaS Landscapes and Market SizingBy The Numbers: CPaaS, UCaaS, CCaaS Landscapes and Market Sizing
By The Numbers: CPaaS, UCaaS, CCaaS Landscapes and Market Sizing
 
Capco Linked In
Capco Linked InCapco Linked In
Capco Linked In
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
US Army Cyber Warfare Field Manual FM 3-38 CEMS
US Army Cyber Warfare Field Manual  FM 3-38  CEMSUS Army Cyber Warfare Field Manual  FM 3-38  CEMS
US Army Cyber Warfare Field Manual FM 3-38 CEMS
 
Data analytics
Data analyticsData analytics
Data analytics
 
Hold Firm: The State of Cyber Resilience in Banking and Capital Markets
Hold Firm: The State of Cyber Resilience in Banking and Capital MarketsHold Firm: The State of Cyber Resilience in Banking and Capital Markets
Hold Firm: The State of Cyber Resilience in Banking and Capital Markets
 

Viewers also liked

Web Services Hadoop Summit 2012
Web Services Hadoop Summit 2012Web Services Hadoop Summit 2012
Web Services Hadoop Summit 2012
Hortonworks
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Chris Fregly
 

Viewers also liked (20)

Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
SAS Modernization architectures - Big Data Analytics
SAS Modernization architectures - Big Data AnalyticsSAS Modernization architectures - Big Data Analytics
SAS Modernization architectures - Big Data Analytics
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014
 
Web Services Hadoop Summit 2012
Web Services Hadoop Summit 2012Web Services Hadoop Summit 2012
Web Services Hadoop Summit 2012
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
 
Introduction to big data and apache spark
Introduction to big data and apache sparkIntroduction to big data and apache spark
Introduction to big data and apache spark
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Big Data Analytics in Government
Big Data Analytics in GovernmentBig Data Analytics in Government
Big Data Analytics in Government
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
SAS and Cloudera – Analytics at Scale
SAS and Cloudera – Analytics at ScaleSAS and Cloudera – Analytics at Scale
SAS and Cloudera – Analytics at Scale
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
 
Where There Is Smoke, There is Fire: Extracting Actionable Intelligence from ...
Where There Is Smoke, There is Fire: Extracting Actionable Intelligence from ...Where There Is Smoke, There is Fire: Extracting Actionable Intelligence from ...
Where There Is Smoke, There is Fire: Extracting Actionable Intelligence from ...
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 

Similar to Hadoop and Spark for the SAS Developer

Similar to Hadoop and Spark for the SAS Developer (20)

Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Big Data on the Cloud
Big Data on the CloudBig Data on the Cloud
Big Data on the Cloud
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

Hadoop and Spark for the SAS Developer

  • 1. Hadoop and Spark for the SAS Developer Richard Williamson | @superhadooper 10 June 2015
  • 2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
  • 3. 3@SVDataScience © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.  My Background  Overview: SAS vs. Spark  Spark DataFrame vs. SAS Dataset  Spark SQL vs. SAS Proc SQL  Spark MLlib vs. SAS Stats  Spark Streaming  Questions? AGENDA
  • 4. OVERVIEW: SAS vs. Spark
       SAS
       • SAS is the largest market-share holder in "advanced analytics" with 36.2% of the market as of 2012.*
       Spark
       • Launched in U.C. Berkeley’s AMPLab in 2009, Apache Spark has begun to catch on like wildfire during the last year and a half. Spark had more than 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among big data open source projects globally.**
       * http://en.wikipedia.org/wiki/SAS_%28software%29
       ** http://techcrunch.com/2015/03/19/on-the-growth-of-apache-spark
  • 5. OVERVIEW: SAS vs. Spark
       SAS
       • Basic programming model consists of the SAS Data Step and SAS Procedures
       • SAS Datasets move data between processing steps
       Spark
       • Native language is Scala, which allows generic data types and a flexible programming model (Java and Python are also supported)
       • RDDs (and now DataFrames) are used to move distributed datasets between processing steps
  • 6. OVERVIEW: SAS vs. Spark
       SAS Code Snippet (http://support.sas.com/kb/24/595.html)
           data old;
             input state $ accttot;
             datalines;
           ca 7000
           ca 6500
           ca 5800
           nc 4800
           nc 3640
           sc 3520
           va 4490
           va 8700
           va 2850
           va 1111
           ;
       Spark Code Snippet
           import sqlContext.implicits._
           case class OLD(state: String, accttot: Int)
           val oldList = List(
             OLD("va",1111),
             OLD("ca",7000),
             OLD("ca",6500),
             OLD("ca",5800),
             OLD("nc",4800),
             OLD("nc",3640),
             OLD("sc",3520),
             OLD("va",4490),
             OLD("va",8700),
             OLD("va",2850)
           )
  • 7. OVERVIEW: SAS vs. Spark
       SAS Code Snippet (http://support.sas.com/kb/24/595.html)
           proc sort data=old;
             by state;
           run;
           /* Count rows per state in a DATA step */
           data new;
             set old (drop=accttot);
             by state;
             if first.state then count=0;
             count+1;
             if last.state then output;
           run;
           /* Same counts via PROC FREQ; data=old is named so it does not read new */
           proc freq data=old;
             tables state / out=new(drop=percent);
           run;
       Spark Code Snippet
           val oldRDD = sc.parallelize(oldList)
           var oldDataFrame = oldRDD.toDF()
           oldDataFrame = oldDataFrame.orderBy("state")
           // aggregateByKey needs a pair RDD, so key each row by state with a count of 1
           oldRDD.map(o => (o.state, 1))
             .aggregateByKey(0)((buffer, value) => buffer + value, (b1, b2) => b1 + b2)
             .foreach(println)
           val newDataFrame = oldDataFrame.groupBy("state").count()
  • 8. Spark DataFrame vs. SAS Dataset
       Spark DataFrame: a distributed collection of data organized into named columns.
       https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#rename-of-schemardd-to-dataframe
       SAS Dataset: a SAS file stored in a SAS library, organized as a table of observations (rows) and variables (columns), that can be processed by SAS software.
       http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001005709.htm
  • 9. Spark DataFrame vs. SAS Dataset
       How does a Spark DataFrame differ from a SAS Dataset?
       • It is built from the ground up to be distributed and can be processed in parallel by multiple machines, whereas a SAS Dataset has non-distributed roots
       • It is a logical entity that is not necessarily paired with a serialized on-disk version, whereas a SAS Dataset has an on-disk manifestation
  • 10. Spark SQL vs. SAS Proc SQL
       • My first reaction to Spark SQL was, “This looks like Proc SQL”
       • SAS Proc SQL Simple Example:
           libname Example 'c:\SASPROJECTS';
           proc sql;
             create table newtable as
             select a.*, b.unique_consumer_id
             from Example.transactions as a, Example.consumer as b
             where a.ref_id=b.ref_id;
           quit;
  • 11. Spark SQL vs. SAS Proc SQL
       • Spark SQL Simple Example:
           val sqlContext = new org.apache.spark.sql.SQLContext(sc)
           // transactions and consumer must already be registered as tables
           val newtable = sqlContext.sql("""
             select a.*, b.unique_consumer_id
             from transactions as a, consumer as b
             where a.ref_id=b.ref_id""")
  • 12. Spark MLlib vs. SAS Stats
       • Spark MLlib
         • https://spark.apache.org/docs/latest/mllib-guide.html
         • Spark’s scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction
       • SAS Stats
         • Traditional add-on package to SAS for statistics
  • 13. Spark MLlib Example: Data Prep
           // meetup4 and meetup7 are built in steps not shown on the slide
           case class Meetup(mdatehr: String, mdate: String, mhour: String)
           val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
           meetup5.registerTempTable("meetup5")
           val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
           meetup6.registerTempTable("meetup6")
           sqlContext.sql("cache table meetup6")
           val trainingData = meetup7.map { row =>
             val features = Array[Double](row(24).toString().toDouble,row(0).toString().toDouble, …
             LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
  • 14. Spark MLlib Example: Regression Model
           val trainingData = meetup7.map { row =>
             val features = Array[Double](1.0,row(0).toString().toDouble,row(1).toString().toDouble, …
             LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
           val model = new RidgeRegressionWithSGD().run(trainingData)
           val scores = meetup7.map { row =>
             val features = Vectors.dense(Array[Double](1.0,row(0).toString().toDouble, … row(23).toString().toDouble))
             (row(25),row(26),row(27), model.predict(features))}
           scores.foreach(println)
  • 15. Spark Streaming
           // 10-second batch interval
           val ssc = new StreamingContext(sc, Seconds(10))
           // Arguments: ZooKeeper quorum, consumer group id, topic -> receiver thread count
           val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2)
           val sqlContext = new org.apache.spark.sql.SQLContext(sc)
           lines.foreachRDD(rdd => {
             val lines2 = sqlContext.jsonRDD(rdd)
             lines2.registerTempTable("lines2")
             val lines3 = sqlContext.sql("""select event.event_id,event.event_name,event.event_url,
               event.time,guests, member.member_id, member.member_name,
               member.other_services.facebook.identifier as facebook_identifier,
               member.other_services.linkedin.identifier as linkedin_identifier,
               member.other_services.twitter.identifier as twitter_identifier,member.photo,
               mtime,response, rsvp_id,venue.lat,venue.lon,venue.venue_id,venue.venue_name,visibility
               from lines2""")
             //PERFORM LOGIC HERE LIKE STREAMING REGRESSION
           })
           ssc.start()
  • 16. Key Takeaways
       • If you work with large data or compute-intensive advanced analytics and want a platform built from the ground up to run faster on distributed servers, try out Spark
       • If you would like more control over your code than an added macro language provides, try out Spark
       • If you want to better leverage data stored in Hadoop, try out Spark
       • If you prefer the open source licensing model over a subscription model, try out Spark
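
Slides 6 and 7 count rows per state, first with a SAS BY-group and then with Spark's groupBy("state").count(). A minimal sketch of the same logic in plain Scala collections may help when checking what either version should return. It assumes no Spark cluster; the names ByGroupCount, Old and countByState are hypothetical and not taken from the deck.

```scala
// Plain Scala collections only; no Spark dependency is assumed here.
object ByGroupCount {
  // Mirrors the OLD case class on slide 6 (renamed for this sketch).
  final case class Old(state: String, accttot: Int)

  // Same ten rows as the slide's data.
  val oldList: List[Old] = List(
    Old("ca", 7000), Old("ca", 6500), Old("ca", 5800),
    Old("nc", 4800), Old("nc", 3640),
    Old("sc", 3520),
    Old("va", 4490), Old("va", 8700), Old("va", 2850), Old("va", 1111)
  )

  // Group rows by state, count each group, then sort by state:
  // a local stand-in for groupBy("state").count() followed by orderBy("state").
  def countByState(rows: List[Old]): List[(String, Int)] =
    rows.groupBy(_.state)
      .map { case (state, group) => (state, group.size) }
      .toList
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    countByState(oldList).foreach(println)
}
```

The grouping key plays the role of the SAS BY variable, and the group size replaces the first.state / last.state counter.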
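
Slides 10 and 11 run the same inner join on ref_id in Proc SQL and in Spark SQL. As a rough illustration of what that query returns, the sketch below performs the join over in-memory Scala collections. The field lists are assumptions: the slides show only ref_id and unique_consumer_id, so amount, the sample rows and the names RefIdJoin, Transaction and Consumer are hypothetical.

```scala
// Plain Scala collections only; no Spark dependency is assumed here.
object RefIdJoin {
  // Hypothetical minimal schemas for the two tables named on the slides.
  final case class Transaction(refId: Int, amount: Double)
  final case class Consumer(refId: Int, uniqueConsumerId: String)

  // Inner join: keep every (transaction, consumer) pair with matching refId,
  // and return the transaction plus the consumer's unique id,
  // like "select a.*, b.unique_consumer_id ... where a.ref_id=b.ref_id".
  def join(ts: Seq[Transaction], cs: Seq[Consumer]): Seq[(Transaction, String)] =
    for {
      t <- ts
      c <- cs
      if t.refId == c.refId
    } yield (t, c.uniqueConsumerId)

  def main(args: Array[String]): Unit = {
    val transactions = Seq(Transaction(1, 9.5), Transaction(2, 3.0), Transaction(3, 1.0))
    val consumers = Seq(Consumer(1, "c-100"), Consumer(2, "c-200"))
    join(transactions, consumers).foreach(println)
  }
}
```

A transaction with no matching consumer (refId 3 in the sample rows) is dropped, which is the inner-join behaviour of the comma-separated FROM clause with a WHERE equality.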
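
Slide 14 fits a model with RidgeRegressionWithSGD and passes a leading 1.0 in each feature vector, presumably as an intercept term. The sketch below shows the underlying idea, least squares with an L2 penalty minimized by gradient descent, in plain Scala. It is not MLlib's implementation: it uses full-batch gradient descent with a fixed step size, and the names RidgeSketch, fit and predict, the toy data and the parameter values are all assumptions made for illustration.

```scala
// Plain Scala only; this is an illustrative sketch, not MLlib's algorithm.
object RidgeSketch {
  // Dot product of a weight vector and a feature vector.
  def predict(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // Minimizes (1/2n) * sum((w.x - y)^2) + (lambda/2) * ||w||^2
  // by full-batch gradient descent with a fixed step size.
  def fit(xs: Array[Array[Double]], ys: Array[Double],
          lambda: Double, step: Double, iters: Int): Array[Double] = {
    val n = xs.length
    val d = xs.head.length
    var w = Array.fill(d)(0.0)
    for (_ <- 0 until iters) {
      // Average gradient of the squared-error term over all rows.
      val grad = Array.fill(d)(0.0)
      for (i <- 0 until n) {
        val err = predict(w, xs(i)) - ys(i)
        for (j <- 0 until d) grad(j) += err * xs(i)(j) / n
      }
      // The L2 penalty contributes lambda * w to the gradient.
      w = Array.tabulate(d)(j => w(j) - step * (grad(j) + lambda * w(j)))
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Toy data generated from y = 1 + 2x; the leading 1.0 is the intercept feature.
    val xs = Array(Array(1.0, 0.0), Array(1.0, 1.0), Array(1.0, 2.0), Array(1.0, 3.0))
    val ys = Array(1.0, 3.0, 5.0, 7.0)
    println(fit(xs, ys, lambda = 0.0, step = 0.1, iters = 5000).mkString(", "))
    println(fit(xs, ys, lambda = 1.0, step = 0.1, iters = 5000).mkString(", "))
  }
}
```

With lambda set to 0 the fit reduces to ordinary least squares; a positive lambda shrinks the weights toward zero. In this sketch the penalty also applies to the intercept weight, which is a simplification.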
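
Slide 15 creates a StreamingContext with Seconds(10), so the stream is processed as a sequence of small batches, one per 10-second interval, and foreachRDD runs once per batch. The sketch below imitates that slicing on an in-memory list. It is an approximation under stated assumptions: it buckets by a timestamp carried on each event, whereas Spark Streaming batches by when data is received, and the names MicroBatches, Event and batch are hypothetical.

```scala
// Plain Scala collections only; no Spark Streaming dependency is assumed here.
object MicroBatches {
  // Hypothetical event shape: a millisecond timestamp and a raw payload.
  final case class Event(timestampMs: Long, payload: String)

  // Assign each event to a batch index (timestamp divided by the interval)
  // and collect the payloads of each batch in arrival order.
  def batch(events: Seq[Event], batchMs: Long): Map[Long, Seq[String]] =
    events.groupBy(_.timestampMs / batchMs)
      .map { case (index, es) => (index, es.map(_.payload)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0L, "a"), Event(9999L, "b"), Event(10000L, "c"), Event(25000L, "d"))
    // 10000 ms matches the Seconds(10) interval on the slide.
    batch(events, 10000L).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Each map entry corresponds loosely to one RDD handed to foreachRDD, which is where the slide's per-batch SQL and any streaming regression logic would run.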

Editor's Notes

  1. Retailer Inventory Mgmt
  2. SHOW of HANDS for SAS vs Spark development SparkR
  3. A Spark DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. A SAS Dataset also contains descriptor information such as the data types and lengths of the variables, as well as which engine was used to create the data.
  4. Mention addition of Windowing functions in 1.4 and possibly pivot/transpose