1. Hadoop and Spark for the SAS Developer
Richard Williamson | @superhadooper
10 June 2015
2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
3.
My Background
Overview: SAS vs. Spark
Spark DataFrame vs. SAS Dataset
Spark SQL vs. SAS Proc SQL
Spark MLlib vs. SAS Stats
Spark Streaming
Questions?
AGENDA
4.
* http://en.wikipedia.org/wiki/SAS_%28software%29
** http://techcrunch.com/2015/03/19/on-the-growth-of-apache-spark
OVERVIEW: SAS vs. Spark
SAS
• SAS is the largest market-share holder in "advanced
analytics" with 36.2% of the market as of 2012.*
Spark
• Launched in U.C. Berkeley’s AMPLab in 2009, Apache Spark
has begun to catch on like wildfire during the last year and a
half. Spark had more than 465 contributors in 2014, making it
the most active project in the Apache Software Foundation
and among big data open source projects globally.**
5.
OVERVIEW: SAS vs. Spark
SAS
• Basic programming model consists of the SAS DATA step
and SAS procedures (PROCs)
• SAS Datasets move data between processing steps
Spark
• Native language is Scala—allows generic data types
and flexible programming model (Java and Python
also supported)
• RDDs (and now DataFrames) are used to move
distributed datasets between processing steps
6.
OVERVIEW: SAS vs. Spark
SAS Code Snippet
http://support.sas.com/kb/24/595.html
data old;
input state $ accttot;
datalines;
ca 7000
ca 6500
ca 5800
nc 4800
nc 3640
sc 3520
va 4490
va 8700
va 2850
va 1111
;
run;
Spark Code Snippet
import sqlContext.implicits._
case class OLD(state: String, accttot: Int)
val oldList = List(
OLD("va",1111),
OLD("ca",7000),
OLD("ca",6500),
OLD("ca",5800),
OLD("nc",4800),
OLD("nc",3640),
OLD("sc",3520),
OLD("va",4490),
OLD("va",8700),
OLD("va",2850)
)
7.
OVERVIEW: SAS vs. Spark
SAS Code Snippet
http://support.sas.com/kb/24/595.html
proc sort data=old;
by state;
run;
data new;
set old (drop=accttot);
by state;
if first.state then count=0;
count+1;
if last.state then output;
run;
proc freq;
tables state / out=new(drop=percent);
run;
Spark Code Snippet
val oldRDD = sc.parallelize(oldList)
var oldDataFrame = oldRDD.toDF()
oldDataFrame = oldDataFrame.orderBy("state")
oldRDD.map(o => (o.state, o.accttot)).aggregateByKey(0)(
(buffer, value) => buffer + value, (b1, b2) => b1 + b2).foreach(println)
val newDataFrame = oldDataFrame.groupBy("state").count()
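Both snippets compute per-state counts (and the aggregateByKey call sums account totals). As a language-neutral sketch of the same by-group logic, here it is in plain stdlib Python on the toy dataset from the previous slide:

```python
from collections import Counter

# Toy (state, accttot) rows from the previous slide.
old = [("va", 1111), ("ca", 7000), ("ca", 6500), ("ca", 5800),
       ("nc", 4800), ("nc", 3640), ("sc", 3520), ("va", 4490),
       ("va", 8700), ("va", 2850)]

# Equivalent of Spark's groupBy("state").count() / SAS PROC FREQ:
counts = Counter(state for state, _ in old)

# Equivalent of the aggregateByKey sum over accttot:
totals = {}
for state, accttot in old:
    totals[state] = totals.get(state, 0) + accttot

print(sorted(counts.items()))  # [('ca', 3), ('nc', 2), ('sc', 1), ('va', 4)]
print(sorted(totals.items()))  # [('ca', 19300), ('nc', 8440), ('sc', 3520), ('va', 17151)]
```

The difference, of course, is that Spark runs the same grouping in parallel across a cluster.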
8.
Spark DataFrame vs. SAS Dataset
Spark DataFrame — a distributed collection of data organized
into named columns.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#rename-of-schemardd-to-dataframe
SAS Dataset — a SAS file stored in a SAS library organized as a
table of observations (rows) and variables (columns) that can
be processed by SAS software.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001005709.htm
9.
How does Spark DataFrame differ from SAS Dataset?
• Built from the ground up to be distributed and processed in
parallel by multiple machines, whereas the SAS Dataset has
non-distributed roots
• A logical entity that is not necessarily paired with a serialized
on-disk version, whereas a SAS Dataset has an on-disk
manifestation
Spark DataFrame vs. SAS Dataset
10.
Spark SQL vs. SAS Proc SQL
• My first reaction to Spark SQL was, “This looks like Proc SQL”
• SAS Proc SQL Simple Example:
libname Example 'c:\SASPROJECTS';
proc sql;
create table newtable
as select a.*, b.unique_consumer_id
from Example.transactions as a, Example.consumer as b
where a.ref_id=b.ref_id;
quit;
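The same inner join can be sketched in standard SQL with Python's built-in sqlite3 module; the toy transactions/consumer tables here are hypothetical stand-ins for the SAS library datasets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-ins for Example.transactions and Example.consumer.
cur.execute("CREATE TABLE transactions (ref_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE consumer (ref_id INTEGER, unique_consumer_id TEXT)")
cur.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, 19.99), (2, 5.00), (3, 42.50)])
cur.executemany("INSERT INTO consumer VALUES (?, ?)",
                [(1, "C-100"), (2, "C-200")])

# Same shape as the PROC SQL create-table-as-select above:
cur.execute("""CREATE TABLE newtable AS
               SELECT a.*, b.unique_consumer_id
               FROM transactions AS a, consumer AS b
               WHERE a.ref_id = b.ref_id""")
rows = cur.execute("SELECT * FROM newtable ORDER BY ref_id").fetchall()
print(rows)  # [(1, 19.99, 'C-100'), (2, 5.0, 'C-200')]
```

Note the implicit inner join: ref_id 3 has no matching consumer and is dropped, exactly as in the PROC SQL version.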
11.
Spark SQL vs. SAS Proc SQL
• Spark SQL Simple Example:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val newtable = sqlContext.sql("""
select a.*, b.unique_consumer_id
from transactions as a, consumer as b
where a.ref_id = b.ref_id""")
12.
Spark MLlib vs. SAS Stats
• Spark MLlib
• https://spark.apache.org/docs/latest/mllib-guide.html
• Spark’s scalable machine learning library consisting of
common learning algorithms and utilities, including
classification, regression, clustering, collaborative filtering,
and dimensionality reduction
• SAS Stats
• The traditional add-on package (SAS/STAT) for statistics
13.
Spark MLlib Example Data Prep
case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")
val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")
val trainingData = meetup7.map { row =>
val features = Array[Double](row(24).toString().toDouble,row(0).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
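The trainingData step above just pairs a target column with a dense feature vector (what MLlib's LabeledPoint captures). A pure-Python sketch of that shape, with made-up column positions since the real indices above are elided:

```python
# Each row becomes (label, feature_vector) -- the shape that MLlib's
# LabeledPoint captures. Column positions here are illustrative only.
rows = [
    [3.0, 1.0, 250.0],  # features..., label in the last column
    [4.0, 0.0, 180.0],
]
training_data = [(row[-1], row[:-1]) for row in rows]
print(training_data)  # [(250.0, [3.0, 1.0]), (180.0, [4.0, 0.0])]
```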
14.
Spark MLlib Example Regression Model
val trainingData = meetup7.map { row =>
val features = Array[Double](1.0,row(0).toString().toDouble,row(1).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
val model = new RidgeRegressionWithSGD().run(trainingData)
val scores = meetup7.map { row =>
val features = Vectors.dense(Array[Double](1.0,row(0).toString().toDouble, …
row(23).toString().toDouble))
(row(25),row(26),row(27), model.predict(features))}
scores.foreach(println)
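RidgeRegressionWithSGD iteratively approximates the objective that ridge regression also solves in closed form, w = (XᵀX + λI)⁻¹Xᵀy. A minimal stdlib-Python sketch of that math on toy one-feature data (not the MLlib API):

```python
# Ridge regression on one feature plus an intercept, solved in closed
# form. Toy data; lam is the regularization strength.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]           # roughly y = 2x
lam = 0.1

n = len(xs)
# Entries of X'X + lam*I for design matrix rows [1, x_i]:
a = n + lam                          # sum of 1*1, regularized
b = sum(xs)                          # sum of 1*x
d = sum(x * x for x in xs) + lam     # sum of x*x, regularized
# Entries of X'y:
p = sum(ys)
q = sum(x * y for x, y in zip(xs, ys))
# Solve the 2x2 system [[a, b], [b, d]] [w0, w1] = [p, q]:
det = a * d - b * b
w0 = (d * p - b * q) / det           # intercept
w1 = (a * q - b * p) / det           # slope, ~1.93 for this data
print(round(w0, 2), round(w1, 2))
```

SGD reaches (approximately) the same weights without materializing XᵀX, which is what makes it practical on distributed data.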
15.
Spark Streaming
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
lines.foreachRDD(rdd => {
val lines2 = sqlContext.jsonRDD(rdd)
lines2.registerTempTable("lines2")
val lines3 = sqlContext.sql("""select event.event_id, event.event_name, event.event_url, event.time, guests,
member.member_id, member.member_name, member.other_services.facebook.identifier as facebook_identifier,
member.other_services.linkedin.identifier as linkedin_identifier, member.other_services.twitter.identifier as twitter_identifier,
member.photo, mtime, response,
rsvp_id, venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility from lines2""")
//PERFORM LOGIC HERE LIKE STREAMING REGRESSION
})
ssc.start()
ssc.awaitTermination()
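Spark Streaming's model is micro-batching: the DStream above is handed to foreachRDD as a sequence of small RDDs, one per 10-second interval. The batching idea itself can be sketched in plain Python (a toy stand-in, not the Spark API):

```python
import itertools

def micro_batches(stream, batch_size):
    """Yield fixed-size batches from an (in principle unbounded) stream,
    the way foreachRDD sees one small RDD per interval."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch  # per-batch logic (SQL, scoring) would run here

events = ({"rsvp_id": i} for i in range(7))  # toy RSVP event stream
batches = list(micro_batches(events, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

In Spark the batch boundary is time-based rather than count-based, but the per-batch processing loop has the same shape.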
16.
Key Takeaways
• If you work with large data or compute-intensive advanced
analytics and want a platform built from the ground up to
run faster on distributed servers, try out Spark
• If you would like more control over your code than an
added macro language provides, try out Spark
• If you want to better leverage data stored in Hadoop, try
out Spark
• If you prefer the open-source licensing model over a
subscription model, try out Spark
Editor's Notes
• Retailer inventory mgmt. SHOW of HANDS for SAS vs. Spark development.
• SparkR. A Spark DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
• A SAS Dataset also contains descriptor information such as the data types and lengths of the variables, as well as which engine was used to create the data.
• Mention addition of windowing functions in 1.4 and possibly pivot/transpose.