1. Hadoop and Spark for the SAS Developer
Richard Williamson | @superhadooper
10 June 2015
2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
3.
My Background
Overview: SAS vs. Spark
Spark DataFrame vs. SAS Dataset
Spark SQL vs. SAS Proc SQL
Spark MLlib vs. SAS Stats
Spark Streaming
Questions?
AGENDA
4.
* http://en.wikipedia.org/wiki/SAS_%28software%29
** http://techcrunch.com/2015/03/19/on-the-growth-of-apache-spark
OVERVIEW: SAS vs. Spark
SAS
• SAS is the largest market-share holder in "advanced
analytics" with 36.2% of the market as of 2012.*
Spark
• Launched in U.C. Berkeley’s AMPLab in 2009, Apache Spark
has begun to catch on like wildfire during the last year and a
half. Spark had more than 465 contributors in 2014, making it
the most active project in the Apache Software Foundation
and among big data open source projects globally.**
5.
OVERVIEW: SAS vs. Spark
SAS
• Basic programming model consists of the SAS DATA step
and SAS procedures (PROCs)
• SAS Datasets move data between processing steps
Spark
• Native language is Scala—allows generic data types
and flexible programming model (Java and Python
also supported)
• RDDs (and now DataFrames) are used to move
distributed datasets between processing steps
6.
OVERVIEW: SAS vs. Spark
SAS Code Snippet
http://support.sas.com/kb/24/595.html
data old;
input state $ accttot;
datalines;
ca 7000
ca 6500
ca 5800
nc 4800
nc 3640
sc 3520
va 4490
va 8700
va 2850
va 1111
;
run;
Spark Code Snippet
import sqlContext.implicits._
case class OLD(state: String, accttot: Int)
val oldList = List(
OLD("va",1111),
OLD("ca",7000),
OLD("ca",6500),
OLD("ca",5800),
OLD("nc",4800),
OLD("nc",3640),
OLD("sc",3520),
OLD("va",4490),
OLD("va",8700),
OLD("va",2850)
)
7.
OVERVIEW: SAS vs. Spark
SAS Code Snippet
http://support.sas.com/kb/24/595.html
proc sort data=old;
by state;
run;
data new;
set old (drop=accttot);
by state;
if first.state then count=0;
count+1;
if last.state then output;
run;
proc freq;
tables state / out=new(drop=percent);
run;
Spark Code Snippet
val oldRDD = sc.parallelize(oldList)
var oldDataFrame = oldRDD.toDF()
oldDataFrame = oldDataFrame.orderBy("state")
oldRDD.map(o => (o.state, o.accttot)).aggregateByKey(0)(
(buffer, value) => buffer + value, (b1, b2) => b1 + b2).foreach(println)
val newDataFrame = oldDataFrame.groupBy("state").count()
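Both snippets compute per-state counts (and the aggregateByKey call sums account totals). As a language-neutral sketch of the same by-group logic, here it is in plain stdlib Python on the toy dataset from the previous slide:

```python
from collections import Counter

# Toy (state, accttot) rows from the previous slide.
old = [("va", 1111), ("ca", 7000), ("ca", 6500), ("ca", 5800),
       ("nc", 4800), ("nc", 3640), ("sc", 3520), ("va", 4490),
       ("va", 8700), ("va", 2850)]

# Equivalent of Spark's groupBy("state").count() / SAS PROC FREQ:
counts = Counter(state for state, _ in old)

# Equivalent of the aggregateByKey sum over accttot:
totals = {}
for state, accttot in old:
    totals[state] = totals.get(state, 0) + accttot

print(sorted(counts.items()))  # [('ca', 3), ('nc', 2), ('sc', 1), ('va', 4)]
print(sorted(totals.items()))  # [('ca', 19300), ('nc', 8440), ('sc', 3520), ('va', 17151)]
```

The difference, of course, is that Spark runs the same grouping in parallel across a cluster.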
8.
Spark DataFrame vs. SAS Dataset
Spark DataFrame — a distributed collection of data organized
into named columns.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#rename-of-schemardd-to-dataframe
SAS Dataset — a SAS file stored in a SAS library organized as a
table of observations (rows) and variables (columns) that can
be processed by SAS software.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001005709.htm
9.
How does Spark DataFrame differ from SAS Dataset?
• Built from the ground up to be distributed and processed in
parallel by multiple machines, whereas the SAS Dataset has
non-distributed roots
• A logical entity that is not necessarily paired with a serialized
on-disk version, whereas a SAS Dataset has an on-disk
manifestation
Spark DataFrame vs. SAS Dataset
10.
Spark SQL vs. SAS Proc SQL
• My first reaction to Spark SQL was, “This looks like Proc SQL”
• SAS Proc SQL Simple Example:
libname Example 'c:\SASPROJECTS';
proc sql;
create table newtable
as select a.*, b.unique_consumer_id
from Example.transactions as a, Example.consumer as b
where a.ref_id=b.ref_id;
quit;
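The same inner join can be sketched in standard SQL with Python's built-in sqlite3 module; the toy transactions/consumer tables here are hypothetical stand-ins for the SAS library datasets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-ins for Example.transactions and Example.consumer.
cur.execute("CREATE TABLE transactions (ref_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE consumer (ref_id INTEGER, unique_consumer_id TEXT)")
cur.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, 19.99), (2, 5.00), (3, 42.50)])
cur.executemany("INSERT INTO consumer VALUES (?, ?)",
                [(1, "C-100"), (2, "C-200")])

# Same shape as the PROC SQL create-table-as-select above:
cur.execute("""CREATE TABLE newtable AS
               SELECT a.*, b.unique_consumer_id
               FROM transactions AS a, consumer AS b
               WHERE a.ref_id = b.ref_id""")
rows = cur.execute("SELECT * FROM newtable ORDER BY ref_id").fetchall()
print(rows)  # [(1, 19.99, 'C-100'), (2, 5.0, 'C-200')]
```

Note the implicit inner join: ref_id 3 has no matching consumer and is dropped, exactly as in the PROC SQL version.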
11.
Spark SQL vs. SAS Proc SQL
• Spark SQL Simple Example:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val newtable = sqlContext.sql("""
select a.*, b.unique_consumer_id
from transactions as a, consumer as b
where a.ref_id = b.ref_id""")
12.
Spark MLlib vs. SAS Stats
• Spark MLlib
• https://spark.apache.org/docs/latest/mllib-guide.html
• Spark’s scalable machine learning library consisting of
common learning algorithms and utilities, including
classification, regression, clustering, collaborative filtering,
and dimensionality reduction
• SAS Stats
• The traditional add-on package (SAS/STAT) for statistics
13.
Spark MLlib Example Data Prep
case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")
val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")
val trainingData = meetup7.map { row =>
val features = Array[Double](row(24).toString().toDouble,row(0).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
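The trainingData step above just pairs a target column with a dense feature vector (what MLlib's LabeledPoint captures). A pure-Python sketch of that shape, with made-up column positions since the real indices above are elided:

```python
# Each row becomes (label, feature_vector) -- the shape that MLlib's
# LabeledPoint captures. Column positions here are illustrative only.
rows = [
    [3.0, 1.0, 250.0],  # features..., label in the last column
    [4.0, 0.0, 180.0],
]
training_data = [(row[-1], row[:-1]) for row in rows]
print(training_data)  # [(250.0, [3.0, 1.0]), (180.0, [4.0, 0.0])]
```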
14.
Spark MLlib Example Regression Model
val trainingData = meetup7.map { row =>
val features = Array[Double](1.0,row(0).toString().toDouble,row(1).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
val model = new RidgeRegressionWithSGD().run(trainingData)
val scores = meetup7.map { row =>
val features = Vectors.dense(Array[Double](1.0,row(0).toString().toDouble, …
row(23).toString().toDouble))
(row(25),row(26),row(27), model.predict(features))}
scores.foreach(println)
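RidgeRegressionWithSGD iteratively approximates the objective that ridge regression also solves in closed form, w = (XᵀX + λI)⁻¹Xᵀy. A minimal stdlib-Python sketch of that math on toy one-feature data (not the MLlib API):

```python
# Ridge regression on one feature plus an intercept, solved in closed
# form. Toy data; lam is the regularization strength.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]           # roughly y = 2x
lam = 0.1

n = len(xs)
# Entries of X'X + lam*I for design matrix rows [1, x_i]:
a = n + lam                          # sum of 1*1, regularized
b = sum(xs)                          # sum of 1*x
d = sum(x * x for x in xs) + lam     # sum of x*x, regularized
# Entries of X'y:
p = sum(ys)
q = sum(x * y for x, y in zip(xs, ys))
# Solve the 2x2 system [[a, b], [b, d]] [w0, w1] = [p, q]:
det = a * d - b * b
w0 = (d * p - b * q) / det           # intercept
w1 = (a * q - b * p) / det           # slope, ~1.93 for this data
print(round(w0, 2), round(w1, 2))
```

SGD reaches (approximately) the same weights without materializing XᵀX, which is what makes it practical on distributed data.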
15.
Spark Streaming
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
lines.foreachRDD(rdd => {
val lines2 = sqlContext.jsonRDD(rdd)
lines2.registerTempTable("lines2")
val lines3 = sqlContext.sql("""select event.event_id, event.event_name, event.event_url, event.time, guests,
member.member_id, member.member_name, member.other_services.facebook.identifier as facebook_identifier,
member.other_services.linkedin.identifier as linkedin_identifier, member.other_services.twitter.identifier as twitter_identifier,
member.photo, mtime, response,
rsvp_id, venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility from lines2""")
//PERFORM LOGIC HERE LIKE STREAMING REGRESSION
})
ssc.start()
ssc.awaitTermination()
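Spark Streaming's model is micro-batching: the DStream above is handed to foreachRDD as a sequence of small RDDs, one per 10-second interval. The batching idea itself can be sketched in plain Python (a toy stand-in, not the Spark API):

```python
import itertools

def micro_batches(stream, batch_size):
    """Yield fixed-size batches from an (in principle unbounded) stream,
    the way foreachRDD sees one small RDD per interval."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch  # per-batch logic (SQL, scoring) would run here

events = ({"rsvp_id": i} for i in range(7))  # toy RSVP event stream
batches = list(micro_batches(events, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

In Spark the batch boundary is time-based rather than count-based, but the per-batch processing loop has the same shape.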
16.
Key Takeaways
• If you work with large data or compute-intensive advanced
analytics and want a platform built from the ground up to
run faster on distributed servers, try out Spark
• If you would like more control over your code than an
added macro language provides, try out Spark
• If you want to better leverage data stored in Hadoop, try
out Spark
• If you prefer the open-source licensing model over a
subscription model, try out Spark
Editor's Notes
• Retailer inventory mgmt. SHOW of HANDS for SAS vs. Spark development.
• SparkR. A Spark DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
• A SAS Dataset also contains descriptor information such as the data types and lengths of the variables, as well as which engine was used to create the data.
• Mention addition of windowing functions in 1.4 and possibly pivot/transpose.