4. OVERVIEW
SAS
○ The largest independent
vendor in “advanced analytics”
○ Founded in 1976 as the SAS Institute, Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for
large-scale data processing
○ Started in 2009 as a research project at UC Berkeley's AMPLab
○ Open source
5. CODE
SAS
Basic programming model
consists of code blocks:
○ SAS Data Step
■ generation of data
■ concatenation of data
○ SAS PROCedures
■ special functionalities
SPARK
“Line-based” programming.
The native language is Scala, but the programming model is flexible:
○ Scala
○ Java
○ Python
○ R
6. DATA
SAS: DATASET
○ Computed in memory (RAM)
○ A data set contains:
● observations: organized in
rows
● variables: organized in
columns
SPARK: DATAFRAME
○ A distributed collection of data
organized into named columns.
○ Conceptually equivalent to a table in a relational database or a data frame in R/Python
○ It is a programming abstraction (sketched below)
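A rough PySpark sketch of such a DataFrame (the Name/Fare columns, the values, and the standalone SparkContext setup are only illustrative; in a notebook sc and sqlContext are usually provided):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="example")
sqlContext = SQLContext(sc)

# Two observations (rows) with two variables (named columns)
sparkDF = sqlContext.createDataFrame(
    [("Allen", 7.25), ("Cumings", 71.28)],
    ["Name", "Fare"])
sparkDF.printSchema()
sparkDF.show()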
8. SAS:
data sasData;
set sasData;
Fare2 = Fare + 2;
run;
Python Pandas:
pandasDF['Fare2'] = pandasDF['Fare']+2
Spark:
sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)
NOTEBOOK
IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
9. READ SAS DATASETS
The SAS file (sas7bdat) is a binary file with a special structure created by SAS
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
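For example, a minimal sketch with the Python sas7bdat package ('myData.sas7bdat' is a placeholder file name):

from sas7bdat import SAS7BDAT

# Read the binary SAS file into a pandas DataFrame
with SAS7BDAT('myData.sas7bdat') as reader:
    pandasDF = reader.to_data_frame()

# pandas also ships a reader for the same format:
# pandasDF = pandas.read_sas('myData.sas7bdat')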
11. SQL sentences
SAS PROC SQL
SAS Procedure that combines the
functionality of DATA and PROC
steps. It can sort, summarize,
subset, join, concatenate
datasets, create new variables...
Spark SQL
○ Spark’s interface for working with structured and semi-structured data, queried using SQL
○ Load data from JSON, Hive,
Parquet
○ Evaluated “lazily”
12. SQL sentences
SAS PROC SQL
PROC SQL;
CREATE TABLE newTable AS
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns;
QUIT;
Spark SQL
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val newTable = sqlContext.sql("""
  SELECT Columns
  FROM Table
  WHERE Column > Value
  GROUP BY Columns""")
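The query above assumes the data is already visible to Spark SQL as a table; a minimal PySpark sketch of that step, reusing the illustrative DataFrame from before (table and column names are placeholders):

# Register an existing DataFrame so it can be queried by name
sparkDF.registerTempTable("titanic")   # createOrReplaceTempView in Spark 2+

newTable = sqlContext.sql("""
    SELECT Name, avg(Fare) AS avgFare
    FROM titanic
    WHERE Fare > 0
    GROUP BY Name""")
newTable.show()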
NOTEBOOK
13. AGGREGATE FUNCTION
IN SPARK SQL
sum, avg, mean, count, max, min, first, last, stddev, variance, skewness, kurtosis…
Aggregate functions act on each group of rows and return a single value per group (see the sketch below).
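A hedged PySpark sketch of such a grouped aggregation (the grouping and column names are illustrative):

from pyspark.sql import functions as F

# Each group of rows collapses to a single output row
aggDF = sparkDF.groupBy("Name").agg(
    F.count("*").alias("n"),
    F.avg("Fare").alias("avgFare"),
    F.max("Fare").alias("maxFare"))
aggDF.show()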
14. WINDOW FUNCTION IN
SPARK SQL
Ranking: rank, dense_rank,
percent_rank, ntile,
row_number
Analytics: cume_dist, lag,
first_value, last_value, lead
Aggregate: aggregate funcs
Calculate a return value over a set of rows, called a window, that are somehow related to the current row (see the sketch below)
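A hedged PySpark sketch of a ranking window function (column names are illustrative; on Spark 1.x window functions required a HiveContext):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Each row gets a value computed over the window of rows related to it
w = Window.orderBy(F.desc("Fare"))
ranked = sparkDF.withColumn("fareRank", F.rank().over(w))
ranked.show()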
NOTEBOOK
15. EXTEND SPARK SQL
There are over 100 standard functions (pyspark):
from pyspark.sql.functions import *
16. BUILT-IN FUNCTIONS,
UDFs
“User Defined Functions”
Define new column-based functions that extend the vocabulary of Spark SQL
Act on a single row as input, returning a single value for every input row (see the sketch below)
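A hedged PySpark sketch of a UDF (the conversion itself is only an illustration; as the tips below note, prefer a built-in function when one exists):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Called once per input row, returning one value per row
to_eur = udf(lambda fare: fare * 0.9 if fare is not None else None, DoubleType())
sparkDF = sparkDF.withColumn("FareEUR", to_eur(sparkDF["Fare"]))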
NOTEBOOK
17. TIPS
○ Do not think in terms of sorted data: in a parallel process we cannot access data row by row
○ Cache tables/DataFrames when they are used more than once
○ Merges (joins) do not need sorted data, unlike in SAS
○ Use the functions already defined instead of creating your own UDFs
○ Save data in a columnar format such as Parquet
○ Avoid collecting data when you are working with Big Data; take a sample instead (see the sketch below)
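A few of these tips as a hedged PySpark sketch (the output path and the sampling fraction are placeholders):

sparkDF.cache()                                  # reuse without recomputation
sparkDF.write.parquet("/tmp/titanic.parquet")    # columnar storage
sampleDF = sparkDF.sample(False, 0.01)           # inspect a sample instead of collect()
sampleDF.show(5)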