SlideShare a Scribd company logo
1 of 21
Download to read offline
Road to
Analytics
Contents
1
2
3
SAS vs Spark
SAS Proc SQL vs Spark SQL
Advantage Analytics
1. SAS vs Spark
OVERVIEW
SAS
○ The largest independent
vendor in “advanced analytics”
○ 1976 foundation of the SAS
Institute, Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for
large-scale data processing
○ Started in 2009 as a research
project in the UC Berkeley,
AMPLab
○ Open source
CODE
SAS
Basic programming model
consists of code blocks:
○ SAS Data Step
■ generation of data
■ concatenation of data
○ SAS PROCedures
■ special functionalities
SPARK
“Line based” programming
Native Language is Scala, but
flexible programming model:
○ Scala
○ Java
○ Python
○ R
DATA
SAS: DATASET
○ Computed in memory (RAM)
○ A data set contains:
● observations: organized in
rows
● variables: organized in
columns
SPARK: DATAFRAME
○ A distributed collection of data
organized into named columns.
○ It is conceptually equivalent to:
table in a relational database and
dataframe in R/Python
○ It is a programming abstraction
IMMUTABLE,
PARTITIONED,
DISTRIBUTED
DATA STRUCTURE
Transformations like: map,
filter, union, join, group
by… results in an other
dataset
SAS:
data sasData
set sasData;
Fare2 = Fare + 2;
run;
Python Pandas:
pandasDF['Fare2'] = pandasDF['Fare']+2
Spark:
sparkDF = sparkDF
.withColumn('Fare2',sparkDF['Fare']+2)
NOTEBOOK
IMMUTABLE,
PARTITIONED,
DISTRIBUTED
DATA STRUCTURE
READ SAS DATASETS
The SAS-FILE (sas7bdat) is a file with special structure
created by SAS and binary stored
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
○
2. SAS Proc SQL vs
Spark SQL
SQL sentences
SAS ProC SQL
SAS Procedure that combines the
functionality of DATA and PROC
steps. It can sort, summarize,
subset, join, concatenate
datasets, create new variables...
Spark SQL
○ Spark’s interface for working
with structured and
semi-structured data, query
using SQL
○ Load data from JSON, Hive,
Parquet
○ Evaluated “lazily”
SQL sentences
SAS ProC SQL
PROC SQL;
CREATE TABLE newTable AS
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns;
QUIT;
Spark SQL
sqlContext = new
org.apache.spark.sql.SQLContext(sc)
newTable = sqlContext.sql(“
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns”)
NOTEBOOK
AGGREGATE FUNCTION
IN SPARK SQL
sum, avg, mean, count,
max, min, first, last,
sttedev, variance,
skewness, kurtosis…
After aggregation
Act on each group of
data, return a single
value as a result
WINDOW FUNCTION IN
SPARK SQL
Ranking: rank, dense_rank,
percent_rank, ntile,
row_number
Analytics: cume_dist, lag,
first_value, last_value, lead
Aggregate: aggregate funcs
Calculate a return value
over a set of rows called
window that are
somehow related to the
current
NOTEBOOK
EXTEND SPARK SQL
Standard functions are over 100 functions
(pyspark)
from pyspark.sql.functions import *
BUILT-IN FUNCTIONS,
UDFs
“User Defined Function”
Define new Column-based
functions that extend the
vocabulary of Spark
Act on a single row as an
input, single return value for
every input row
NOTEBOOK
TIPS
○ Not thinking in sorted data. In parallel process we can’t acces per row.
○ Cache tables/DFs when they are used more than once
○ Merge doesn’t need ordered data as SAS
○ Use functions already defined instead of creating your own UDF
○ Save data in columnar format as Parquet
○ Avoid collecting data when you are working with Big Data, take a sample
3. Advantage Analytics
ADVANTAGE ANALYTICS
SAS Stats
Traditional Add-on package to
SAS for Statistics
○ Analysis of variance
○ Bayesian analysis
○ Categorical data analysis
○ Distribution analysis
○ Mixed models
○ Predictive modeling...
Spark MLlib
Scalable machine learning library
○ Basic statistics
○ Classification and regression
○ Collaborative filtering
○ Clustering
○ Dimensionality reduction
○ Feature extraction and
transformation...
BIBLIOGRAPHY
SPARK DOCUMENTATION:
https://spark.apache.org/docs/2.0.0/
PYSPARK API:
https://spark.apache.org/docs/2.0.0/api/python/index.html
PYSPARK FUNCTIONS:
https://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/functions.html
THANKS!
Any questions?
@datiobd
maguilar@datiobd.com
datio-big-data

More Related Content

What's hot

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 

What's hot (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark etl
Spark etlSpark etl
Spark etl
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 

Viewers also liked

DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDatio Big Data
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by DatioDatio Big Data
 
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpotHubSpot
 
PDP Your personal development plan
PDP Your personal development planPDP Your personal development plan
PDP Your personal development planDatio Big Data
 
How to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHow to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHubSpot
 
25 Discovery Call Questions
25 Discovery Call Questions25 Discovery Call Questions
25 Discovery Call QuestionsHubSpot
 
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...HubSpot
 
Class 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessClass 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessHubSpot
 
Behind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoBehind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoHubSpot
 
HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot
 
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...HubSpot
 
What is Inbound Recruiting?
What is Inbound Recruiting?What is Inbound Recruiting?
What is Inbound Recruiting?HubSpot
 
3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful CompaniesHubSpot
 
Add the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonAdd the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonHubSpot
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 

Viewers also liked (17)

Del Mono al QA
Del Mono al QADel Mono al QA
Del Mono al QA
 
DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern apps
 
Security&Governance
Security&GovernanceSecurity&Governance
Security&Governance
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by Datio
 
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 
PDP Your personal development plan
PDP Your personal development planPDP Your personal development plan
PDP Your personal development plan
 
How to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHow to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's Buyer
 
25 Discovery Call Questions
25 Discovery Call Questions25 Discovery Call Questions
25 Discovery Call Questions
 
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
 
Class 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessClass 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your Business
 
Behind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoBehind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot Tokyo
 
HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot Diversity Data 2016
HubSpot Diversity Data 2016
 
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
 
What is Inbound Recruiting?
What is Inbound Recruiting?What is Inbound Recruiting?
What is Inbound Recruiting?
 
3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies
 
Add the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonAdd the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-Thon
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 

Similar to Road to Analytics

Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxshivani22y
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache sparkAmine Sagaama
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cRachelBarker26
 

Similar to Road to Analytics (20)

Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptx
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Spark core
Spark coreSpark core
Spark core
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19c
 

More from Datio Big Data

Descubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDescubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDatio Big Data
 
Learning Python. Level 0
Learning Python. Level 0Learning Python. Level 0
Learning Python. Level 0Datio Big Data
 
How to document without dying in the attempt
How to document without dying in the attemptHow to document without dying in the attempt
How to document without dying in the attemptDatio Big Data
 
Ceph: The Storage System of the Future
Ceph: The Storage System of the FutureCeph: The Storage System of the Future
Ceph: The Storage System of the FutureDatio Big Data
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through MesosDatio Big Data
 
Quality Assurance Glossary
Quality Assurance GlossaryQuality Assurance Glossary
Quality Assurance GlossaryDatio Big Data
 
Gamification: from buzzword to reality
Gamification: from buzzword to realityGamification: from buzzword to reality
Gamification: from buzzword to realityDatio Big Data
 
Pandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationPandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationDatio Big Data
 

More from Datio Big Data (13)

Búsqueda IA
Búsqueda IABúsqueda IA
Búsqueda IA
 
Descubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDescubriendo la Inteligencia Artificial
Descubriendo la Inteligencia Artificial
 
Learning Python. Level 0
Learning Python. Level 0Learning Python. Level 0
Learning Python. Level 0
 
Learn Python
Learn PythonLearn Python
Learn Python
 
How to document without dying in the attempt
How to document without dying in the attemptHow to document without dying in the attempt
How to document without dying in the attempt
 
Developers on test
Developers on testDevelopers on test
Developers on test
 
Ceph: The Storage System of the Future
Ceph: The Storage System of the FutureCeph: The Storage System of the Future
Ceph: The Storage System of the Future
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
 
Datio OpenStack
Datio OpenStackDatio OpenStack
Datio OpenStack
 
Quality Assurance Glossary
Quality Assurance GlossaryQuality Assurance Glossary
Quality Assurance Glossary
 
Data Integration
Data IntegrationData Integration
Data Integration
 
Gamification: from buzzword to reality
Gamification: from buzzword to realityGamification: from buzzword to reality
Gamification: from buzzword to reality
 
Pandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationPandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data Manipulation
 

Recently uploaded

毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 

Recently uploaded (20)

毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 

Road to Analytics

  • 2. Contents 1 2 3 SAS vs Spark SAS Proc SQL vs Spark SQL Advantage Analytics
  • 3. 1. SAS vs Spark
  • 4. OVERVIEW SAS ○ The largest independent vendor in “advanced analytics” ○ 1976 foundation of the SAS Institute, Cary, North Carolina ○ Commercial software product SPARK ○ A fast and general engine for large-scale data processing ○ Started in 2009 as a research project in the UC Berkeley, AMPLab ○ Open source
  • 5. CODE SAS Basic programming model consists of code blocks: ○ SAS Data Step ■ generation of data ■ concatenation of data ○ SAS PROCedures ■ special functionalities SPARK “Line based” programming Native Language is Scala, but flexible programming model: ○ Scala ○ Java ○ Python ○ R
  • 6. DATA SAS: DATASET ○ Computed in memory (RAM) ○ A data set contains: ● observations: organized in rows ● variables: organized in columns SPARK: DATAFRAME ○ A distributed collection of data organized into named columns. ○ It is conceptually equivalent to: table in a relational database and dataframe in R/Python ○ It is a programming abstraction
  • 7. IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE Transformations like: map, filter, union, join, group by… results in an other dataset
  • 8. SAS: data sasData set sasData; Fare2 = Fare + 2; run; Python Pandas: pandasDF['Fare2'] = pandasDF['Fare']+2 Spark: sparkDF = sparkDF .withColumn('Fare2',sparkDF['Fare']+2) NOTEBOOK IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
  • 9. READ SAS DATASETS The SAS-FILE (sas7bdat) is a file with special structure created by SAS and binary stored ● PYTHON: SAS7BDAT PACKAGE ● R: HAVEN LIBRARY ○
  • 10. 2. SAS Proc SQL vs Spark SQL
  • 11. SQL sentences SAS ProC SQL SAS Procedure that combines the functionality of DATA and PROC steps. It can sort, summarize, subset, join, concatenate datasets, create new variables... Spark SQL ○ Spark’s interface for working with structured and semi-structured data, query using SQL ○ Load data from JSON, Hive, Parquet ○ Evaluated “lazily”
  • 12. SQL sentences SAS ProC SQL PROC SQL; CREATE TABLE newTable AS SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns; QUIT; Spark SQL sqlContext = new org.apache.spark.sql.SQLContext(sc) newTable = sqlContext.sql(“ SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns”) NOTEBOOK
  • 13. AGGREGATE FUNCTION IN SPARK SQL sum, avg, mean, count, max, min, first, last, sttedev, variance, skewness, kurtosis… After aggregation Act on each group of data, return a single value as a result
  • 14. WINDOW FUNCTION IN SPARK SQL Ranking: rank, dense_rank, percent_rank, ntile, row_number Analytics: cume_dist, lag, first_value, last_value, lead Aggregate: aggregate funcs Calculate a return value over a set of rows called window that are somehow related to the current NOTEBOOK
  • 15. EXTEND SPARK SQL Standard functions are over 100 functions (pyspark) from pyspark.sql.functions import *
  • 16. BUILT-IN FUNCTIONS, UDFs “User Defined Function” Define new Column-based functions that extend the vocabulary of Spark Act on a single row as an input, single return value for every input row NOTEBOOK
  • 17. TIPS ○ Not thinking in sorted data. In parallel process we can’t acces per row. ○ Cache tables/DFs when they are used more than once ○ Merge doesn’t need ordered data as SAS ○ Use functions already defined instead of creating your own UDF ○ Save data in columnar format as Parquet ○ Avoid collecting data when you are working with Big Data, take a sample
  • 19. ADVANTAGE ANALYTICS SAS Stats Traditional Add-on package to SAS for Statistics ○ Analysis of variance ○ Bayesian analysis ○ Categorical data analysis ○ Distribution analysis ○ Mixed models ○ Predictive modeling... Spark MLlib Scalable machine learning library ○ Basic statistics ○ Classification and regression ○ Collaborative filtering ○ Clustering ○ Dimensionality reduction ○ Feature extraction and transformation...