Road to
Analytics
Contents
1. SAS vs Spark
2. SAS Proc SQL vs Spark SQL
3. Advantage Analytics
1. SAS vs Spark
OVERVIEW
SAS
○ The largest independent
vendor in “advanced analytics”
○ SAS Institute founded in 1976 in
Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for
large-scale data processing
○ Started in 2009 as a research
project at UC Berkeley's
AMPLab
○ Open source
CODE
SAS
Basic programming model
consists of code blocks:
○ SAS Data Step
■ generation of data
■ concatenation of data
○ SAS PROCedures
■ special functionalities
SPARK
“Line based” programming
Native Language is Scala, but
flexible programming model:
○ Scala
○ Java
○ Python
○ R
DATA
SAS: DATASET
○ Stored on disk by default; the DATA step processes one observation at a time in memory (RAM)
○ A data set contains:
● observations: organized in
rows
● variables: organized in
columns
SPARK: DATAFRAME
○ A distributed collection of data
organized into named columns.
○ It is conceptually equivalent to:
table in a relational database and
dataframe in R/Python
○ It is a programming abstraction
IMMUTABLE,
PARTITIONED,
DISTRIBUTED
DATA STRUCTURE
Transformations such as map,
filter, union, join and group
by return a new
dataset
SAS:
data sasData;
set sasData;
Fare2 = Fare + 2;
run;
Python Pandas:
pandasDF['Fare2'] = pandasDF['Fare']+2
Spark:
sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)
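A minimal pure-Python sketch of what "immutable" means for the snippets above: a transformation never edits the existing rows, it returns a new collection. The helper `with_column` and the sample fares are hypothetical and only mimic the idea behind `withColumn`; this is not Spark code.

```python
from types import MappingProxyType


def with_column(rows, name, fn):
    """Return a NEW tuple of read-only rows with one extra column."""
    return tuple(MappingProxyType({**row, name: fn(row)}) for row in rows)


# Hypothetical sample data; the 'Fare' column name follows the slide.
fares = tuple(MappingProxyType({"Fare": f}) for f in (7.25, 71.5, 8.0))

# The "transformation" produces a second dataset and leaves `fares` as it was.
fares2 = with_column(fares, "Fare2", lambda row: row["Fare"] + 2)

print([dict(r) for r in fares2])
print("Fare2" in fares[0])  # False: the original rows are untouched
```

Reassigning the result to the same name, as the Spark line does, only rebinds the variable; the earlier dataset is not modified.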
NOTEBOOK
READ SAS DATASETS
A SAS file (sas7bdat) is a binary file with a SAS-specific
structure
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
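As a hedged sketch of reading such a file from Python: instead of the sas7bdat package named above, this uses `pandas.read_sas`, which is a different route to the same result. The function name `load_sas_dataset` and the file name in the comment are placeholders, not part of the original deck.

```python
def load_sas_dataset(path, encoding=None):
    """Read a .sas7bdat file into a pandas DataFrame (requires pandas)."""
    import pandas as pd  # imported lazily so this sketch loads without pandas
    return pd.read_sas(path, format="sas7bdat", encoding=encoding)


# Hypothetical usage; the file name is a placeholder:
# df = load_sas_dataset("titanic.sas7bdat", encoding="utf-8")
```

The resulting pandas DataFrame could then be handed to Spark, for example with `createDataFrame`, if the data fits on the driver.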
2. SAS Proc SQL vs
Spark SQL
SQL statements
SAS PROC SQL
SAS Procedure that combines the
functionality of DATA and PROC
steps. It can sort, summarize,
subset, join, concatenate
datasets, create new variables...
Spark SQL
○ Spark’s interface for working
with structured and
semi-structured data, query
using SQL
○ Load data from JSON, Hive,
Parquet
○ Evaluated “lazily”
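To make "evaluated lazily" concrete, here is a toy sketch in plain Python. The `LazyDataset` class is hypothetical and far simpler than Spark: transformations only record a plan, and nothing runs until an action (`collect`) is called.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset with a longer plan, computes nothing.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def collect(self):
        """Action: execute the recorded plan and return a list."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def add_two(x):
    calls.append(x)  # record every real invocation
    return x + 2


ds = LazyDataset(range(5)).map(add_two).filter(lambda x: x % 2 == 0)
print(len(calls))  # 0: nothing has been computed yet
result = ds.collect()
print(result)      # [2, 4, 6]
```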
SQL statements
SAS PROC SQL
PROC SQL;
CREATE TABLE newTable AS
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns;
QUIT;
Spark SQL
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val newTable = sqlContext.sql("""
SELECT Columns
FROM Table
WHERE Column > Value
GROUP BY Columns""")
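The templates above can be tried out without SAS or a Spark cluster. The sketch below runs the same CREATE TABLE ... AS SELECT ... WHERE ... GROUP BY shape against Python's built-in sqlite3, which is only a stand-in engine here; the `passengers` table, its columns and the values are hypothetical. Note that a real query has to aggregate or group every selected column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (pclass INTEGER, fare REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [(1, 80.0), (1, 60.0), (2, 20.0), (3, 8.0), (3, 5.0)],
)

# Same structure as the slide: filter rows, group them, store the result.
conn.execute("""
    CREATE TABLE newTable AS
    SELECT pclass, COUNT(*) AS n, AVG(fare) AS avg_fare
    FROM passengers
    WHERE fare > 6
    GROUP BY pclass
""")

rows = conn.execute("SELECT * FROM newTable ORDER BY pclass").fetchall()
print(rows)  # [(1, 2, 70.0), (2, 1, 20.0), (3, 1, 8.0)]
```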
NOTEBOOK
AGGREGATE FUNCTION
IN SPARK SQL
sum, avg, mean, count,
max, min, first, last,
stddev, variance,
skewness, kurtosis…
Aggregate functions act on
each group of data and
return a single value
per group as the result
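A small sketch of "one value per group" in plain Python, using only the standard library rather than Spark. The keys and numbers are made up; `stdev` is the sample standard deviation, which is also what Spark SQL's `stddev` computes.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (group, value) pairs.
rows = [("A", 10.0), ("A", 14.0), ("B", 3.0), ("B", 5.0), ("B", 7.0)]

# Step 1: split the rows into groups.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Step 2: each aggregate collapses a whole group into a single value.
summary = {
    key: {
        "count": len(values),
        "sum": sum(values),
        "avg": mean(values),
        "min": min(values),
        "max": max(values),
        "stddev": stdev(values),
    }
    for key, values in groups.items()
}

for key in sorted(summary):
    print(key, summary[key])
```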
WINDOW FUNCTION IN
SPARK SQL
Ranking: rank, dense_rank,
percent_rank, ntile,
row_number
Analytics: cume_dist, lag,
first_value, last_value, lead
Aggregate: aggregate funcs
Calculate a return value
for every row over a set of
rows, called a window, that
are related to the
current row
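Unlike an aggregate, a window function keeps every input row and adds a value computed from its neighbours. The sketch below imitates `row_number`, `rank` and `lag` in plain Python, partitioned by one key and ordered descending by another. The helper `add_window_columns` and the sample rows are hypothetical, not the Spark API.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"dept": "a", "name": "x", "salary": 100},
    {"dept": "a", "name": "y", "salary": 300},
    {"dept": "a", "name": "z", "salary": 300},
    {"dept": "b", "name": "w", "salary": 50},
]


def add_window_columns(rows, partition_key, order_key):
    """Return new rows with row_number, rank and lag per partition (descending order)."""
    out = []
    by_partition = sorted(rows, key=itemgetter(partition_key))
    for _, part in groupby(by_partition, key=itemgetter(partition_key)):
        ordered = sorted(part, key=itemgetter(order_key), reverse=True)
        rank = 0
        for i, row in enumerate(ordered):
            prev = ordered[i - 1][order_key] if i > 0 else None
            if i == 0 or prev != row[order_key]:
                rank = i + 1  # ties share the rank of the first tied row
            out.append({**row, "row_number": i + 1, "rank": rank, "lag": prev})
    return out


result = add_window_columns(rows, "dept", "salary")
for r in result:
    print(r)
```

The output has as many rows as the input, which is the key difference from GROUP BY.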
NOTEBOOK
EXTEND SPARK SQL
PySpark ships with over 100 standard
functions
from pyspark.sql.functions import *
BUILT-IN FUNCTIONS,
UDFs
“User Defined Function”
Define new Column-based
functions that extend the
vocabulary of Spark
Act on a single row as an
input, single return value for
every input row
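The same idea, one value in and one value out per row, can be sketched with Python's built-in sqlite3, whose `create_function` registers a Python function for use inside SQL. This is a stand-in for illustration only: in PySpark the analogous registration goes through `pyspark.sql.functions.udf`. The function `fare_band`, the threshold and the table are hypothetical.

```python
import sqlite3


def fare_band(fare):
    """Row-wise function: one input value in, one value out."""
    if fare is None:
        return None
    return "high" if fare >= 50 else "low"


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL name, taking one argument.
conn.create_function("fare_band", 1, fare_band)

conn.execute("CREATE TABLE passengers (name TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [("a", 80.0), ("b", 7.5), ("c", None)],
)

banded = conn.execute(
    "SELECT name, fare_band(fare) FROM passengers ORDER BY name"
).fetchall()
print(banded)  # [('a', 'high'), ('b', 'low'), ('c', None)]
```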
NOTEBOOK
TIPS
○ Don't think in terms of sorted data: in parallel processing we can't access data row by row
○ Cache tables/DFs when they are used more than once
○ Unlike in SAS, a merge (join) doesn't need sorted data
○ Use built-in functions instead of creating your own UDFs
○ Save data in a columnar format such as Parquet
○ Avoid collecting data when you are working with Big Data; take a sample instead
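Why a join does not require the user to sort first can be illustrated with a hash join in plain Python. This is only one possible strategy and a sketch: Spark picks its own join strategy internally, and the helper `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two lists of dicts on `key` without sorting either side."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)  # build a lookup table on the join key
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Both inputs are deliberately unsorted.
left = [{"id": 3, "name": "c"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "fare": 20.0}, {"id": 3, "fare": 30.0}]

joined = hash_join(left, right, "id")
print(joined)
```

Matching happens through the lookup table, so the order of the rows in either input does not matter.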
3. Advantage Analytics
ADVANTAGE ANALYTICS
SAS/STAT
Traditional Add-on package to
SAS for Statistics
○ Analysis of variance
○ Bayesian analysis
○ Categorical data analysis
○ Distribution analysis
○ Mixed models
○ Predictive modeling...
Spark MLlib
Scalable machine learning library
○ Basic statistics
○ Classification and regression
○ Collaborative filtering
○ Clustering
○ Dimensionality reduction
○ Feature extraction and
transformation...
BIBLIOGRAPHY
SPARK DOCUMENTATION:
https://spark.apache.org/docs/2.0.0/
PYSPARK API:
https://spark.apache.org/docs/2.0.0/api/python/index.html
PYSPARK FUNCTIONS:
https://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/functions.html
THANKS!
Any questions?
@datiobd
maguilar@datiobd.com
datio-big-data

More Related Content

What's hot

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
Provectus
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
Martin Zapletal
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
Petr Zapletal
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
Prashant Gupta
 

What's hot (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark etl
Spark etlSpark etl
Spark etl
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 

Viewers also liked

Del Mono al QA
Del Mono al QADel Mono al QA
Del Mono al QA
Datio Big Data
 
DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern apps
Datio Big Data
 
Security&Governance
Security&GovernanceSecurity&Governance
Security&Governance
Datio Big Data
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by Datio
Datio Big Data
 
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
HubSpot
 
PDP Your personal development plan
PDP Your personal development planPDP Your personal development plan
PDP Your personal development plan
Datio Big Data
 
How to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHow to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's Buyer
HubSpot
 
25 Discovery Call Questions
25 Discovery Call Questions25 Discovery Call Questions
25 Discovery Call Questions
HubSpot
 
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
HubSpot
 
Class 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessClass 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your Business
HubSpot
 
Behind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoBehind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot Tokyo
HubSpot
 
HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot Diversity Data 2016
HubSpot Diversity Data 2016
HubSpot
 
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
HubSpot
 
What is Inbound Recruiting?
What is Inbound Recruiting?What is Inbound Recruiting?
What is Inbound Recruiting?
HubSpot
 
3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies
HubSpot
 
Add the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonAdd the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-Thon
HubSpot
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 

Viewers also liked (17)

Del Mono al QA
Del Mono al QADel Mono al QA
Del Mono al QA
 
DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern apps
 
Security&Governance
Security&GovernanceSecurity&Governance
Security&Governance
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by Datio
 
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot 10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
10 Things You Didn’t Know About Mobile Email from Litmus & HubSpot
 
PDP Your personal development plan
PDP Your personal development planPDP Your personal development plan
PDP Your personal development plan
 
How to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's BuyerHow to Earn the Attention of Today's Buyer
How to Earn the Attention of Today's Buyer
 
25 Discovery Call Questions
25 Discovery Call Questions25 Discovery Call Questions
25 Discovery Call Questions
 
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
Modern Prospecting Techniques for Connecting with Prospects (from Sales Hacke...
 
Class 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your BusinessClass 1: Email Marketing Certification course: Email Marketing and Your Business
Class 1: Email Marketing Certification course: Email Marketing and Your Business
 
Behind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot TokyoBehind the Scenes: Launching HubSpot Tokyo
Behind the Scenes: Launching HubSpot Tokyo
 
HubSpot Diversity Data 2016
HubSpot Diversity Data 2016HubSpot Diversity Data 2016
HubSpot Diversity Data 2016
 
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
Why People Block Ads (And What It Means for Marketers and Advertisers) [New R...
 
What is Inbound Recruiting?
What is Inbound Recruiting?What is Inbound Recruiting?
What is Inbound Recruiting?
 
3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies3 Proven Sales Email Templates Used by Successful Companies
3 Proven Sales Email Templates Used by Successful Companies
 
Add the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-ThonAdd the Women Back: Wikipedia Edit-a-Thon
Add the Women Back: Wikipedia Edit-a-Thon
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 

Similar to Road to Analytics

Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptx
shivani22y
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Databricks
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
Amine Sagaama
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
Julian Hyde
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Apache spark
Apache sparkApache spark
Apache spark
TEJPAL GAUTAM
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
DataStax Academy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
DataFactZ
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
Ramaninder Singh Jhajj
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19c
RachelBarker26
 

Similar to Road to Analytics (20)

Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptx
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Spark core
Spark coreSpark core
Spark core
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
SQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19cSQL Performance Tuning and New Features in Oracle 19c
SQL Performance Tuning and New Features in Oracle 19c
 

More from Datio Big Data

Búsqueda IA
Búsqueda IABúsqueda IA
Búsqueda IA
Datio Big Data
 
Descubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDescubriendo la Inteligencia Artificial
Descubriendo la Inteligencia Artificial
Datio Big Data
 
Learning Python. Level 0
Learning Python. Level 0Learning Python. Level 0
Learning Python. Level 0
Datio Big Data
 
Learn Python
Learn PythonLearn Python
Learn Python
Datio Big Data
 
How to document without dying in the attempt
How to document without dying in the attemptHow to document without dying in the attempt
How to document without dying in the attempt
Datio Big Data
 
Developers on test
Developers on testDevelopers on test
Developers on test
Datio Big Data
 
Ceph: The Storage System of the Future
Ceph: The Storage System of the FutureCeph: The Storage System of the Future
Ceph: The Storage System of the Future
Datio Big Data
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
Datio Big Data
 
Datio OpenStack
Datio OpenStackDatio OpenStack
Datio OpenStack
Datio Big Data
 
Quality Assurance Glossary
Quality Assurance GlossaryQuality Assurance Glossary
Quality Assurance Glossary
Datio Big Data
 
Data Integration
Data IntegrationData Integration
Data Integration
Datio Big Data
 
Gamification: from buzzword to reality
Gamification: from buzzword to realityGamification: from buzzword to reality
Gamification: from buzzword to reality
Datio Big Data
 
Pandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationPandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data Manipulation
Datio Big Data
 

More from Datio Big Data (13)

Búsqueda IA
Búsqueda IABúsqueda IA
Búsqueda IA
 
Descubriendo la Inteligencia Artificial
Descubriendo la Inteligencia ArtificialDescubriendo la Inteligencia Artificial
Descubriendo la Inteligencia Artificial
 
Learning Python. Level 0
Learning Python. Level 0Learning Python. Level 0
Learning Python. Level 0
 
Learn Python
Learn PythonLearn Python
Learn Python
 
How to document without dying in the attempt
How to document without dying in the attemptHow to document without dying in the attempt
How to document without dying in the attempt
 
Developers on test
Developers on testDevelopers on test
Developers on test
 
Ceph: The Storage System of the Future
Ceph: The Storage System of the FutureCeph: The Storage System of the Future
Ceph: The Storage System of the Future
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
 
Datio OpenStack
Datio OpenStackDatio OpenStack
Datio OpenStack
 
Quality Assurance Glossary
Quality Assurance GlossaryQuality Assurance Glossary
Quality Assurance Glossary
 
Data Integration
Data IntegrationData Integration
Data Integration
 
Gamification: from buzzword to reality
Gamification: from buzzword to realityGamification: from buzzword to reality
Gamification: from buzzword to reality
 
Pandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data ManipulationPandas: High Performance Structured Data Manipulation
Pandas: High Performance Structured Data Manipulation
 

Recently uploaded

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理


Road to Analytics

  • 2. Contents: 1. SAS vs Spark 2. SAS Proc SQL vs Spark SQL 3. Advantage Analytics
  • 3. 1. SAS vs Spark
  • 4. OVERVIEW SAS ○ The largest independent vendor in “advanced analytics” ○ SAS Institute founded in 1976 in Cary, North Carolina ○ Commercial software product SPARK ○ A fast and general engine for large-scale data processing ○ Started in 2009 as a research project at UC Berkeley's AMPLab ○ Open source
  • 5. CODE SAS Basic programming model consists of code blocks: ○ SAS Data Step ■ generation of data ■ concatenation of data ○ SAS PROCedures ■ special functionalities SPARK “Line based” programming Native Language is Scala, but flexible programming model: ○ Scala ○ Java ○ Python ○ R
  • 6. DATA SAS: DATASET ○ Computed in memory (RAM) ○ A data set contains: ● observations: organized in rows ● variables: organized in columns SPARK: DATAFRAME ○ A distributed collection of data organized into named columns. ○ It is conceptually equivalent to a table in a relational database or a data frame in R/Python ○ It is a programming abstraction
  • 7. IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE Transformations like map, filter, union, join, group by… result in another dataset
  • 8. SAS: data sasData; set sasData; Fare2 = Fare + 2; run; Python Pandas: pandasDF['Fare2'] = pandasDF['Fare']+2 Spark: sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare']+2) NOTEBOOK IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
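The Spark line reassigns `sparkDF` because a transformation does not change the DataFrame in place; it returns a new one. A toy model of that behaviour in plain Python (not Spark itself; `with_column` is a hypothetical helper and the fare values are made up) might look like this:

```python
def with_column(rows, name, fn):
    """Return a NEW tuple of rows with one extra column; the input is left untouched."""
    return tuple({**row, name: fn(row)} for row in rows)

# A tiny "dataset": a tuple of rows, each row a dict of column -> value.
original = ({"Fare": 7.25}, {"Fare": 60.0})

# The transformation yields another dataset, mirroring
# sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)
derived = with_column(original, "Fare2", lambda row: row["Fare"] + 2)

print(original[0])  # still has only the Fare column
print(derived[0])   # has both Fare and Fare2
```

Without the reassignment, the result of the transformation would simply be discarded and the original dataset would be used unchanged.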
  • 9. READ SAS DATASETS The SAS file (sas7bdat) is a binary file with a SAS-specific structure, created by SAS ● PYTHON: SAS7BDAT PACKAGE ● R: HAVEN LIBRARY
  • 10. 2. SAS Proc SQL vs Spark SQL
  • 11. SQL statements SAS PROC SQL SAS procedure that combines the functionality of DATA and PROC steps. It can sort, summarize, subset, join, concatenate datasets, create new variables... Spark SQL ○ Spark’s interface for working with structured and semi-structured data, queried using SQL ○ Load data from JSON, Hive, Parquet ○ Evaluated “lazily”
  • 12. SQL statements SAS PROC SQL: PROC SQL; CREATE TABLE newTable AS SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns; QUIT; Spark SQL (pyspark): from pyspark.sql import SQLContext; sqlContext = SQLContext(sc); newTable = sqlContext.sql("SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns") NOTEBOOK
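To make the shared query shape concrete, here is a minimal runnable sketch. It assumes Python's standard-library `sqlite3` as a stand-in for a Spark session so it runs without a cluster; the table `passengers`, the column `Pclass`, and all values are hypothetical. The SELECT text is the part that carries over to `sqlContext.sql(...)`.

```python
import sqlite3

# In-memory database standing in for a registered Spark table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (Pclass INTEGER, Fare REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [(1, 80.0), (1, 60.0), (2, 20.0), (3, 8.0), (3, 5.0)],
)

# Same pattern as the slide: filter rows, group them, store the result as a new table.
conn.execute("""
    CREATE TABLE newTable AS
    SELECT Pclass, COUNT(*) AS n
    FROM passengers
    WHERE Fare > 7
    GROUP BY Pclass
""")

result = conn.execute("SELECT Pclass, n FROM newTable ORDER BY Pclass").fetchall()
print(result)
```

The row with `Fare` 5.0 is dropped by the WHERE clause before grouping, so class 3 is counted once.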
  • 13. AGGREGATE FUNCTION IN SPARK SQL sum, avg, mean, count, max, min, first, last, stddev, variance, skewness, kurtosis… Aggregate functions act on each group of data and return a single value per group as a result
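The "one value per group" behaviour can be sketched as follows, again with `sqlite3` standing in for Spark SQL (an assumption made so the example is self-contained). Only aggregates that exist in both engines are used; the table `fares` and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fares (Pclass INTEGER, Fare REAL)")
conn.executemany(
    "INSERT INTO fares VALUES (?, ?)",
    [(1, 80.0), (1, 60.0), (3, 8.0), (3, 6.0), (3, 10.0)],
)

# Five input rows collapse into one output row per Pclass group.
rows = conn.execute("""
    SELECT Pclass,
           COUNT(*)  AS n,
           SUM(Fare) AS total_fare,
           AVG(Fare) AS mean_fare,
           MIN(Fare) AS min_fare,
           MAX(Fare) AS max_fare
    FROM fares
    GROUP BY Pclass
    ORDER BY Pclass
""").fetchall()

for row in rows:
    print(row)
```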
  • 14. WINDOW FUNCTION IN SPARK SQL Ranking: rank, dense_rank, percent_rank, ntile, row_number Analytics: cume_dist, lag, first_value, last_value, lead Aggregate: aggregate funcs Calculate a return value over a set of rows, called a window, that are related to the current row NOTEBOOK
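Unlike an aggregate, a window function keeps every input row and adds a value computed over the related rows. A small sketch with one ranking function (`rank`) and one analytic function (`lag`), assuming `sqlite3` as a stand-in engine (window functions need SQLite 3.25 or later) and a hypothetical `fares` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fares (Name TEXT, Pclass INTEGER, Fare REAL)")
conn.executemany(
    "INSERT INTO fares VALUES (?, ?, ?)",
    [("A", 1, 80.0), ("B", 1, 60.0), ("C", 3, 10.0), ("D", 3, 8.0), ("E", 3, 6.0)],
)

# The window is "rows of the same Pclass, ordered by Fare descending".
rows = conn.execute("""
    SELECT Name, Pclass, Fare,
           RANK()    OVER (PARTITION BY Pclass ORDER BY Fare DESC) AS fare_rank,
           LAG(Fare) OVER (PARTITION BY Pclass ORDER BY Fare DESC) AS prev_fare
    FROM fares
    ORDER BY Pclass, fare_rank
""").fetchall()

for row in rows:
    print(row)
```

Five rows go in and five rows come out; `prev_fare` is NULL (`None` in Python) for the first row of each partition because there is no preceding row in its window.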
  • 15. EXTEND SPARK SQL More than 100 standard functions are available (pyspark): from pyspark.sql.functions import *
  • 16. BUILT-IN FUNCTIONS, UDFs UDF: “User Defined Function”. Define new Column-based functions that extend the vocabulary of Spark. A UDF takes a single row as input and returns a single value for every input row NOTEBOOK
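The row-in, value-out contract of a UDF can be illustrated without Spark. This sketch registers a Python function with `sqlite3` (a stand-in engine, by assumption) and calls it from SQL; `fare_band`, its threshold of 50, and the data are hypothetical. In PySpark the corresponding registration goes through `pyspark.sql.functions.udf`.

```python
import sqlite3

def fare_band(fare):
    """Scalar UDF: one input value per row, one output value per row."""
    if fare is None:  # pass missing values through instead of failing
        return None
    return "high" if fare >= 50 else "low"

conn = sqlite3.connect(":memory:")
conn.create_function("fare_band", 1, fare_band)  # name, number of arguments, callable

conn.execute("CREATE TABLE fares (Fare REAL)")
conn.executemany("INSERT INTO fares VALUES (?)", [(8.0,), (60.0,), (None,)])

rows = conn.execute(
    "SELECT Fare, fare_band(Fare) FROM fares ORDER BY Fare"
).fetchall()
print(rows)
```

The function is invoked once per row, which is why a built-in function is usually the better choice when one already covers the need.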
  • 17. TIPS ○ Don't assume the data is sorted: in parallel processing we can’t access rows one by one. ○ Cache tables/DFs when they are used more than once ○ Merging doesn’t need sorted data, unlike in SAS ○ Use functions already defined instead of creating your own UDF ○ Save data in a columnar format such as Parquet ○ Avoid collecting data when you are working with Big Data; take a sample instead
  • 19. ADVANTAGE ANALYTICS SAS Stats: traditional add-on package to SAS for statistics ○ Analysis of variance ○ Bayesian analysis ○ Categorical data analysis ○ Distribution analysis ○ Mixed models ○ Predictive modeling... Spark MLlib: scalable machine learning library ○ Basic statistics ○ Classification and regression ○ Collaborative filtering ○ Clustering ○ Dimensionality reduction ○ Feature extraction and transformation...