Big Data Sets Seen as a Big Problem and How to
Deal with Them
Frankfurt 2018
Daniil Shliakhov, Kharkiv, Ukraine
INTRO
[Chart: running time for normal size data sets vs. large data sets]
Running time is an issue!
Parameter                     Treatment     n     Mean    SD       Median   Min   Max
Alkaline Phosphatase (U/L)
  Baseline                    Pooled TRT1   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
                              Pooled TRT2   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
  Cycle 1                     Pooled TRT1   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
                              Pooled TRT2   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
  Cycle 2                     Pooled TRT1   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
                              Pooled TRT2   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
GENERAL TIPS
RETRIEVING SAS DATA SETS
VIEW option
data adlb;
set adam.adlb;
run;
A simple data step?
How much time it may take to run this step if ADAM.ADLB is huge?
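One way to answer that question is to let SAS report the timing itself. The sketch below is illustrative only: it assumes the ADAM libref is already assigned, and the macro variable names t0 and elapsed are hypothetical. SAS writes real time and CPU time to the log after each step; FULLSTIMER adds more detail to that note.

```sas
/* Ask SAS for detailed resource statistics in the log after each step */
options fullstimer;

/* Hypothetical macro variable: wall-clock time before the step starts */
%let t0 = %sysfunc(datetime());

data adlb;
    set adam.adlb;
run;

/* Elapsed wall-clock seconds, computed with floating-point macro arithmetic */
%let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
%put NOTE: Elapsed time: &elapsed seconds;
```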
Real time 34.23 seconds
CPU time 3.10 seconds
35 seconds? Too looooong :(
data adlb / VIEW=adlb;
set adam.adlb;
run;
A SAS view is a type of SAS data set that retrieves data values from other files when it is read, instead of storing a copy of them.
Real time 0.03 seconds
CPU time 0.01 seconds
Less than 1 second? MAGIC!
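The near-zero timing has a simple explanation: creating a DATA step view only compiles and stores the view definition. The rows of ADAM.ADLB are read later, when another step uses the view. A minimal sketch of that behaviour follows; the view name adlb_v and the PROC MEANS call are illustrative assumptions, not taken from the slides.

```sas
/* Only the view definition is compiled and stored here; no rows are copied */
data adlb_v / view=adlb_v;
    set adam.adlb;
run;

/* The rows of ADAM.ADLB are read at this point, when the view is consumed */
proc means data=adlb_v n mean;
    var aval;
run;
```

The cost of reading the large data set therefore shows up in the timing of the consuming step, not in the step that defines the view.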
data analysis;
merge adam.adsl adam.adlb;
by studyid usubjid;
run;
proc sort data=analysis;
by trt01an parcat paramcd avisitn;
run;
Simple merge, common sort… How long?
            Data Step        PROC step
Real time   36.04 seconds    1:12.40 (min:sec)
CPU time    8.18 seconds     12.45 seconds
Simple merge, common sort… Too looooong again
data analysis / VIEW=analysis;
merge adam.adsl adam.adlb;
by studyid usubjid;
run;
proc sort data=analysis out=analysis_sorted;
by trt01an parcat paramcd avisitn;
run;
Adding VIEW option to do magic
            Data Step        PROC step
Real time   0.04 seconds     1:17.65 (min:sec)
CPU time    0.02 seconds     20.32 seconds
VIEW option gives extra time to drink coffee with colleagues
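To see where the saving comes from, it helps to add the two steps together. With the view, the merge runs inside PROC SORT, so the PROC step gets slightly slower while the intermediate data set is never written. The sketch below does the arithmetic, assuming the slide timings read as 36.04 and 0.04 seconds for the DATA steps and 1:12.40 and 1:17.65 (min:sec) for the PROC steps; the variable names are hypothetical.

```sas
data _null_;
    /* Without the view: DATA step plus PROC SORT, in seconds */
    total_no_view = 36.04 + (60 + 12.40);
    /* With the view: view definition plus PROC SORT, in seconds */
    total_view = 0.04 + (60 + 17.65);
    /* Net real-time saving from skipping the intermediate data set */
    saved = total_no_view - total_view;
    put total_no_view= total_view= saved=;
run;
```

Under that reading the two steps take roughly 108 seconds without the view and roughly 78 seconds with it.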
GENERAL TIPS
RETRIEVING SAS DATA SETS
IF or WHERE?
data adlb;
set adam.adlb;
if ANL01FL = 'Y';
run;
data adlb;
set adam.adlb;
where ANL01FL = 'Y';
run;
IF vs. WHERE. Who is the champion?
            IF statement     WHERE statement
Real time   31.64 seconds    33.31 seconds
CPU time    3.53 seconds     5.68 seconds
IF is champion! Woohoo!
INDEX APPLIED
            IF statement     WHERE statement
Real time   32.15 seconds    27.26 seconds
CPU time    4.28 seconds     2.98 seconds
INDEX helps WHERE to win :)
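The slides do not show how the index was created. One possible way, sketched here under the assumption that you have write access to the ADAM library, is a simple index on ANL01FL built with PROC DATASETS. WHERE processing can use an index to read only the qualifying observations, while a subsetting IF always reads every observation first.

```sas
/* Create a simple index on the filter variable (assumes write access to ADAM) */
proc datasets library=adam nolist;
    modify adlb;
    index create anl01fl;
quit;

/* Ask SAS to note in the log whether an index was used */
options msglevel=i;

/* WHERE may use the index; SAS decides based on how selective the condition is */
data adlb;
    set adam.adlb;
    where anl01fl = 'Y';
run;
```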
SAS PROCEDURES
DESCRIPTIVE STATS
proc means data=adlb noprint;
by trt01an parcat paramcd avisitn;
var aval;
output out = mnout
n = n
mean = mean
median = median
std = std
min = min
max = max;
run;
proc univariate data=adlb noprint;
by trt01an parcat paramcd avisitn;
var aval;
output out = mnout
n = n
mean = mean
median = median
std = std
min = min
max = max;
run;
proc summary data=adlb noprint;
by trt01an parcat paramcd avisitn;
var aval;
output out = mnout
n = n
mean = mean
median = median
std = std
min = min
max = max;
run;
proc sql noprint;
create table mnout as
select trt01an, parcat, paramcd, avisitn,
COUNT(aval) as n, /* nonmissing AVAL only, matching N in PROC MEANS */
MEAN(aval) as mean,
MEDIAN(aval) as median, /* MEDIAN as an aggregate needs SAS 9.4 or later */
STD(aval) as std,
MIN(aval) as min,
MAX(aval) as max
from adlb
group by trt01an, parcat, paramcd, avisitn;
quit;
DESCRIPTIVE STATS COMPARISON
            MEANS           UNIVARIATE      SUMMARY         SQL
Real time   15.14 seconds   24.78 seconds   13.24 seconds   13.45 seconds
CPU time    3.38 seconds    1.76 seconds    3.33 seconds    2.53 seconds
SAS PROCEDURES
FREQUENCY
proc freq data=adlb noprint;
by trt01an parcat paramcd;
tables avisitn / out=frout;
run;
proc summary data=adlb nway noprint;
by trt01an parcat paramcd avisitn;
output out=frout;
run;
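NWAY only has an effect when CLASS variables are present, so with a BY statement the step above relies on ADLB already being sorted by the BY variables. A possible variant, shown only as a sketch, uses CLASS instead, which does not need presorted input. The output name frout_class and the renamed column count are hypothetical; MISSING keeps groups with missing class values, as BY processing does.

```sas
/* CLASS does not require sorted input; NWAY keeps only the full cross-classification */
proc summary data=adlb nway missing;
    class trt01an parcat paramcd avisitn;
    /* _FREQ_ holds the number of observations in each group */
    output out=frout_class (drop=_type_ rename=(_freq_=count));
run;
```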
proc sql noprint;
create table frout as
select trt01an, parcat, paramcd, avisitn,
COUNT(*) as count
from adlb
group by trt01an, parcat, paramcd, avisitn;
quit;
FREQUENCY COMPARISON
            FREQ            SQL             SUMMARY
Real time   13.62 seconds   12.19 seconds   12.02 seconds
CPU time    2.04 seconds    1.63 seconds    0.88 seconds
CONCLUSIONS
Do not be afraid to work with big data sets.
Just choose the “right” procedure!
Daniil Shliakhov
daniil.shlyakhov@intego-group.com
Kharkiv, Ukraine
www.intego-group.com
THANK YOU
