Big Data Sets Seen as a Big Problem and How to Deal with Them
Frankfurt 2018
Daniil Shliakhov, Kharkiv, Ukraine
INTRO
[Figure: running time for normal-size data sets versus large data sets]
Running time is an issue!
Parameter                     Treatment     n     Mean    SD       Median   Min   Max
Alkaline Phosphatase (U/L)
  Baseline                    Pooled TRT1   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
                              Pooled TRT2   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
  Cycle 1                     Pooled TRT1   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
                              Pooled TRT2   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
  Cycle 2                     Pooled TRT1   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
                              Pooled TRT2   xxx   xxx.x   xxx.xx   xxx.xx   xxx   xxx
GENERAL TIPS
RETRIEVING SAS DATA SETS
VIEW option
data adlb;
  set adam.adlb;
run;
A simple DATA step? How long might this step take to run if ADAM.ADLB is huge?
Real time   34.23 s
CPU time    3.10 s
35 seconds? Too looooong :(
data adlb / VIEW=adlb;
  set adam.adlb;
run;
A SAS view is a type of SAS data set that retrieves data values from other files. It stores no data itself, only the instructions needed to retrieve them.
Real time   0.03 s
CPU time    0.01 s
Less than 1 second? MAGIC! This step only defines the view; the data are read later, when another step uses the view.
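The timing above can be put in context with a minimal sketch: defining a DATA step view only compiles and stores the program, and the rows of ADAM.ADLB are read when a later step uses the view. The sketch assumes the ADAM libref is already assigned; the choice of PROC PRINT and the OBS= value are illustrative only and are not taken from the slides.

```sas
/* Define the view: no rows are read from ADAM.ADLB at this point. */
data adlb / view=adlb;
  set adam.adlb;
run;

/* Write the stored source program of the view to the SAS log. */
data view=adlb;
  describe;
run;

/* The view executes here, when a step reads it.          */
/* OBS=5 is an illustrative limit, not part of the slides. */
proc print data=adlb(obs=5);
run;
```

The cost of reading the data therefore shows up in the log of the step that consumes the view, not in the step that defines it.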
data analysis;
  merge adam.adsl adam.adlb;
  by studyid usubjid;
run;
proc sort data=analysis;
  by trt01an parcat paramcd avisitn;
run;
Simple merge, common sort… How long?
            DATA step   PROC step
Real time   36.04 s     1 min 12.40 s
CPU time    8.18 s      12.45 s
Simple merge, common sort… Too looooong again
data analysis / VIEW=analysis;
  merge adam.adsl adam.adlb;
  by studyid usubjid;
run;
proc sort data=analysis out=analysis_sorted;
  by trt01an parcat paramcd avisitn;
run;
Adding the VIEW option to do magic
            DATA step   PROC step
Real time   0.04 s      1 min 17.65 s
CPU time    0.02 s      20.32 s
The DATA step now only defines the view; the merge itself runs when PROC SORT reads it, and the intermediate ANALYSIS data set is never written to disk.
VIEW option gives extra time to drink coffee with colleagues
GENERAL TIPS
RETRIEVING SAS DATA SETS
IF or WHERE?
data adlb;
  set adam.adlb;
  if ANL01FL = 'Y';
run;
data adlb;
  set adam.adlb;
  where ANL01FL = 'Y';
run;
IF vs. WHERE. Who is the champion?
            IF statement   WHERE statement
Real time   31.64 s        33.31 s
CPU time    3.53 s         5.68 s
IF is champion! Woohoo!
INDEX APPLIED
            IF statement   WHERE statement
Real time   32.15 s        27.26 s
CPU time    4.28 s         2.98 s
INDEX helps WHERE to win :)
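The slides do not show how the index was built, so the following is only a sketch of one common way to do it. It assumes a simple index on ANL01FL and write access to the ADAM library, both of which are assumptions. MSGLEVEL=I asks SAS to note in the log whether an index was used for WHERE processing.

```sas
/* Ask SAS to report index usage in the log (INFO messages). */
options msglevel=i;

/* Create a simple index on the flag variable.               */
/* Assumption: the ADAM library is writable and ANL01FL is    */
/* the indexed variable; the slides do not state either.     */
proc datasets library=adam nolist;
  modify adlb;
  index create anl01fl;
quit;

/* WHERE processing can use the index; a subsetting IF cannot, */
/* because IF is evaluated after each observation is read.     */
data adlb;
  set adam.adlb;
  where anl01fl = 'Y';
run;
```

SAS decides for itself whether using the index is worthwhile, so when most records satisfy the condition it may still read the data set sequentially; the log note shows which path was taken.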
SAS PROCEDURES
DESCRIPTIVE STATS
proc means data=adlb noprint;
  by trt01an parcat paramcd avisitn;
  var aval;
  output out = mnout
         n = n
         mean = mean
         median = median
         std = std
         min = min
         max = max;
run;
proc univariate data=adlb noprint;
  by trt01an parcat paramcd avisitn;
  var aval;
  output out = mnout
         n = n
         mean = mean
         median = median
         std = std
         min = min
         max = max;
run;
proc summary data=adlb noprint;
  by trt01an parcat paramcd avisitn;
  var aval;
  output out = mnout
         n = n
         mean = mean
         median = median
         std = std
         min = min
         max = max;
run;
proc sql noprint;
  create table mnout as
    /* COUNT(*) counts all rows; COUNT(aval) would match N from */
    /* PROC MEANS when aval can be missing.                     */
    select trt01an, parcat, paramcd, avisitn,
           COUNT(*) as n,
           MEAN(aval) as mean,
           MEDIAN(aval) as median,
           STD(aval) as std,
           MIN(aval) as min,
           MAX(aval) as max
    from adlb
    group by trt01an, parcat, paramcd, avisitn;
quit;
DESCRIPTIVE STATS COMPARISON
            MEANS     UNIVARIATE   SUMMARY   SQL
Real time   15.14 s   24.78 s      13.24 s   13.45 s
CPU time    3.38 s    1.76 s       3.33 s    2.53 s
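Before choosing a procedure on speed alone, it may be worth confirming that the approaches return the same statistics. The sketch below is one possible check with PROC COMPARE. The data set names MNOUT_MEANS and MNOUT_SQL are hypothetical (the slides write every result to MNOUT), and the tolerance is an arbitrary illustrative value.

```sas
/* Hypothetical names: MNOUT_MEANS and MNOUT_SQL are assumed to hold */
/* the PROC MEANS and PROC SQL results saved under separate names.  */

/* PROC SQL does not guarantee row order, so sort before comparing. */
proc sort data=mnout_sql;
  by trt01an parcat paramcd avisitn;
run;

/* Match rows on the grouping variables and compare the statistics. */
/* CRITERION= is an illustrative tolerance for floating-point noise. */
proc compare base=mnout_means compare=mnout_sql
             method=absolute criterion=1e-8 listvar;
  id trt01an parcat paramcd avisitn;
  var n mean median std min max;
run;
```

If AVAL contains missing values, a difference in N is expected here, because COUNT(*) counts every row while the N statistic counts only nonmissing values.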
SAS PROCEDURES
FREQUENCY
proc freq data=adlb noprint;
  by trt01an parcat paramcd;
  tables avisitn / out=frout;
run;
proc summary data=adlb nway noprint;
  by trt01an parcat paramcd avisitn;
  output out=frout;
run;
proc sql noprint;
  create table frout as
    select trt01an, parcat, paramcd, avisitn,
           COUNT(*) as count
    from adlb
    group by trt01an, parcat, paramcd, avisitn;
quit;
FREQUENCY COMPARISON
            FREQ      SQL       SUMMARY
Real time   13.62 s   12.19 s   12.02 s
CPU time    2.04 s    1.63 s    0.88 s
CONCLUSIONS
Do not be afraid to work with big data sets.
Just choose the “right” procedure!
Daniil Shliakhov
daniil.shlyakhov@intego-group.com
Kharkiv, Ukraine
www.intego-group.com
THANK YOU
