This document describes using statistical and machine-learning methods to analyze big data from Intel's data centers and classify computing jobs by expected runtime. It covers defining the problem, the available data on past jobs, exploring the runtime distribution, constructing classes with a Gaussian mixture model, and estimating the model parameters with the EM algorithm. The goal is to improve job scheduling by routing short and long jobs to different queues.
1. The big-data analytics challenge – combining statistical and algorithmic perspectives
Anat Reiner-Benaim
Department of Statistics, University of Haifa
IDC, May 14, 2015
2. Outline
Data science:
◦ Definition?
◦ Who needs it?
◦ The elements of data science
Analysis:
◦ Modeling
◦ Software
Examples:
◦ Scheduling – prediction of runtime
◦ Genetics – detection of rare events
3. What is Data Science?
From Wikipedia: "Data science is the study of the generalizable extraction of knowledge from data…"
4. More from Wikipedia:
"…builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing and high performance computing…
…goal: extracting meaning from data and creating data products…
…not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science."
5. Data Science – who needs it?
Anyone who has (big) data, e.g.:
Cellular industry – phones, apps, advertisers
Internet – search engines, social media, marketing, advertisers
Computer networks and server systems
Cyber security
Credit cards
Banks
Health care providers
Life science – genome, proteome…
TV and related
Weather forecast
6. The elements of data science
Store & preprocess ("big data technologies"):
• NoSQL database (e.g. Cassandra)
• DFS (Distributed File System) (e.g. Hadoop, Spark, GraphLab)
Dump to SQL:
• SQL database (e.g. MySQL, SAS-SQL)
Analyze ("big data analytics"):
• Apply sophisticated methods: statistical modeling, machine learning algorithms
7. Data Analysis – First, define the problem
◦ How can I decide that an item in a manufacturing process is faulty?
◦ What is the difference between the new machine and the old one?
◦ What are the factors that affect system load?
◦ How can I predict the memory/runtime of a program?
◦ How can I predict that a customer will churn?
◦ What is the chance that the phone/web user will click my advertisement?
◦ What is the chance that the current ATM user is committing fraud?
◦ What is the chance of snow this week?
9. Choosing models
Type of variables:
◦ Continuous, ordinal, categorical.
Statistical assumptions:
◦ Normality, equal variance, independence.
Missing data
Stability
10. Learning tools
Bootstrap
◦ Repeatedly fit the model on resampled data.
Bagging ("bootstrap aggregation")
◦ Combine bootstrap samples to prevent instability.
Boosting
◦ Combine a set of weak learners into a single strong learner.
Regularization
◦ Solve over-fitting by restriction (e.g. limit regression to linear or low-degree polynomial).
Utility/cost function
◦ Evaluate performance, compare models.
These are typically iterative procedures, combined with the modeling procedures; they help optimize the model and evaluate its performance.
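To make the bootstrap idea above concrete, here is a minimal R sketch (not from the deck; the runtimes are simulated) that resamples the data repeatedly to gauge the variability of a statistic:

```r
# Minimal bootstrap sketch: repeatedly recompute a statistic on
# resampled data (simulated runtimes; illustrative only).
set.seed(42)
x <- rexp(200, rate = 0.1)                  # hypothetical job runtimes

boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

sd(boot_means)                              # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))       # percentile confidence interval
```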
11. More to consider – control statistical error due to large-scale analysis
Multiple statistical tests lead to an inflated statistical error. Control the FDR?
FDR = expected proportion of false findings (e.g. "features").
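Base R's p.adjust implements the Benjamini-Hochberg FDR adjustment; a small sketch with simulated p-values (an assumption, purely illustrative):

```r
# FDR control over many simultaneous tests via Benjamini-Hochberg.
set.seed(7)
p_null   <- runif(950)                  # tests with no real effect
p_signal <- rbeta(50, 0.5, 10)          # tests with a real effect
pvals    <- c(p_null, p_signal)

padj <- p.adjust(pvals, method = "BH")  # BH-adjusted p-values
sum(pvals < 0.05)                       # naive discoveries (inflated)
sum(padj  < 0.05)                       # discoveries at FDR 5%
```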
12. The R software
Open-source programming language and environment for statistical computing.
Widely used among statisticians for developing statistical software ("packages") and for data analysis.
Increasingly popular among all data professionals.
Advantages:
• Contains the most up-to-date statistical models and machine learning algorithms.
• Methods are based on research, compiled and documented.
• Contains Hadoop functions (package "rhdfs").
• Very convenient for plain programming, scripting, simulations, visualization.
• Friendly interface (e.g. RStudio).
The R project site
14. Example 1: Classification of Job Runtime at Intel
Joint work with: Anna Grabarnick, University of Haifa; Edi Shmueli, Intel
15. Job processing
[Diagram: users submit jobs to a job scheduler, which decides which server and which queue each job is assigned to.]
16. Job schedulers
Algorithms aimed at efficiently queuing and distributing jobs among servers, thereby improving system utilization.
Popular scheduling algorithms (e.g. backfilling) use information on how long the jobs are expected to run.
In serial job systems, scheduling performance can be improved by merely separating the short jobs from the long ones and assigning them to different queues in the system.
This helps reduce the likelihood that short jobs will be delayed behind long ones, and thus improves overall performance.
18. The problem
Main purpose: classify jobs into "short" and "long" durations.
Questions:
◦ How can the classes be defined?
◦ How can the jobs be classified?
19. Available data
Two traces obtained from one of Intel's data centers:
1. ~1 million jobs executed during a period of 10 consecutive days. Used for training.
2. ~755,000 jobs executed during a period of 7 consecutive days. Used for model validation.
Aside from runtime information, 9 categorical variables were available:
20. TABLE I. ROUGH GROUPING OF THE 9 CATEGORICAL VARIABLES

Group   # of variables   Relates to                       Example
A       3                Scheduling information           Resources requested by the job
B       2                Execution-specific information   Command line and arguments
C       4                Association information          Project and component

TABLE II. STATISTICS REGARDING THE CATEGORICAL VARIABLES

Variable   # of categories   # of missing (in training data)
A1         9                 0
A2         7                 0
A3         5                 0
B1         44                173
B2         22                184
C1         2                 0
C2         5                 239
C3         6                 184
C4         32                0
21. Analysis steps
Exploratory visualization of the data.
Class construction and characterization.
Classification:
◦ Choice of a classification model.
◦ Optimize the model.
◦ Validate the model.
26. Constructing classes by the mixture model
• The Gaussian (normal) mixture model has the form
  f(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m),
  with mixing proportions \alpha_m, \sum_{m=1}^{M} \alpha_m = 1.
• Each Gaussian density has a mean \mu_m and covariance matrix \Sigma_m.
• The parameters are usually estimated by maximum likelihood using the EM algorithm.
27. Mixture distribution – parameter estimation
• The parameters are usually estimated by maximum likelihood using the EM algorithm.
• We model the runtime Y as a mixture of the two normal variables Y_1 \sim N(\mu_1, \sigma_1^2) and Y_2 \sim N(\mu_2, \sigma_2^2) (the "short" and "long" components). Y can be defined by
  Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2,
  where \Delta \in \{0, 1\} with P(\Delta = 1) = \pi.
• Let \phi_\theta(x) denote the normal density with parameters \theta = (\mu, \sigma^2). Then the density of Y is
  g_Y(y) = (1 - \pi)\, \phi_{\theta_1}(y) + \pi\, \phi_{\theta_2}(y).
• We fit this model to our data by maximum likelihood. The parameters are
  \theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2).
• The log-likelihood based on N training cases is
  l(\theta; Z) = \sum_{i=1}^{N} \log\left[ (1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i) \right].
28. Parameter estimation – cont'd
• Direct maximization of l(\theta; Z) is quite difficult numerically. Instead, we consider unobserved latent variables \Delta_i taking values 0 or 1 as earlier: if \Delta_i = 1 then Y_i comes from distribution 2, otherwise it comes from distribution 1.
• Suppose we knew the values of the \Delta_i's. Then the log-likelihood would be
  l(\theta; Z, \Delta) = \sum_{i=1}^{N} \left[ (1 - \Delta_i) \log \phi_{\theta_1}(y_i) + \Delta_i \log \phi_{\theta_2}(y_i) \right] + \sum_{i=1}^{N} \left[ (1 - \Delta_i) \log(1 - \pi) + \Delta_i \log \pi \right],
  and the maximum likelihood estimates of \mu_1 and \sigma_1^2 would be the sample mean and the sample variance of the observations with \Delta_i = 0. Similarly, the estimates for \mu_2 and \sigma_2^2 would be the sample mean and the sample variance of the observations with \Delta_i = 1.
29. Parameter estimation – cont'd
• Since the \Delta_i values are actually unknown, we proceed in an iterative fashion, substituting for each \Delta_i in the previous equation its expected value
  \gamma_i(\theta) = E[\Delta_i \mid \theta, Z] = P(\Delta_i = 1 \mid \theta, Z),
  which is also called the responsibility of model 2 for observation i.
• We use the following procedure, known as the EM algorithm, for the two-component Gaussian mixture:
  1. Take initial guesses for the parameters \pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2 (see below).
  2. Expectation step: compute the responsibilities
     \gamma_i = \frac{\pi\, \phi_{\theta_2}(y_i)}{(1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i)}, \quad i = 1, 2, \dots, N.
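The extracted slides stop at the E-step (the slide with the maximization step is missing), so the sketch below fills in the standard M-step for the two-component Gaussian EM in plain R. Treat it as an illustration of the formulas above, not the authors' code:

```r
# Two-component Gaussian mixture fitted by EM, following the slides.
em_2gauss <- function(y, n_iter = 200) {
  n <- length(y)
  # 1. Initial guesses as suggested on the next slide: two random
  #    observations for the means, the overall variance for both sigmas.
  mu  <- sample(y, 2)
  s2  <- rep(sum((y - mean(y))^2) / n, 2)
  pi2 <- 0.5                                 # P(Delta = 1), component 2
  for (it in seq_len(n_iter)) {
    # 2. E-step: responsibilities gamma_i of component 2
    d1 <- dnorm(y, mu[1], sqrt(s2[1]))
    d2 <- dnorm(y, mu[2], sqrt(s2[2]))
    g  <- pi2 * d2 / ((1 - pi2) * d1 + pi2 * d2)
    # 3. M-step (standard; not on the extracted slides): responsibility-
    #    weighted means, variances, and mixing proportion.
    mu[1] <- sum((1 - g) * y) / sum(1 - g)
    mu[2] <- sum(g * y) / sum(g)
    s2[1] <- sum((1 - g) * (y - mu[1])^2) / sum(1 - g)
    s2[2] <- sum(g * (y - mu[2])^2) / sum(g)
    pi2   <- mean(g)
  }
  list(pi = pi2, mu = mu, sigma2 = s2, gamma = g)
}

# Hypothetical mix of "short" and "long" log-runtimes:
set.seed(1)
y   <- c(rnorm(600, 2, 0.8), rnorm(400, 6, 1.2))
fit <- em_2gauss(y)
fit$pi; fit$mu; fit$sigma2
```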
31. Parameter estimation – additional notes

• A simple choice for initial guesses for $\mu_1$ and $\mu_2$ is two randomly selected observations $y_i$. The overall sample variance $\sum_{i=1}^{N} (y_i - \bar{y})^2 / N$ can be used as an initial guess for both $\sigma_1^2$ and $\sigma_2^2$. The initial mixing proportion $\pi$ can be set to 0.5.
• Software:
The "mixtools" R package was used for the mixture analysis, with the function "normalmixEM" for parameter and posterior probability (responsibility) estimation.
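In practice the fit is a single call. A usage sketch (y is again an assumed runtime vector; the slides do not show the actual call):

library(mixtools)
# Fit a two-component normal mixture to the runtimes
set.seed(1)
fit <- normalmixEM(y, k = 2, lambda = c(0.5, 0.5))
fit$lambda           # estimated mixing proportions
fit$mu               # estimated component means
fit$sigma            # estimated component standard deviations
head(fit$posterior)  # responsibilities: one column per component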
32. • We obtain the following estimates:
[Table: estimated mixture parameters]
33. • Each observation $i$ is assigned a posterior probability of belonging to each class:
$$\frac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}, \qquad i = 1, 2, \ldots, N.$$
• For instance, using a probability threshold of 0.5:
[Pie chart: partition of the runtimes into short (1, 60.56%) and long (2, 39.44%) for threshold 0.5]
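The class assignment behind the pie chart can be reproduced from the posterior matrix of the normalmixEM fit above; a short sketch (the component order is arbitrary, so the "long" component is taken as the one with the larger mean):

# Identify the "long" component as the one with the larger mean
long_col  <- which.max(fit$mu)
class_hat <- ifelse(fit$posterior[, long_col] > 0.5, "long", "short")
# Class proportions, e.g. ~60.56% short / ~39.44% long at threshold 0.5
round(100 * prop.table(table(class_hat)), 2)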
34. Building a Classifier – The Learning Algorithm

1. Fit a model on training data (model/feature selection).
2. Evaluate the model on testing data.
3. Summarize model performance (ROC, misclassification rates, fit: F-test, SSE).
4. Compare models.
5. Validate on the validation set.
6. Optimize on the full data (ROC, pseudo-ROC).
35. The training and testing process

• We use observations that are close to the means (within ±0.5 sd). They include ~450,000 observations (~43%).
• 80% are used for training – finding a classifier (model/feature selection), with sequential procedures for model reduction.
• 20% are used for testing – checking performance.
• After obtaining a classifier – optimize: choose the mixture threshold that maximizes performance on the full dataset. (A sketch of the selection and split follows.)
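A sketch of this selection and split, reusing the mixture fit from above (names are assumed; the paper's exact filtering code is not shown in the slides):

# Keep observations within 0.5 sd of either estimated component mean
near1 <- abs(y - fit$mu[1]) <= 0.5 * fit$sigma[1]
near2 <- abs(y - fit$mu[2]) <= 0.5 * fit$sigma[2]
core  <- which(near1 | near2)
# 80/20 split of the retained observations into training and testing
set.seed(2)
train_idx <- sample(core, size = round(0.8 * length(core)))
test_idx  <- setdiff(core, train_idx)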
36. Classifiers
• Here we choose two classification models:
• logistic regression
• decision trees
• They can both handle:
• Missing data
• Candidate classifying variables that are either continuous or
categorical.
• Categorical variables with many categories
37. Decision trees
• Classification rules are formed by the paths from the root to the leaves.
• No assumptions are made regarding the distribution of predictors.
• Relatively unstable.
• Steps:
• A tree is built by recursive splitting of nodes, until a “maximal” tree is generated.
• “Pruning” – simplifying the tree by cutting off nodes to prevent overfitting.
• Selection of the “optimal” pruned tree – one that fits without overfitting.
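The slides do not show the tree-fitting code; a plausible sketch with R's rpart package, assuming a data frame jobs containing a factor label class and the categorical variables A1...C4 from Table I:

library(rpart)
# Grow a large tree, then prune at the cost-complexity value that
# minimizes the cross-validated error (the "optimal" pruned tree)
tree_full <- rpart(class ~ A1 + A2 + A3 + B1 + B2 + C1 + C2 + C3 + C4,
                   data = jobs[train_idx, ], method = "class",
                   control = rpart.control(cp = 0.0001))
best_cp <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_full, cp = best_cp)

Note that rpart handles missing predictor values via surrogate splits, matching the requirement on the previous slide.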
38. Logistic regression
• Regression used to predict the outcome of a binary variable (like “short” or “long”).
• The conditional mean E(Y|X) follows a Bernoulli distribution.
• The connection between E(Y|X) and X can be described by the logistic function, which has an “s” shape:
$$E(Y_i \mid X_i) = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}.$$
In general, the logistic function is
$$f(z) = \frac{1}{1 + e^{-z}}.$$
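A corresponding logistic-regression sketch with base R's glm, again under the assumed jobs data frame and train/test indices:

# Logistic regression; class is a factor with levels c("short", "long"),
# so "long" (the second level) is modeled as the event
logit_fit <- glm(class ~ A1 + A2 + A3 + B1 + B2 + C1 + C2 + C3 + C4,
                 family = binomial, data = jobs[train_idx, ])
# step(logit_fit) could perform the sequential model reduction noted earlier
p_long <- predict(logit_fit, newdata = jobs[test_idx, ], type = "response")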
39. Performance measures
• We use the ROC curve.
• It combines both types of error:
• Sensitivity (“true positive rate”)
- the probability of a “short” classification when the runtime is “short”.
• Specificity (“true negative rate”)
- the probability of a “long” classification when the runtime is “long”.
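Both rates follow directly from a vector of predicted classes; a sketch with assumed vectors pred and truth taking values "short"/"long", and "short" treated as the positive class per the definitions above:

# Sensitivity: P(classified "short" | truly "short")
sens <- mean(pred[truth == "short"] == "short")
# Specificity: P(classified "long" | truly "long")
spec <- mean(pred[truth == "long"] == "long")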
40. Performance optimization
• For the CART procedure, variables A1, A2, A3 and B4 were selected for the classifier.
• For performance optimization, we use a pseudo-ROC curve:
• the blue circle marks the optimal tradeoff between sensitivity and specificity,
• obtained for a mixture probability threshold of 0.45.
41. • For the logistic regression, most variables were selected for the classifier.
• For performance optimization, we compare ROC curves obtained for different thresholds, and choose threshold 0.4:
42. Validation results
• Total misclassification rates:
• CART: 9%.
• Logistic regression: 17%.
• Summary:
• Runtime can be effectively classified using the available information.
• Further evaluation of our method is required using different data sets from different installations and times.
43. Example 2: Detection of 2nd-order epistasis on multi-trait complexes

Joint work with Pavel Goldstein and Prof. Avraham Korol, University of Haifa.
44. Searching for Epistasis

Goal: search for epistatic effects (interactions between genomic loci) on expression traits.
46. Despite the growing interest in searching for epistatic interactions, there is no consensus as to the best strategy for their detection.

Suggested approach:
◦ QTL analysis – combine gene expression and mapping data.
◦ Use multi-trait complexes rather than single traits (trait = gene expression of a particular gene).
◦ Screen for potential epistatic regions in a hierarchical manner.
◦ Control the overall FDR (False Discovery Rate).
47. Multi-trait complexes

Number of tests for interactions on single traits:
number of genes (~7,200) × number of loci pairs (~120,000) = a lot!
A dimension-reduction stage can be of help!
Suggestion:
Considering correlated traits as multi-trait complexes has been shown to increase QTL detection power, mapping resolution and estimation accuracy (Korol et al., 2001).
48. Clustering traits (genes)

Use WGCNA – a weighted correlation network:
◦ Top-down hierarchical clustering.
◦ Dynamic Tree Cut algorithm: a branch-cutting method for detecting gene modules, depending on their shape.
◦ Build meta-genes by taking the first principal component of the genes from every cluster. (A workflow sketch follows.)
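A sketch of one common WGCNA workflow matching these steps (datExpr is an assumed samples × genes expression matrix; the soft-threshold power is illustrative, not the value used in the study):

library(WGCNA)
library(dynamicTreeCut)
# Weighted correlation network and topological-overlap dissimilarity
adj     <- adjacency(datExpr, power = 6)
dissTOM <- 1 - TOMsimilarity(adj)
# Hierarchical clustering of genes and Dynamic Tree Cut module detection
geneTree <- hclust(as.dist(dissTOM), method = "average")
modules  <- cutreeDynamic(dendro = geneTree, distM = dissTOM)
# Meta-genes: module eigengenes (first principal component per cluster)
MEs <- moduleEigengenes(datExpr, colors = modules)$eigengenes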
49. Testing for epistasis:
the Natural and Orthogonal Interactions (NOIA) model (Alvarez-Castro and Carlborg, 2007)

For trait t, loci-pair l (loci A and B) and replicate i:
[Equation: the vector of gene expressions is modeled as an indicator of genotype combinations for the two loci times the genotype values, plus error; the genotype values are in turn a design matrix times the vector of genetic effects, with the design matrix guaranteeing orthogonality of the effects]
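As a simplified stand-in for the full NOIA parameterization (whose equation image is not reproduced here), the speaker notes point out that the setting is conceptually a two-way analysis of variance; the epistasis null can then be illustrated as an interaction test. All names are assumed:

# mg: meta-gene values; gA, gB: genotype factors (levels A/H) at two loci
fit_add <- lm(mg ~ gA + gB)   # additive effects only
fit_epi <- lm(mg ~ gA * gB)   # adds the interaction (epistasis) term
anova(fit_add, fit_epi)       # F-test for the epistatic effect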
50. The test for epistasis is done hierarchically

[Figure: framework markers and their related secondary markers]
51. False Discovery Rate (FDR) in hierarchical testing

Yekutieli (2008) offers a procedure to control the FDR for the full tree of tests.
52. Hierarchical FDR control

A universal upper bound is derived for the full-tree FDR (Yekutieli, 2008).
An upper bound for $\delta^*$ may be estimated using the quantities $R_t^{P_i=0}$ and $R_t^{P_i=1}$, the number of discoveries in $\tau_t$ given that $H_i$ is a true null hypothesis in $\tau_t$, and a false null hypothesis, respectively.
53. Searching algorithm

STAGE 1:
Construct multi-trait complexes (using WGCNA clustering).
STAGE 2: hierarchical search
◦ Step 1:
Screen for combinations of loci-pair and multi-trait complex with potential for epistasis (NOIA model).
◦ Step 2:
Test using higher-resolution loci only for the selected regions (NOIA model).
54. Data

A sample of 210 individuals from an Arabidopsis thaliana population.
The genotypic map consists of 579 markers.
Transcript levels were quantified using Affymetrix whole-genome microarrays.
A total of 22,810 gene expressions from all five chromosomes (non-expressed genes filtered out).
55. Two-stage hierarchical testing for epistasis

STAGE 1: identified 314 gene clusters (WGCNA).
STAGE 2:
47 sparse "framework" markers that are within 10 cM of each other.
10-12 "secondary" markers related to each "framework" marker.
First step:
1,081 marker pairs (all pairs of the 47 framework markers) × 314 meta-genes = 339,434 tests
- 11 regions are identified.
Second step:
- 1,141 epistatic effects are identified.
59. Preprocessing

Variance Stabilization Normalization (VSN).
Gene-expression filtering: 7,244 genes out of 22,810.
Marker preprocessing.
60. Computational advantage

Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
Naive analysis: 121,278 loci pairs for each of 7,244 traits, namely 878,537,832 tests, would have been performed.
This is a ~2,575-fold reduction in the number of tests (878,537,832 / 341,107 ≈ 2,575.6).
62. Define a scan statistic

For gene $g$, $g = 1, \ldots, m$, let
$$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}.$$
Then the scan statistic for gene $g$ is
$$S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t).$$
For gene $g$, we test the null hypothesis that there is no $k$ such that
$$E(D_{g,k}), \ldots, E(D_{g,k+w-1}) > \delta_0,$$
where $\delta_0$ is the baseline level for the gene.
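A direct R transcription of these two formulas (D is an assumed numeric vector of point-wise statistics D_{g,p} for one gene):

# Moving sums Y_g^w(t) over windows of width w, and their maximum S_g^w
scan_stat <- function(D, w) {
  n  <- length(D)
  Yw <- vapply(seq_len(n - w + 1),
               function(t) sum(D[t:(t + w - 1)]), numeric(1))
  max(Yw)
}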
63. Peak detection

Point-wise statistics: $D_{g,p}$.
Moving-sum statistics:
$$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}, \qquad S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t).$$
64. Summary – data science

• Data science is an emerging field/profession that incorporates knowledge and expertise from several disciplines.
• It combines both big-data technologies and sophisticated methods for complicated data analysis.
• Data analysis aims to answer various questions with case-specific challenges, and should therefore be carefully tailored to the type of problem and data.
65. References
Reiner-Benaim, A., Shmueli, E. and Grabarnick, A. (submitted) A statistical learning approach for runtime prediction in Intel’s data center.

Goldstein, P., Korol, A. B. and Reiner-Benaim, A. (2014) Two-stage genome-wide search for epistasis with implementation to Recombinant Inbred Lines (RIL) populations. PLOS ONE, 9(12).

Reiner-Benaim, A. (2015) Scan statistic tail probability assessment based on process covariance and window size. Methodology and Computing in Applied Probability, in press.

Reiner-Benaim, A., Davis, R. W. and Juneau, K. (2014) Scan statistics analysis for detection of introns in time-course tiling array data. Statistical Applications in Genetics and Molecular Biology, 13(2), 173-90.
However, more than one gene may affect the trait, and then an epistatic effect is of potential interest. The Y-axis here shows gene expression and the X-axis shows genotypes of QTL1. The markers have only two levels, A or H, which is the case for a recombinant inbred line (RIL) population. The first plot represents the case of no epistasis; conceptually, it is similar to a two-way analysis of variance. In the second plot an epistatic effect is involved.
WGCNA, proposed by Zhang and Horvath, is used for gene-expression clustering.
First, top-down hierarchical clustering is applied, using weighted inter-gene distances.
Then, a branch-cutting method sensitive to branch shape is applied to detect gene modules.
Meta-genes are then defined as the first principal component of the genes from every cluster.
We propose to test the epistasis hypothesis by fitting the NOIA model of Alvarez-Castro and Carlborg, modified for second-order epistasis in RIL populations, which are homozygous. The model allows orthogonal estimation of the genetic effects.
For loci A and B, the gene-expression level for trait t, loci-pair l and replicate i can be represented as a product of the phenotypes with the corresponding genotype-combination indicators, plus an error term. In turn, the phenotypes may be represented as a multiplication of the genetic effects by a design matrix that guarantees orthogonality of the effects.
As mentioned, neighboring markers on the genotype map contain very similar information. Based on this attribute, we separated all markers into "framework" markers (marked as bold dots), which are relatively distant loci, and "secondary" markers (small vertical lines) related to the corresponding framework markers. Long vertical lines denote the borders of the "framework" marker areas. Thus our markers have a hierarchical structure.
We propose a two-stage approach for identifying QTL epistasis. The algorithm starts with an initial construction of multi-trait complexes (or meta-genes) by WGCNA clustering of the microarray gene-expression data. Then, epistasis is tested for among all combinations of such complexes and loci-pairs: starting with an initial "rough" search for pairs among framework markers, followed by a higher-resolution search only within the identified regions.
If an epistatic effect is found between markers m1 and m2, we continue the search between all pairs of markers along with their "secondary" markers (colored in yellow).
Since the number of tests involved is enormous, we should control false positives. For this purpose we use the False Discovery Rate criterion proposed by Benjamini and Hochberg; in our case it is defined as the expected proportion of erroneously identified epistasis effects among all identified ones.
Yekutieli (2008) suggested a hierarchical procedure to control the FDR across the tree of hypotheses.
In our case all hypotheses can be arranged in a 2-level structure. On the first level are the hypotheses for all combinations of multi-trait complexes and pairs of sparse "framework" markers. On the second level are the hypotheses for all combinations selected in the first level, this time using "secondary" markers related to the corresponding framework markers. We are interested in full-tree FDR control, covering all epistasis discoveries in the whole tree. The rejection threshold q should be chosen such that the full-tree FDR is controlled at level 0.1.
We implemented the algorithm on Arabidopsis data of 210 RILs.
Around 23,000 gene expressions were produced from all five chromosomes.
Then we applied our algorithm:
314 gene clusters were identified (WGCNA).
For the first stage of hierarchical testing, 47 sparse "framework" markers that are within 10 cM of each other were used.
10-12 "secondary" markers were placed for each framework area.
In total we tested around 440,000 epistatic hypotheses.
The Variance Stabilization Normalization (VSN) uses a generalized log transformation.
After filtering non-expressed genes, 7,244 genes out of 22,810 remained.
We also filtered out bad or non-informative markers.
Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
If instead all possible combinations of markers and raw traits were tested in one stage, about 900,000,000 tests would have been performed.