7 Computational Giants of Massive Data Analysis
Instructor: Assoc. Prof. PhD. Nguyễn Thanh Bình
Master students:
Đoàn Đức Thế Anh (22C01001)
Võ Nam Thục Đoan (22C01004)
Nguyễn Ngọc Bảo Trân (22C01021)
Trần Trung Hiếu (22C01009)
CHAPTER 10
Massive data analysis:
• cannot be processed on a stand-alone computer
• requires the use of existing (distributed and parallel) hardware platforms
• poses challenges to traditional statistical methods and algorithms
• calls for an overall system architecture
Tasks of machine learning / data mining (with naive complexities):
1. Querying: orthogonal range-search, nearest-neighbor O(N); all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²); kernel conditional density estimation O(N³)
3. Classification: decision tree, nearest-neighbor classifier O(N²); support vector machine O(N³)
4. Regression: linear regression, LASSO, kernel regression O(N²); Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³); maximum variance unfolding O(N³)
6. Clustering: k-means, mean-shift O(N²); hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³); bipartite cross-matching O(N³); n-point correlation 2-sample testing O(Nⁿ)
The “7 Computational Giants” of Data (computational problem types):
1. Basic statistics
2. Generalized N-body problems
3. Graph-theoretic computations
4. Linear-algebraic computations
5. Optimization
6. Integration
7. Alignment problems
Basic statistics
• Descriptive statistics: summarize the data and provide insights into its
  – central tendency: mean, median, mode
  – variability: variance, standard deviation, count, min, max, quartiles, skewness, and kurtosis
  – frequency distribution
N data points → O(N) calculations
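To make the O(N) claim concrete, here is a minimal one-pass sketch (plain Python, no external dependencies; the function name describe and the sample data are illustrative, and the variance uses Welford's online algorithm):

import math

def describe(xs):
    """One pass over N data points: O(N) time, O(1) extra space."""
    n, mean, m2 = 0, 0.0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n          # running mean (Welford)
        m2 += delta * (x - mean)   # running sum of squared deviations
        lo, hi = min(lo, x), max(hi, x)
    var = m2 / (n - 1) if n > 1 else 0.0
    return {"count": n, "mean": mean, "std": math.sqrt(var),
            "min": lo, "max": hi}

print(describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))

Median and quartiles additionally require sorting or selection; the summaries above need only a single O(N) pass.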
Basic statistics
• Inferential statistics:
  – generalize results to larger populations based on small samples
  – look at how things change over time
  – use sampling methods to find samples that are representative of the whole population
  – determine what is happening
N data points → O(N²) calculations
Why is statistical computing important in research and decision-making?
• Evidence-based analysis
• Exploring relationships between variables
• Evaluating the effectiveness of interventions
• Contributing to improved outcomes
• A vital role in fields such as healthcare, finance, marketing, and the social sciences
Basic statistics - Challenges
• High dimensionality → noise accumulation, spurious correlations, incidental endogeneity
• High dimensionality + large sample size → heavy computational cost, algorithmic instability
• Big² Data (from multiple sources, at different time points, using different technologies) → heterogeneity, experimental variations, statistical biases
These effects lead to wrong statistical inference and false scientific conclusions.
Basic statistics - Solutions
• New statistical thinking: variable selection, dimension reduction, new regularization methods, independence screening
• New computational methods: the development of new computational infrastructure and data storage methods
Generalized N-body problem
• In the 17th century, Sir Isaac Newton formulated:
  – the laws of motion
  – the law of universal gravitation
→ the behavior of objects and their interactions
→ Origin of the N-body problem: predicting the motions of N celestial objects interacting with each other gravitationally
• Karl Fritiof Sundman solved the case n = 3
• L. K. Babadzanjanz and Qiudong Wang generalized the solution to n > 3
N-body problem (figure captions):
• Three bodies with equal mass [published 2000]
• Three bodies of unequal mass
• Two pairs of bodies orbiting about each other
• An orbit discovered in 2008 by Tiancheng Ouyang, Duokui Yan, and Skyler Simmons at BYU
Generalized N-body problem - Challenges
• Numerical approximations
• Chaotic behavior
• Interdisciplinary nature
• Main obstacle: O(N²) pairwise interactions
Generalized N-body problem - Solutions
• Barnes-Hut Algorithm [Barnes and Hut, 87]: naive pairwise summation costs N(N-1)/2 = O(N²). Build a quadtree/octree over the points; for a query point xᵢ and a cell R of side length s at distance r, if s/r < θ, approximate the cell's contribution by its centroid:
  Σ_{xⱼ ∈ R} K(xᵢ, xⱼ) ≈ N_R · K(xᵢ, x̄_R),
where N_R is the number of points in R and x̄_R its centroid → O(N log N).
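As a concrete illustration (not from the slides): a minimal 2-D Barnes-Hut sketch in Python, assuming a softened 1/r kernel; THETA, Cell, bh_sum, and the random data are all illustrative choices.

import random

THETA = 0.5  # opening angle: a cell is "far enough" when size/distance < THETA

class Cell:
    """A quadtree cell tracking its point count and centroid."""
    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.n, self.cx, self.cy = 0, 0.0, 0.0
        self.children = None
        self.point = None

    def insert(self, p):
        # update running centroid and count
        self.cx = (self.cx * self.n + p[0]) / (self.n + 1)
        self.cy = (self.cy * self.n + p[1]) / (self.n + 1)
        self.n += 1
        if self.n == 1:              # empty leaf: store the point
            self.point = p
            return
        if self.children is None:    # split a full leaf into 4 quadrants
            h = self.size / 2
            self.children = [Cell(self.x0 + dx * h, self.y0 + dy * h, h)
                             for dy in (0, 1) for dx in (0, 1)]
            old, self.point = self.point, None
            self._child_for(old).insert(old)
        self._child_for(p).insert(p)

    def _child_for(self, p):
        h = self.size / 2
        ix = 1 if p[0] >= self.x0 + h else 0
        iy = 1 if p[1] >= self.y0 + h else 0
        return self.children[iy * 2 + ix]

def kernel(xi, xj):
    """Illustrative kernel K(x_i, x_j): a softened 1/r potential."""
    dx, dy = xi[0] - xj[0], xi[1] - xj[1]
    return 1.0 / (dx * dx + dy * dy + 1e-6) ** 0.5

def bh_sum(cell, xi):
    """Approximate sum_j K(xi, x_j) over the points in `cell`."""
    if cell.n == 0:
        return 0.0
    if cell.children is None and cell.n == 1:   # single-point leaf: exact
        return 0.0 if cell.point is xi else kernel(xi, cell.point)
    dx, dy = xi[0] - cell.cx, xi[1] - cell.cy
    r = (dx * dx + dy * dy) ** 0.5
    if r > 0 and cell.size / r < THETA:         # far cell: N_R * K(xi, centroid)
        return cell.n * kernel(xi, (cell.cx, cell.cy))
    return sum(bh_sum(c, xi) for c in cell.children)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(2000)]
root = Cell(0.0, 0.0, 1.0)
for p in pts:
    root.insert(p)
exact = sum(kernel(pts[0], q) for q in pts if q is not pts[0])
print(f"exact={exact:.2f}  barnes-hut={bh_sum(root, pts[0]):.2f}")

Smaller θ trades speed for accuracy; θ = 0 recovers the exact O(N²) sum.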
Generalized N-body problem - Solutions
• Fast Multipole Method [Greengard and Rokhlin 1987]: the naive sum Φ(x) = Σᵢ K(x, xᵢ) over all N(N-1)/2 = O(N²) pairs is approximated by a multipole/Taylor expansion of order p on a quadtree → O(N).
[Callahan-Kosaraju 95]: O(N) is impossible for a log-depth tree.
Linear Algebraic computations
Problems involve matrix operations: solving linear systems, finding eigenvalues and eigenvectors, inverses, orthogonality, ...
Examples: linear regression, SVD, PCA, clustering, graph analysis, image processing (edge detection, compression, blurring, ...)
(Figure captions: linear regression, SVD, PCA, clustering, kernel for edge detection)
Linear Algebraic computations - Challenges
• Matrices with slowly decaying spectra → high computational complexity, sensitivity to noise.
• Nearly singular matrices, det(M) ≈ 0 → nearly non-invertible, sensitive to small changes in the matrix entries.
→ Some solution approaches:
• Truncated SVD, regularization, pseudoinverse via SVD
• Random sampling + statistical methods
  E.g.: choose a random submatrix from the given matrix, based on suitable probability distributions, to approximate the SVD of the whole.
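As a concrete illustration of the random-sampling idea, here is a minimal randomized-SVD sketch in NumPy; the rank k, oversampling p, and matrix sizes are illustrative, and this follows the standard random-projection recipe rather than anything prescribed by the slides.

import numpy as np

def randomized_svd(A, k, p=10):
    """Approximate the top-k SVD of A via random projection:
    O(mnk) work instead of a full O(mn*min(m,n)) SVD."""
    n = A.shape[1]
    # 1. Sketch A's column space with a random Gaussian test matrix.
    omega = np.random.default_rng(0).standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ omega)        # orthonormal basis for the sketch
    # 2. Project A onto the basis and do a small exact SVD.
    B = Q.T @ A                           # (k+p, n), small
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

A = np.random.default_rng(1).standard_normal((2000, 500))
U, s, Vt = randomized_svd(A, k=20)
err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)
print(f"relative error of rank-20 approximation: {err:.3f}")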
Linear Algebraic computations - Challenges
Other challenges:
• Optimization problems: generic LA approaches yield high training accuracy, which can cause overfitting
  → gradient descent, random sampling
• The data grows so massive that it cannot be stored or handled by a single device
  → distributed linear algebra (see the block-partitioned sketch below)
(Figure captions: gradient descent; matrices checkerboard-distributed on TPUs during multiplication)
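To sketch the checkerboard idea: each device owns one block of each operand, and the full product is assembled from block products. The NumPy toy below partitions the matrices on a single machine purely to illustrate the distribution pattern; the block size and shapes are illustrative.

import numpy as np

def blocked_matmul(A, B, bs):
    """Compute C = A @ B from bs-by-bs blocks, mimicking a 2-D
    (checkerboard) layout where each device owns one block of C."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % bs == k % bs == n % bs == 0
    C = np.zeros((m, n))
    for i in range(0, m, bs):          # block-row of C (device row)
        for j in range(0, n, bs):      # block-column of C (device column)
            for l in range(0, k, bs):  # accumulate partial block products
                C[i:i+bs, j:j+bs] += A[i:i+bs, l:l+bs] @ B[l:l+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
print(np.allclose(blocked_matmul(A, B, bs=4), A @ B))  # True

In a real distributed setting, each (i, j) block lives on its own device and only the needed A and B blocks are communicated.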
Optimization
Optimization problems appear in statistical methods from early on, and frequently.
E.g.: semidefinite programming in manifold learning.
→ Optimization generally focuses on minimizing/maximizing an objective function, from unconstrained to constrained problems, both convex and non-convex.
(Figure captions: linear programming; quadratic programming)
Optimization - Challenges
• A large number of variables and constraints
• Finding a global solution for non-convex problems is an open problem
• Problems with integer constraints (integer programming)
• Challenging problems, such as high-dimensional nonlinear objective problems, may contain multiple local optima in which deterministic optimization algorithms may get stuck
Optimization
Some approaches:
• Exploit the particular mathematical form of certain problems to find more effective optimizers.
  E.g.: Sequential Minimal Optimization decomposes the SVM training problem into sub-problems by iteratively selecting 2 Lagrange multipliers to solve.
• Stochastic optimization (introduce randomness) + online learning.
  E.g.: Stochastic Gradient Descent iteratively updates parameters with a random subset of the data instead of the entire data set (see the sketch below).
(Figure caption: online learning)
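A minimal mini-batch SGD sketch for linear least squares; the learning rate, batch size, and synthetic data are illustrative, and this is the generic recipe rather than code from the slides.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic linear-regression data: y = X @ w_true + noise.
X = rng.standard_normal((10_000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(10_000)

w = np.zeros(5)
lr, batch = 0.01, 32
for step in range(2_000):
    idx = rng.integers(0, len(X), size=batch)   # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch * Xb.T @ (Xb @ w - yb)     # gradient of mean squared error
    w -= lr * grad                              # parameter update
print(np.round(w, 2))  # close to w_true

Each step touches only `batch` rows, so the cost per update is independent of N.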
Optimization
Some approaches:
• Distributed optimization.
  E.g.: TensorFlow, PyTorch distribute the optimization process (a) across processors, (b) across multiple nodes.
Graph-Theoretic Computations
• Graph-theoretic computations
involve traversing graphs, which
can be the data itself or
represent statistical models.
• Common statistical
computations on graphs include
betweenness centrality and
commute distances, used to
identify nodes or communities of
interest.
• Large-scale, sparse graphs
present computational
challenges for these
computations.
Challenges and Approaches
• Challenges: high interconnectivity in graphs, large maximal clique size, and memory constraints.
• Notable approaches:
  • Sampling and disk-based methods for handling large graphs.
  • Parallel/distributed approaches using sparse linear algebra or graph concepts.
  • Graph partitioning and linear algebraic preconditioning for efficient computations.
  • Transformation of graphical model inference problems into optimization or variational methods.
  • Sampling and parallel/distributed approaches for graphical model inference.
Additional applications:
• Manifold learning methods: Isomap requires an all-pairs-shortest-paths computation.
• Single-linkage hierarchical clustering: equivalent to computing a minimum spanning tree.
• These examples highlight the intersection between graph computations and distance-based or N-body-type problems.
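To illustrate the MST connection: cutting the k-1 heaviest MST edges yields exactly the clusters of single-linkage clustering with k clusters. A minimal Kruskal sketch in pure Python; the union-find helper and the tiny edge list are illustrative.

def kruskal_mst(n, edges):
    """edges: list of (weight, u, v). Returns the MST edge list.
    With union-find this is O(E log E), dominated by the sort."""
    parent = list(range(n))
    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge joins two components: keep it
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# 5 points on a line at positions 0, 1, 2, 10, 11; pairwise distances as edges.
pos = [0, 1, 2, 10, 11]
edges = [(abs(pos[i] - pos[j]), i, j)
         for i in range(5) for j in range(i + 1, 5)]
print(kruskal_mst(5, edges))
# Dropping the heaviest MST edge (weight 8) gives clusters {0,1,2} and {3,4}.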
Integration in Data Analysis
• Integration is a key computation in data analysis, essential for Bayesian inference and statistical modeling.
• Challenges arise with high-dimensional integrals, requiring specialized approaches.
Approaches to High-Dimensional Integration
1. Markov Chain Monte Carlo (MCMC)
   – Default approach for high-dimensional integration (see the sketch after this list).
   – Utilizes a sequence of random samples to approximate the integral.
   – Widely used in Bayesian inference and random-effects models.
2. Approximate Bayesian Computation (ABC) Methods
   – Operate on summary data to accelerate computation.
   – Useful for cases where exact inference is challenging.
   – Achieve acceleration by working with population means or variances.
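A minimal Metropolis-Hastings sketch estimating E[||x||²] under a 10-dimensional standard Gaussian; the target, proposal scale, and sample counts are illustrative, and the exact answer is 10.

import numpy as np

rng = np.random.default_rng(0)
d = 10
log_target = lambda x: -0.5 * x @ x    # log-density of N(0, I_d), up to a constant

x = np.zeros(d)
samples, accepted = [], 0
for step in range(50_000):
    prop = x + 0.5 * rng.standard_normal(d)        # random-walk proposal
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x, accepted = prop, accepted + 1           # accept the move
    samples.append(x @ x)                          # integrand ||x||^2

burn = 5_000                                       # discard warm-up samples
print(f"E[||x||^2] ~= {np.mean(samples[burn:]):.2f} (exact: {d})")
print(f"acceptance rate: {accepted / 50_000:.2f}")

Only evaluations of the (unnormalized) density are needed, which is what makes MCMC viable when the normalizing integral itself is intractable.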
Alternative Approaches and Strategies
1. Population Monte Carlo
   – Form of adaptive importance sampling.
   – Enhances the efficiency of Monte Carlo integration.
   – Particularly useful for certain sequential models, such as particle filtering.
2. Variational Methods
   – Convert integration problems into optimization problems.
   – Provide a general framework for approximate inference.
   – Offer an alternative strategy to address high-dimensional integration challenges.
3. Optimization-Based Point Estimation
   – Skirts the full integration problem.
   – Used in approaches like maximum a posteriori inference and empirical Bayesian inference.
   – Involves optimizing point estimates rather than performing full Bayesian inference.
Alignment
Genomic data science
Genomic data science emerged as a field in the
1990s to bring together two laboratory activities:
Experimentation: Generating genomic
information from studying the genomes of
living organisms
Data analysis: Using statistical and
computational tools to analyze and
visualize genomic data, which includes
processing and storing data and using
algorithms and software to make
predictions based on available genomic
data
Facts
• Data about a single human genome sequence alone would take up 200 gigabytes.
• An estimated 40 exabytes will be needed to store the genome-sequence data generated worldwide by 2025.
DNA to RNA to Protein, Illustrating the Genetic Code
Sequence alignment
Questions about sequences
1. Biological question: “How similar are the genomes of humans and
chimpanzees?”
– Computational question: Given two sequences r and s, compute
their similarity, sim(s,r)
2. Biological question: “This gene causes obesity in mice. Do humans
have the same gene?”
– Computational question: Given a sequence r (the mouse gene)
and a database D of sequences (all human genes), find
sequences s in D where sim(r,s) is above a threshold
Questions about sequences
3. Biological question: “We know some mutations of this gene cause sickle-cell anemia. We have the sequences of 100 patients and 100 normal people. Let's find out the disease-causing mutations.”
   – Computational question: Given two sets of sequences of different lengths, find an alignment that maximizes the overall similarity. Then look for mutations that are unique to one group.
Patients:  ACGCGT   ACGCGT   ACGCGT
           CGCGT    _CGCGT   _CGCGT
           ACGCGA   ACGCGA   ACGCGA
Control:   AGCTT    A_GCTT   A_GCTT
           ACGCTT   ACGCTT   ACGCTT
           ACGCTA   ACGCTA   ACGCTA
Performing alignment makes it easy to compute the similarity between two sequences.
Scoring function
To compare the similarity of two strings up to changes such as mutation, insertion, and deletion. For the string AGGCCTC:
Mutation:  AGG A CTC
Insertion: AGG G CTCT
Deletion:  AGG . CTC
Symbols:
Match: +m
Mismatch: -s
Gap: -d
Simple scoring function: F = (#matches) × m - (#mismatches) × s - (#gaps) × d
The total score reflects the quality of the alignment.
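The editor's notes point to a Needleman-Wunsch demo; here is a minimal sketch of that algorithm using the scoring function above (the values m = 1, s = 1, d = 1 are illustrative).

def needleman_wunsch(r, s, m=1, mis=1, gap=1):
    """Global alignment score: F = #matches*m - #mismatches*mis - #gaps*gap.
    Classic O(len(r) * len(s)) dynamic program."""
    R, S = len(r), len(s)
    F = [[0] * (S + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        F[i][0] = -i * gap                 # prefix of r aligned against gaps
    for j in range(1, S + 1):
        F[0][j] = -j * gap
    for i in range(1, R + 1):
        for j in range(1, S + 1):
            diag = F[i-1][j-1] + (m if r[i-1] == s[j-1] else -mis)
            F[i][j] = max(diag,            # match / mismatch
                          F[i-1][j] - gap, # gap in s
                          F[i][j-1] - gap) # gap in r
    return F[R][S]

print(needleman_wunsch("ACGCGT", "CGCGT"))  # 4: one leading gap, five matches

Tracing back through the table recovers the alignment itself, e.g. _CGCGT for the pair above.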
Standard of alignment: the highest score?
Problems
Solutions
Thank you for your time 😊
Editor's Notes

• #2 Massive data refers to a large amount of data that is too difficult to process using traditional tools like spreadsheets or text processors. It can exist in structured or unstructured form and consists of petabytes and exabytes of data. Big data can be analyzed for insights that improve decisions and give confidence for making strategic business moves. Processing massive data, also known as big data, can present several challenges, including: storage, processing speed, data quality, security, data integration, cost, and scalability.
• #3 Introduce massive data → system architecture.
• #16 Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.
• #17 The inverse of a nearly singular matrix may be highly sensitive to small changes in the matrix entries. Nearly non-invertible → iterative methods.
• #18 Distributed linear algebra: dividing the computational workload and data across multiple processing units.
• #19 Linear programming: determine the best outcome in a linear mathematical model, given a set of linear constraints; LA computations are a special case (2nd-order optimization). Quadratic programming: quadratic objective function and linear constraints. 2nd-order cone programming: linear objective, linear constraints including a 2nd-order cone. Semidefinite programming deals with the optimization of linear objective functions subject to linear matrix inequality constraints; it generalizes linear programming to handle optimization problems involving positive semidefinite matrices. Manifold learning: learning the structure of high-dimensional data and representing it with fewer dimensions.
• #20 Optimization problems are expressed as mathematical models; training an SVM requires solving a very large QP, which takes a lot of time. A stochastic program is an optimization problem in which some or all problem parameters are uncertain, but follow known probability distributions. This framework contrasts with deterministic optimization, in which all problem parameters are assumed to be known exactly.
• #21 SMO exploits the particular structure of the SVM's quadratic optimization problem by iteratively selecting two Lagrange multipliers and solving a sub-problem to update them. The objective function aims to maximize the margin between the decision boundary and the support vectors while minimizing the classification errors. The Lagrange multipliers (α values) are the variables to be optimized. The constraints ensure that the sum of the Lagrange multipliers weighted by the corresponding target variables is zero and that the Lagrange multipliers are within a specified range (0 ≤ α[i] ≤ C). In online learning, the GD algorithm in deep learning receives a sequence of data points one at a time and updates its model iteratively. Stochastic optimization is the use of randomness in the objective function or in the optimization algorithm.
• #38 To compare the similarity between two strings under changes such as mutation, insertion, or deletion; example string AGGCCTC. Interactive demo for Needleman–Wunsch algorithm (mostafa.io)
• #39 Criteria for evaluating an alignment.
• #41 To address these problems and achieve computational efficiency, one can look to the following directions: sampling, parallel/distributed computing, algorithms.
  • #44 Interactive demo for Needleman–Wunsch algorithm (mostafa.io)