7 Computational Giants of
Massive Data Analysis
Instructor: Assoc. Prof. PhD. Nguyễn Thanh Bình
Master students:
Đoàn Đức Thế Anh
Võ Nam Thục Đoan
Nguyễn Ngọc Bảo Trân
Trần Trung Hiếu
22C01001
22C01004
22C01021
22C01009
CHAPTER 10
Massive data analysis
cannot be processed using a stand-alone computer
use of existing (distributed and parallel) hardware platforms
challenges to traditional statistical methods and algorithms
overall system architecture
Tasks of machine learning / data mining
Querying
• orthogonal range-search, nearest-neighbor: O(N)
• all-nearest-neighbors: O(N^2)
Density estimation
• mixture of Gaussians, kernel density estimation: O(N^2)
• kernel conditional density estimation: O(N^3)
Classification
• decision tree, nearest-neighbor classifier: O(N^2)
• support vector machine: O(N^3)
Regression
• linear regression, LASSO, kernel regression: O(N^2)
• Gaussian process regression: O(N^3)
Dimension reduction
• PCA, non-negative matrix factorization, kernel PCA: O(N^3)
• maximum variance unfolding: O(N^3)
Clustering
• k-means, mean-shift: O(N^2)
• hierarchical (FoF) clustering: O(N^3)
Testing and matching
• MST: O(N^3)
• bipartite cross-matching: O(N^3)
• n-point correlation 2-sample testing: O(N^n)
The “7 Computational Giants” of Data
(computational problem types)
1. Basic statistics
2. Generalized N-body problem
3. Graph-theoretic computations
4. Linear-algebraic computations
5. Optimization
6. Integration
7. Alignment problems
Basic statistics
• Descriptive statistics: summarize the data and provide insights into its
– central tendency: mean, median, mode
– variability: variance, standard deviation, count, min, max, quartiles, skewness and kurtosis
– frequency distribution
N data points → O(N) calculations
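To make the O(N) claim concrete, here is a minimal sketch that computes several of these summaries in a single pass over the data. It uses Welford's running update for the variance, which is a choice made here for illustration and is not named on the slide; the function name `describe` and the sample values are hypothetical. Median and quartiles are left out because they need sorting or a selection algorithm rather than a running update.

```python
import math

def describe(values):
    """Single pass over the data: count, min, max, mean, sample variance."""
    n = 0
    mean = 0.0
    m2 = 0.0            # running sum of squared deviations from the current mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # Welford's update, avoids a second pass
        lo = min(lo, x)
        hi = max(hi, x)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return {"count": n, "min": lo, "max": hi, "mean": mean,
            "variance": variance, "std": math.sqrt(variance)}

# Hypothetical sample data.
stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Each data point is touched once and the memory used does not grow with N, which is what makes this class of computation the easiest of the seven to scale.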
Basic statistics
• Inferential statistics:
– generalize results to larger populations based on small samples
– look at how things change over time
– use sampling methods to find samples that are representative of the whole population
– determine what is happening
N data points → O(N^2) calculations
Why is statistical computing important in
research and decision-making?
Evidence-based analysis
Explore relationships between variables
Evaluating the effectiveness of interventions
Contributing to improved outcomes
A vital role in fields: healthcare, finance, marketing, and social
sciences
Basic statistics - Challenges
• High dimensionality
– noise accumulation
– spurious correlations
– incidental homogeneity
• High dimensionality + large sample size
– heavy computational cost
– algorithmic instability
• Big Data: from multiple sources, at different time points, using different technologies
– heterogeneity
– experimental variations
– statistical biases
→ Consequences: false scientific conclusions, wrong statistical inference, statistical biases
Basic statistics - Solutions
• New statistical thinking
– variable selection
– dimension reduction
– new regularization methods
– independence screening
• New computational methods
– the development of new computational infrastructure and data storage methods
Generalized N-body problem
• In the 17th century, Sir Isaac Newton formulated:
– The laws of motion
– The law of universal gravitation
→ the behavior of objects and their interactions
→ Origin of the N-body problem: predicting the motions of N celestial objects interacting with each other gravitationally
• Karl Fritiof Sundman: solved for n = 3
• L. K. Babadzanjanz and Qiudong Wang: generalized to n > 3
N-body problem
• Three bodies with equal mass [published 2000]
• Three bodies of unequal mass
• Two pairs of bodies orbiting about each other
• An orbit discovered in 2008 by Tiancheng Ouyang, Duokui Yan, and Skyler Simmons at BYU
Generalized N-body problem - Challenges
• Numerical approximations
• Chaotic behavior
• Interdisciplinary nature
• Main obstacle: O(N^2) pairwise interactions
Generalized N-body problem - Solutions
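• Barnes-Hut Algorithm [Barnes and Hut, 87]:
$$\forall x:\quad \sum_{i \in R} K(x, x_i) \approx N_R \, K(x, \mu_R) \quad \text{if } \frac{s}{r} < \theta$$
where $R$ is a tree node containing $N_R$ points with centre $\mu_R$, $s$ is the size of the node, $r$ is the distance from $x$ to the node, and $\theta$ is the opening threshold.
Direct summation: N(N-1)/2 = O(N^2) → Barnes-Hut: O(N log N)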
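The approximation at the heart of Barnes-Hut replaces the sum of kernel values over all points in a distant node R by one kernel evaluation at the node's centre, scaled by the number of points in the node. The sketch below checks only that single approximation step, not the full tree algorithm. The inverse-distance kernel, the node of 1000 random points, and the two query distances are all assumptions chosen for illustration.

```python
import math
import random

def kernel(x, y):
    """Hypothetical kernel: inverse distance between two 2-D points."""
    return 1.0 / math.dist(x, y)

def direct_sum(x, points):
    """Exact contribution of a node: one kernel evaluation per point."""
    return sum(kernel(x, p) for p in points)

def monopole(x, points):
    """Barnes-Hut style approximation: N_R * K(x, centre of the node)."""
    n = len(points)
    centre = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    return n * kernel(x, centre)

rng = random.Random(0)
# A node R: 1000 points inside a square of side s = 1 centred at the origin.
node = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(1000)]

def relative_error(r):
    """Relative error of the approximation for a query at distance ~r."""
    query = (r, 0.0)
    exact = direct_sum(query, node)
    return abs(monopole(query, node) - exact) / exact

far_error = relative_error(20.0)   # s / r = 0.05: node is well separated
near_error = relative_error(1.0)   # s / r = 1.0: node is too close
```

The error shrinks as s / r decreases, which is why the algorithm only applies the shortcut to nodes that pass the s / r threshold and descends into the children of the others.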
Generalized N-body problem - Solutions
• Fast Multipole Method [Greengard and Rokhlin 1987]:
$$\forall x:\quad \sum_{i} K(x, x_i)$$
computed in O(N) using a multipole/Taylor expansion of order p on a quadtree
• [Callahan-Kosaraju 95]: O(N) is impossible for log-depth tree
• Direct summation: N(N-1)/2 = O(N^2)
Linear Algebraic computations
Problems involve matrix operations, solving linear systems, and finding eigenvalues, eigenvectors, inverses, orthogonality, ...
Examples: linear regression, SVD, PCA, clustering, graph analysis, image processing (edge detection, compression, blurring, ...)
[Figures: linear regression, SVD, PCA, clustering, kernel for edge detection]
Linear Algebraic computations - Challenges
- Matrices with slowly decaying spectra → high computational complexity, sensitive to noise.
- Nearly singular matrix, det(M) ≈ 0 → nearly non-invertible, sensitive to small changes in matrix entries.
→ Some solution approaches:
- Truncated SVD, regularization, pseudoinverse using SVD
- Random sampling + statistical methods
E.g.: Choose a random submatrix, based on suitable probability distributions, from the given matrix to approximate the SVD of the whole.
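The sensitivity of a nearly singular system, and the effect of regularization, can be seen on a 2x2 example. This is a minimal sketch under stated assumptions: the matrix, the perturbation of the right-hand side, and the regularization strength `lam` are hypothetical values picked for illustration, and Tikhonov (ridge) regularization is used as one instance of the "regularization" item above.

```python
def solve_2x2(a, b):
    """Solve the 2x2 system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

def ridge_solve_2x2(a, b, lam):
    """Tikhonov-regularised solve: (a^T a + lam * I) x = a^T b."""
    ata = [[sum(a[k][i] * a[k][j] for k in range(2)) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    atb = [sum(a[k][i] * b[k] for k in range(2)) for i in range(2)]
    return solve_2x2(ata, atb)

m = [[1.0, 1.0], [1.0, 1.0001]]   # det(M) is about 1e-4: nearly singular
b_noisy = [2.0, 2.0011]           # exact b = [2, 2.0001] gives x = [1, 1];
                                  # here the second entry is off by 1e-3

plain = solve_2x2(m, b_noisy)                        # roughly [-9, 11]
regularised = ridge_solve_2x2(m, b_noisy, lam=1e-3)  # close to [1, 1]
```

A perturbation of 1e-3 in one entry moves the unregularised solution by about 10 in each coordinate, while the regularised solve stays near [1, 1]. The price is a small bias, and the right value of `lam` depends on the noise level.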
Linear Algebraic computations - Challenges
Other challenges:
- Optimization problems: generic linear-algebra approaches yield high training accuracy, which can lead to overfitting
→ Gradient descent, random sampling
- The data grows so massive that it cannot be stored or handled by a single device
→ Distributed linear algebra
[Figures: gradient descent; matrices are checkerboard-distributed on TPU during multiplication]
Optimization
• Appears in statistical methods early on and frequently
E.g.: semidefinite programming in manifold learning.
→ Optimization generally focuses on minimizing or maximizing an objective function.
• Linear programming, quadratic programming
• From unconstrained to constrained, both convex and non-convex
Optimization - Challenges
- A large number of variables and constraints
- Finding a global solution for non-convex problems is an open problem.
- Problems with integer constraints (integer programming).
- Challenging problems, such as high-dimensional nonlinear objective problems, may contain multiple local optima in which deterministic optimization algorithms may get stuck
Optimization
Some approaches:
- Exploit the particular mathematical forms of certain problems to find more effective optimizers
E.g.: Sequential Minimal Optimization decomposes SVM training into sub-problems by iteratively selecting 2 Lagrange multipliers to solve for
- Stochastic optimization (introduce randomness) + online learning
E.g.: Stochastic Gradient Descent iteratively updates parameters using a random subset of the data instead of the entire data set.
[Figure: online learning]
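The stochastic gradient descent idea, updating parameters from a random subset of the data at each step, can be sketched on a one-variable linear regression. Everything specific here is an assumption for illustration: the synthetic data (true slope 3.0, intercept 0.5), the learning rate, the batch size, and the function name `sgd_linear_regression`.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.05, epochs=50, batch_size=8, seed=0):
    """Fit y ~ w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                 # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error on this random subset only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data: y = 3.0 * x + 0.5 plus small Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + 0.5 + data_rng.gauss(0.0, 0.1) for x in xs]

w, b = sgd_linear_regression(xs, ys)
```

Each update costs O(batch_size) rather than O(N), so the cost per step does not grow with the size of the data set. This is also why the method fits online learning, where data arrive one batch at a time.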
Optimization
Some approaches:
- Distributed optimization
E.g.: TensorFlow, PyTorch
[Figure: distributing the optimization process a) across processors, b) across multiple nodes]
Graph-Theoretic Computations
• Graph-theoretic computations
involve traversing graphs, which
can be the data itself or
represent statistical models.
• Common statistical
computations on graphs include
betweenness centrality and
commute distances, used to
identify nodes or communities of
interest.
• Large-scale, sparse graphs
present computational
challenges for these
computations.
Challenges and Approaches
• Challenges: high interconnectivity in graphs, large maximal clique size, and memory constraints.
• Notable approaches:
– Sampling and disk-based methods for handling large graphs.
– Parallel/distributed approaches using sparse linear algebra or graph concepts.
– Graph partitioning and linear algebraic reconditioning for efficient computations.
– Transformation of graphical model inference problems into optimization or variational methods.
– Sampling and parallel/distributed approaches for graphical model inference.
Additional Applications:
• Manifold learning methods: Isomap requires an all-pairs-shortest-paths computation.
• Single-linkage hierarchical clustering: Equivalent to computing a
minimum spanning tree.
• These examples highlight the intersection between graph
computations and distance-based or N-body-type problems.
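The link between single-linkage clustering and the minimum spanning tree can be sketched directly: build the MST of the complete distance graph, then drop its k - 1 longest edges to obtain k clusters. The O(N^2) Prim's algorithm, the helper names, and the six sample points below are assumptions made for illustration.

```python
import math

def prim_mst(points):
    """MST of the complete Euclidean graph using O(N^2) Prim's algorithm."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def single_linkage(points, k):
    """k clusters: keep the MST minus its k - 1 longest edges."""
    kept = sorted(prim_mst(points))[:len(points) - k]
    label = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for _, u, v in kept:
        label[find(u)] = find(v)
    return [find(i) for i in range(len(points))]

# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = single_linkage(points, k=2)
```

The pairwise distance loop inside Prim's algorithm is the same all-pairs pattern as in the N-body problem, which is the overlap noted above.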
Integration in Data Analysis
• Integration is a key computation
in data analysis, essential for
Bayesian inference and statistical
modeling.
• Challenges arise with high-
dimensional integrals, requiring
specialized approaches.
Approaches to High-Dimensional Integration
1. Markov Chain Monte Carlo (MCMC)
– Default approach for high-dimensional integration.
– Utilizes a sequence of random samples to
approximate the integral.
– Widely used in Bayesian inference and random
effects models.
2. Approximate Bayesian Computation (ABC) Methods
– Operate on summary data to accelerate
computation.
– Useful for cases where exact inference is
challenging.
– Achieves acceleration by working with population
means or variances.
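As a minimal sketch of how MCMC turns integration into sampling, the following random-walk Metropolis sampler draws from a one-dimensional standard normal and approximates two integrals, E[x] and E[x^2], by sample averages. The target density, the proposal step size, the chain length, and the burn-in length are all assumptions chosen for illustration. Real uses involve high-dimensional posteriors where such integrals have no closed form.

```python
import math
import random

def metropolis(log_density, n_samples, step=1.0, start=0.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalised log-density."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)      # a rejected proposal repeats the current state
    return samples

def log_std_normal(x):
    """Standard normal log-density with the normalising constant omitted."""
    return -0.5 * x * x

samples = metropolis(log_std_normal, n_samples=50000)
kept = samples[5000:]                                   # discard burn-in
mean_estimate = sum(kept) / len(kept)                   # approximates E[x] = 0
second_moment = sum(s * s for s in kept) / len(kept)    # approximates E[x^2] = 1
```

The sampler only needs density ratios, so the normalising constant never has to be computed. That is what makes the approach usable for Bayesian posteriors.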
Alternative Approaches and Strategies
1. Population Monte Carlo
– Form of adaptive importance sampling.
– Enhances the efficiency of Monte Carlo integration.
– Particularly useful for certain sequential models, such as particle
filtering.
2. Variational Methods
– Convert integration problems into optimization problems.
– Provide a general framework for approximate inference.
– Offers an alternative strategy to address high-dimensional integration
challenges.
3. Optimization-Based Point Estimation
– Skirts the full integration problem.
– Used in approaches like maximum a posteriori inference and empirical
Bayesian inference.
– Involves optimizing point estimates rather than performing full Bayesian inference.
Alignment
Genomic data science
Genomic data science emerged as a field in the
1990s to bring together two laboratory activities:
Experimentation: Generating genomic
information from studying the genomes of
living organisms
Data analysis: Using statistical and
computational tools to analyze and
visualize genomic data, which includes
processing and storing data and using
algorithms and software to make
predictions based on available genomic
data
Facts
Data about a single human genome sequence alone would take up 200 gigabytes
An estimated 40 exabytes will be needed to store the genome-sequence data generated worldwide by 2025
DNA to RNA to Protein, Illustrating the Genetic Code
Sequence alignment
Question about sequence
1. Biological question: “How similar are the genomes of humans and
chimpanzees?”
– Computational question: Given two sequences r and s, compute
their similarity, sim(s,r)
2. Biological question: “This gene causes obesity in mice. Do humans
have the same gene?”
– Computational question: Given a sequence r (the mouse gene)
and a database D of sequences (all human genes), find
sequences s in D where sim(r,s) is above a threshold
Question about sequence
3. Biological question: “We know some mutations of this gene cause sickle-cell anemia. We have the sequences of 100 patients and 100 normal people. Let's find the disease-causing mutations.”
– Computational question: Given two sets of sequences of different lengths, find an alignment that maximizes the overall similarity. Then look for mutations that are unique to one group.

Group      Unaligned   Aligned
Patients   ACGCGT      ACGCGT
           CGCGT       _CGCGT
           ACGCGA      ACGCGA
Control    AGCTT       A_GCTT
           ACGCTT      ACGCTT
           ACGCTA      ACGCTA

Performing alignment makes it easy to compute the similarity between two sequences.
Scoring function
Compare the similarity of two strings up to changes such as mutation, insertion, and deletion. For the string AGGCCTC:
Mutations: AGG A CTC
Insertions: AGG G CTCT
Deletions: AGG . CTC
Symbols:
Match: +m
Mismatch: -s
Gap: -d
Simple scoring function: F = (#matches) × m - (#mismatches) × s - (#gaps) × d
The total score reflects the quality of the alignment.
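The simple scoring function can be sketched as a short routine over two already-aligned strings. The weights m = 1, s = 1, d = 2 and the `_` gap symbol are hypothetical choices for illustration; the slide leaves the values of m, s, and d open.

```python
def alignment_score(a, b, m=1, s=1, d=2, gap="_"):
    """F = (#matches) * m - (#mismatches) * s - (#gaps) * d for aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            gaps += 1            # a column with a gap on either side
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d

mutation = alignment_score("AGGCCTC", "AGGACTC")   # 6 matches, 1 mismatch -> 5
deletion = alignment_score("AGGCCTC", "AGG_CTC")   # 6 matches, 1 gap -> 4
```

With these weights a gap costs more than a mismatch, so a single substitution scores higher than a single deletion. Other weights would rank the two changes differently.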
Standard of alignment
The highest score?
Problems
Solutions
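One standard way to find the highest-scoring alignment is the Needleman-Wunsch dynamic program for global alignment. The slides do not name it, so it is brought in here as an illustration of the idea. The sketch returns only the best score, uses the same hypothetical weights (m = 1, s = 1, d = 2) as a simple linear gap penalty, and runs in O(len(a) × len(b)) time.

```python
def best_global_score(a, b, m=1, s=1, d=2):
    """Highest score over all global alignments of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = -d * i
    for j in range(1, cols):
        score[0][j] = -d * j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (m if a[i - 1] == b[j - 1] else -s)
            up = score[i - 1][j] - d      # gap in b
            left = score[i][j - 1] - d    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

# Two of the patient sequences from the earlier example.
best = best_global_score("ACGCGT", "CGCGT")   # _CGCGT: 5 matches, 1 gap -> 3
```

The table has one cell per pair of prefixes, so aligning two long genomes this way is quadratic in their length. That cost is what makes alignment one of the seven computational giants.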
Thank you for your time 😊
Computational Giants_nhom.pptx
Computational Giants_nhom.pptx
Computational Giants_nhom.pptx

More Related Content

Similar to Computational Giants_nhom.pptx

SMART International Symposium for Next Generation Infrastructure: The roles o...
SMART International Symposium for Next Generation Infrastructure: The roles o...SMART International Symposium for Next Generation Infrastructure: The roles o...
SMART International Symposium for Next Generation Infrastructure: The roles o...SMART Infrastructure Facility
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Maninda Edirisooriya
 
cs 601 - lecture 1.pptx
cs 601 - lecture 1.pptxcs 601 - lecture 1.pptx
cs 601 - lecture 1.pptxGopalPatidar13
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programmingSoumya Mukherjee
 
DimensionalityReduction.pptx
DimensionalityReduction.pptxDimensionalityReduction.pptx
DimensionalityReduction.pptx36rajneekant
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsStavros Kontopoulos
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...Waqas Nawaz
 
Introduction_to_exament_Methods.pdf
Introduction_to_exament_Methods.pdfIntroduction_to_exament_Methods.pdf
Introduction_to_exament_Methods.pdfMustafaELALAMI
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balanceAlex Henderson
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
Data Science and Machine Learning with Tensorflow
 Data Science and Machine Learning with Tensorflow Data Science and Machine Learning with Tensorflow
Data Science and Machine Learning with TensorflowShubham Sharma
 

Similar to Computational Giants_nhom.pptx (20)

Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
 
SMART International Symposium for Next Generation Infrastructure: The roles o...
SMART International Symposium for Next Generation Infrastructure: The roles o...SMART International Symposium for Next Generation Infrastructure: The roles o...
SMART International Symposium for Next Generation Infrastructure: The roles o...
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
 
cs 601 - lecture 1.pptx
cs 601 - lecture 1.pptxcs 601 - lecture 1.pptx
cs 601 - lecture 1.pptx
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programming
 
DimensionalityReduction.pptx
DimensionalityReduction.pptxDimensionalityReduction.pptx
DimensionalityReduction.pptx
 
RBF2.ppt
RBF2.pptRBF2.ppt
RBF2.ppt
 
KNN
KNNKNN
KNN
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
 
TruongNguyen_CV
TruongNguyen_CVTruongNguyen_CV
TruongNguyen_CV
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
 
Data mining 2004
Data mining 2004Data mining 2004
Data mining 2004
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
 
Introduction_to_exament_Methods.pdf
Introduction_to_exament_Methods.pdfIntroduction_to_exament_Methods.pdf
Introduction_to_exament_Methods.pdf
 
Optimization Using Evolutionary Computing Techniques
Optimization Using Evolutionary Computing Techniques Optimization Using Evolutionary Computing Techniques
Optimization Using Evolutionary Computing Techniques
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balance
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Data Science and Machine Learning with Tensorflow
 Data Science and Machine Learning with Tensorflow Data Science and Machine Learning with Tensorflow
Data Science and Machine Learning with Tensorflow
 

Recently uploaded

Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

Computational Giants_nhom.pptx

  • 1. 7 Computational Giants of Massive Data Analysis Instructor: Assoc. Prof. PhD. Nguyễn Thanh Bình Master students: Đoàn Đức Thế Anh Võ Nam Thục Đoan Nguyễn Ngọc Bảo Trân Trần Trung Hiếu  22C01001  22C01004  22C01021  22C01009 CHAPTER 10
  • 2. Massive data analysis cannot be processed using a stand-alone computer use of existing (distributed and parallel) hardware platforms challenges to traditional statistical methods and algorithms overall system architecture
  • 3. Tasks of machine learning / data mining •orthogonal range-search, nearest-neighbor O(N) •all-nearest-neighbors O(N2) Querying •mixture of Gaussians, kernel density estimation O(N2) •kernel conditional density estimation O(N3) 1.Density estimation •decision tree, nearest-neighbor classifier O(N2) •support vector machine O(N3) Classification •linear regression, LASSO, kernel regression O(N2) •Gaussian process regression O(N3) Regression • PCA, non-negative matrix factorization, kernel PCA O(N3) • maximum variance unfolding O(N3) Dimension reduction • k-means, mean-shift O(N2) • hierarchical (FoF) clustering O(N3) Clustering • MST O(N3) • bipartite cross-matching O(N3) • n-point correlation 2-sample testing O(Nn) Testing and matching
  • 4. The “7 Computational Giants” of Data (computational problem types) Basic statistics Generalized N-body problem Graph-theoretic computations Linear-algebraic computations Optimization Integration Alignment problems 1 2 3 4 5 6 7
  • 5. Basic statistics • Descriptive statistics: summarize the data and provide insights into its – central tendency: mean, median, mode – variability of a data set: variance, standard deviation, count, min max, quartiles, skewness and kurtosis – frequency distribution N data points  O(N) calculations
  • 6. Basic statistics • Inferential statistics : – generalize results to larger populations based on small samples – looking at how things change over time – use sampling methods to find samples that are representative of the whole population – determine what is happening N data points  O(N2) calculations
  • 7. Why is statistical computing important in research and decision-making? Evidence-based analysis Explore relationships between variables Evaluating the effectiveness of interventions Contributing to improved outcomes A vital role in fields: healthcare, finance, marketing, and social sciences
  • 8. Basic statistics - Challenges High dimensionality High dimensionality + large sample size Big2 Data: from multiple sources, at different time points, using different technologies • noise accumulation • spurious correlations • Incidental homogeneity • heavy computational cost • algorithmic instability • heterogeneity • experimental variations • statistical biases false scientific conclusions wrong statistical inference statistical biases
  • 9. Basic statistics - Solutions New statistical thinking New computational methods Solutions variable selection dimension reduction new regularization methods independence screening the development of new computational infrastructure and data storage methods
  • 10. Generalized N-body problem • The 17th century, Sir Isaac Newton formulated: – The laws of motion – The law of universal gravitation  the behavior of objects and their interactions  Origin of the N-body problem: predicting the motions of N celestial objects interacting with each other gravitationally • Karl Fritiof Sundman: solved for n = 3 • L. K. Babadzanjanz and Qiudong Wang: generalized to n > 3
  • 11. N-body problem • Three bodies with equal mass [published 2000] • Three bodies of unequal mass • Two pairs of bodies orbiting about each other • An orbit discovered in 2008 by Tiancheng Ouyang, Duokui Yan, and Skyler Simmons at BYU
  • 12. Generalized N-body problem - Challenges • Numerical approximations • Chaotic behavior • Interdisciplinary nature • Main obstacle: O(N2)
  • 13. Generalized N-body problem - Solutions • Barnes-Hut Algorithm [Barnes and Hut, 87]: if  r s  s r   i R R i x K N x x K ) , ( ) , (  O(N log N) N(N-1)/2 = O(N2)
  • 14. Generalized N-body problem - Solutions • Fast Multipole Method [Greengard and Rokhlin 1987]: computes $\sum_i K(x, x_i)$ for all $x$ in O(N), using a multipole/Taylor expansion of order p, instead of N(N-1)/2 = O(N^2) for direct summation • [Callahan-Kosaraju 95]: O(N) is impossible for a log-depth tree • (Figure: quadtree)
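The idea behind the Barnes-Hut approximation can be illustrated with a minimal Python sketch. It is not the full tree algorithm: it only compares the exact kernel sum over one cluster with the single-term approximation N_R K(x, mu_R). The kernel K(x, y) = 1/|x - y|, the opening criterion r > s/theta, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def kernel(x, y):
    # Assumed kernel for illustration: K(x, y) = 1 / |x - y|
    return 1.0 / np.linalg.norm(x - y, axis=-1)

def direct_sum(x, points):
    # Exact contribution of every point in the cluster: O(N_R) work
    return kernel(x, points).sum()

def monopole_approx(x, points):
    # Replace the whole cluster by N_R copies of its centroid: O(1) work
    mu = points.mean(axis=0)
    return len(points) * kernel(x, mu)

def well_separated(x, points, theta=0.5):
    # Assumed opening criterion: cluster width s, distance r, accept if r > s / theta
    mu = points.mean(axis=0)
    s = (points.max(axis=0) - points.min(axis=0)).max()
    r = np.linalg.norm(x - mu)
    return r > s / theta

rng = np.random.default_rng(0)
cluster = rng.uniform(0.0, 1.0, size=(1000, 2))   # points inside the unit square
x = np.array([10.0, 10.0])                        # query point far from the cluster

if well_separated(x, cluster):
    exact = direct_sum(x, cluster)
    approx = monopole_approx(x, cluster)
    print("relative error:", abs(exact - approx) / exact)
```

In a real implementation the criterion is applied recursively to the nodes of a quadtree or octree, so that nearby nodes are opened and distant ones are summarized.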
  • 15. Linear Algebraic computations • Problems involve matrix operations, solving linear systems, finding eigenvalues and eigenvectors, inverses, orthogonality, ... • Examples: linear regression, SVD, PCA, clustering, graph analysis, image processing (edge detection, compression, blurring, ...) • (Figures: linear regression, SVD, PCA, clustering, kernel for edge detection)
  • 16. Linear Algebraic computations - Challenges - Matrices with slowly decaying spectra → high computational complexity, sensitive to noise. - Nearly singular matrices, det(M) ≈ 0 → nearly non-invertible, sensitive to small changes in the matrix entries. → Some solution approaches: - Truncated SVD, regularization, pseudoinverse using SVD - Random sampling + statistical methods. E.g.: choose a random submatrix of the given matrix, based on suitable probability distributions, to approximate the SVD of the whole.
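The truncated-SVD pseudoinverse mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming a relative cutoff `rel_tol` on the singular values (the threshold and the function name are hypothetical choices, not taken from the slides).

```python
import numpy as np

def truncated_pinv(M, rel_tol=1e-6):
    # SVD: M = U diag(s) Vt, with singular values s sorted in decreasing order
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only singular values that are not negligible relative to the largest one
    keep = s > rel_tol * s[0]
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    # Pseudoinverse: V diag(s_inv) U^T, with the tiny directions zeroed out
    return (Vt.T * s_inv) @ U.T

# A nearly singular matrix: det(M) is about 5e-13
M = np.diag([1.0, 0.5, 1e-12])
print(truncated_pinv(M))
```

A plain inverse of `M` would contain an entry of size 1e12, so any noise in that direction is amplified enormously; truncation discards that direction instead.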
  • 17. Linear Algebraic computations - Challenges • Other challenges: - Optimization problems: generic LA approaches yield high training accuracy, which can cause overfitting → gradient descent, random sampling - The data grow so massive that they cannot be stored or handled by a single device → distributed linear algebra • (Figures: gradient descent; matrices are checkerboard-distributed on TPUs during multiplication)
  • 18. Optimization • Optimization problems appear in statistical methods early on and frequently. E.g.: semidefinite programming in manifold learning. → Optimization generally focuses on minimizing or maximizing an objective function. • Linear programming • Quadratic programming • From unconstrained to constrained, both convex and non-convex
  • 19. Optimization - Challenges - A large number of variables and constraints - Finding a global solution for non-convex problems is an open problem - Problems with integer constraints (integer programming) - Challenging problems, such as high-dimensional nonlinear objective problems, may contain multiple local optima in which deterministic optimization algorithms may get stuck
  • 20. Optimization • Some approaches: - Exploit the particular mathematical form of certain problems to find more effective optimizers. E.g.: Sequential Minimal Optimization decomposes SVM training into sub-problems by iteratively selecting 2 Lagrange multipliers to solve for - Stochastic optimization (introduce randomness) + online learning. E.g.: Stochastic Gradient Descent iteratively updates the parameters with a random subset of the data instead of the entire data set • (Figure: online learning)
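The stochastic gradient descent idea on this slide can be sketched in Python as follows. This is a minimal illustration on a least-squares problem, assuming synthetic noise-free data; the function name, learning rate, and batch size are hypothetical choices.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=32, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        # Draw a random mini-batch instead of using all n rows
        idx = rng.integers(0, n, size=batch_size)
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error (with a factor 1/2) on the mini-batch
        grad = Xb.T @ (Xb @ w - yb) / batch_size
        w -= lr * grad
    return w

# Synthetic data, assumed for illustration: y = X w_true, no noise
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

w_hat = sgd_linear_regression(X, y)
print(w_hat)  # should be close to w_true
```

Each step touches only `batch_size` rows, so the cost per update does not depend on N; the same loop also works in an online setting where the batches arrive as a stream.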
  • 21. Optimization • Some approaches: - Distributed optimization: distribute the optimization process a) across processors b) across multiple nodes. E.g.: TensorFlow, PyTorch
  • 22. Graph-Theoretic Computations • Graph-theoretic computations involve traversing graphs, which can be the data itself or represent statistical models. • Common statistical computations on graphs include betweenness centrality and commute distances, used to identify nodes or communities of interest. • Large-scale, sparse graphs present computational challenges for these computations.
  • 23. Challenges and Approaches • Challenges: high interconnectivity in graphs, large maximal clique size, and memory constraints. • Notable approaches: – Sampling and disk-based methods for handling large graphs. – Parallel/distributed approaches using sparse linear algebra or graph concepts. – Graph partitioning and linear algebraic reconditioning for efficient computations. – Transformation of graphical model inference problems into optimization or variational methods. – Sampling and parallel/distributed approaches for graphical model inference.
  • 24. Additional Applications: • Manifold learning methods: Isomap requires an all-pairs-shortest-paths computation. • Single-linkage hierarchical clustering: equivalent to computing a minimum spanning tree. • These examples highlight the intersection between graph computations and distance-based or N-body-type problems.
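The equivalence between single-linkage clustering and a minimum spanning tree can be sketched as follows: running Kruskal's MST algorithm and stopping when k components remain gives the single-linkage clustering into k groups. This is a minimal illustration that enumerates all pairs, so it costs O(N^2 log N); the function name and the toy points are assumptions.

```python
from itertools import combinations
from math import dist

def single_linkage(points, k):
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, sorted: the edge order used by Kruskal's algorithm
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    clusters = n
    for _, i, j in edges:
        if clusters == k:
            break  # the remaining MST edges are the ones single linkage would cut
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # accept this MST edge and merge the two clusters
            clusters -= 1

    # Relabel roots as 0, 1, 2, ... in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
print(single_linkage(points, k=2))
```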
  • 25. Integration in Data Analysis • Integration is a key computation in data analysis, essential for Bayesian inference and statistical modeling. • Challenges arise with high-dimensional integrals, requiring specialized approaches.
  • 26. Approaches to High-Dimensional Integration 1. Markov Chain Monte Carlo (MCMC) – Default approach for high-dimensional integration. – Utilizes a sequence of random samples to approximate the integral. – Widely used in Bayesian inference and random effects models. 2. Approximate Bayesian Computation (ABC) Methods – Operate on summary data to accelerate computation. – Useful for cases where exact inference is challenging. – Achieves acceleration by working with population means or variances.
  • 27. Alternative Approaches and Strategies 1. Population Monte Carlo – Form of adaptive importance sampling. – Enhances the efficiency of Monte Carlo integration. – Particularly useful for certain sequential models, such as particle filtering. 2. Variational Methods – Convert integration problems into optimization problems. – Provide a general framework for approximate inference. – Offer an alternative strategy to address high-dimensional integration challenges. 3. Optimization-Based Point Estimation – Skirts the full integration problem. – Used in approaches like maximum a posteriori inference and empirical Bayesian inference. – Involves optimizing point estimates rather than performing full Bayesian inference.
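As an illustration of how MCMC turns an integral into an average over random samples, here is a minimal random-walk Metropolis sketch in Python. It is one-dimensional for readability; the target (a standard normal), the step size, and the function name are assumptions chosen so that the true answer, E[x^2] = 1, is known.

```python
import numpy as np

def metropolis(log_density, x0, n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    logp = log_density(x)
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Symmetric random-walk proposal around the current state
        proposal = x + step * rng.normal()
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x))
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = proposal, logp_new
        samples[t] = x
    return samples

# Unnormalized log-density of a standard normal: the normalizing constant is not needed
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=50_000)

# The integral E[x^2] is approximated by a sample average
print(np.mean(samples ** 2))
```

Only ratios of the density are used, which is why the approach is convenient for Bayesian posteriors whose normalizing constant is itself an intractable integral.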
  • 29. Genomic data science • Genomic data science emerged as a field in the 1990s to bring together two laboratory activities: – Experimentation: generating genomic information from studying the genomes of living organisms – Data analysis: using statistical and computational tools to analyze and visualize genomic data, which includes processing and storing data and using algorithms and software to make predictions based on available genomic data • Facts: – Data about a single human genome sequence alone would take up 200 gigabytes – An estimated 40 exabytes will be needed to store the genome-sequence data generated worldwide by 2025
  • 30. DNA to RNA to Protein, Illustrating the Genetic Code
  • 35. Questions about sequences 1. Biological question: “How similar are the genomes of humans and chimpanzees?” – Computational question: Given two sequences r and s, compute their similarity, sim(r,s) 2. Biological question: “This gene causes obesity in mice. Do humans have the same gene?” – Computational question: Given a sequence r (the mouse gene) and a database D of sequences (all human genes), find sequences s in D where sim(r,s) is above a threshold
  • 36. Questions about sequences 3. Biological question: “We know some mutations of this gene cause sickle-cell anemia. We have the sequences of 100 patients and 100 normal people. Let us find the disease-causing mutations.” – Computational question: Given two sets of sequences of different lengths, find an alignment that maximizes the overall similarity. Then look for mutations that are unique to one group. • Patients: ACGCGT, CGCGT, ACGCGA → aligned: ACGCGT, _CGCGT, ACGCGA • Control: AGCTT, ACGCTT, ACGCTA → aligned: A_GCTT, ACGCTT, ACGCTA • Performing alignment makes it easy to compute the similarity between two sequences.
  • 37. Scoring function • Compares the similarity of two strings up to changes such as mutation, insertion, and deletion. • For string AGGCCTC: – Mutations: AGG A CTC – Insertions: AGG G CTCT – Deletions: AGG . CTC • Symbols: Match: +m, Mismatch: -s, Gap: -d • Simple scoring function: F = (#matches) × m - (#mismatches) × s - (#gaps) × d • The total score reflects the quality of the alignment
  • 38. Standard of alignment: the highest score?
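The scoring function F and the search for the highest-scoring alignment can be sketched in Python as follows. The first function evaluates F for an alignment that is already given; the second uses the Needleman–Wunsch dynamic program (the algorithm referenced in the Editor's Notes) to find the best achievable score. The default values m = s = d = 1, the `_` gap symbol, and the function names are assumptions for illustration.

```python
def alignment_score(a, b, m=1, s=1, d=1, gap="_"):
    # F = (#matches) * m - (#mismatches) * s - (#gaps) * d for two aligned strings
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for ca, cb in zip(a, b):
        if ca == gap or cb == gap:
            score -= d
        elif ca == cb:
            score += m
        else:
            score -= s
    return score

def needleman_wunsch_score(a, b, m=1, s=1, d=1):
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [-d * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-d * i]  # a[:i] aligned against an empty prefix: i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (m if a[i - 1] == b[j - 1] else -s)
            up = prev[j] - d        # gap inserted in b
            left = curr[j - 1] - d  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

# Patient sequences from slide 36: five matches and one gap
print(alignment_score("ACGCGT", "_CGCGT"))
print(needleman_wunsch_score("ACGCGT", "CGCGT"))
```

The dynamic program fills a table of size len(a) × len(b), so the cost is quadratic in the sequence length; only two rows are kept here because only the score, not the alignment itself, is returned.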
  • 41. Thank you for your time 😊

Editor's Notes

  1. Massive data refers to a large amount of data that is too difficult to process using traditional tools like spreadsheets or text processors. It can exist in structured or unstructured form and can reach petabytes or exabytes. Big data can be analyzed for insights that improve decisions and give confidence for making strategic business moves. Processing massive data, also known as big data, presents several common challenges: storage, processing speed, data quality, security, data integration, cost, and scalability.
  2. Introduce massive data -> system architecture
  3. Dimension reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step that facilitates other analyses.
  4. For a nearly singular matrix, the inverse may be highly sensitive to small changes in the matrix entries. Nearly non-invertible → iterative methods
  5. dividing the computational workload and data across multiple processing units
  6. Linear programming: determine the best outcome in a linear mathematical model, given a set of linear constraints. LA computations are a special case (2nd-order optimization). Quadratic programming: quadratic objective function and linear constraints. 2nd-order cone programming: linear objective, with linear constraints that include a 2nd-order cone. Semidefinite programming deals with the optimization of linear objective functions subject to linear matrix inequality constraints. It generalizes linear programming to handle optimization problems involving positive semidefinite matrices. Manifold learning: learn the structure in high-dimensional data and represent it in fewer dimensions.
  7. Optimization problems are expressed as mathematical models. Training an SVM requires solving a very large QP, which is time-consuming. A stochastic program is an optimization problem in which some or all problem parameters are uncertain, but follow known probability distributions. This framework contrasts with deterministic optimization, in which all problem parameters are assumed to be known exactly.
  8. Optimization problems are expressed as mathematical models. Training an SVM requires solving a very large QP, which is time-consuming. SMO exploits the particular structure of the SVM quadratic optimization problem by iteratively selecting two Lagrange multipliers and solving a sub-problem to update them. The objective function aims to maximize the margin between the decision boundary and the support vectors while minimizing the classification errors. The Lagrange multipliers (α values) are the variables to be optimized. The constraints ensure that the sum of the Lagrange multipliers weighted by the corresponding target variables is zero and that the Lagrange multipliers are within a specified range (0 ≤ α[i] ≤ C). The GD algorithm in deep learning. Online learning receives a sequence of data points one at a time and updates its model iteratively. Stochastic optimization: the use of randomness in the objective function or in the optimization algorithm.
  10. To compare the similarity between two strings under changes such as mutation, insertion, or deletion. Example string: AGGCCTC. Interactive demo for Needleman–Wunsch algorithm (mostafa.io)
  11. Criterion for evaluating an alignment
  12. To solve the problem and achieve computational efficiency, the following directions can be pursued: sampling, parallel/distributed computing, algorithms
  13. Interactive demo for Needleman–Wunsch algorithm (mostafa.io)