Making BIG DATA smaller
Tony Tran
About Me 
● I am from SF 
○ SF Bay Area Machine Learning Meetup 
○ www.sfdatajobs.com 
● Background: 
○ BS/MS CompSci (focus on ML/Vision) 
○ 4 years as Data Engineer in Ad Tech 
○ Currently Consulting
What does “data” mean?
● RAW DATA (structured & unstructured)
○ Clean & Transform
○ Extract Features
● DATA: observations and features
● DATA (matrix): m observations (rows) x D features (columns)
What is “Big Data?” 
Big data is an all-encompassing term for any collection of data sets so large 
and complex that it becomes difficult to process them using traditional data 
processing applications -- Wikipedia
To me, Big Data is when:
● you run out of disk space
● you run into “Out Of Memory” errors
● your S3 bill triggers your credit card company to call you
● you are willing to go through the pain of setting up a Hadoop/Spark/etc.
cluster (have you tried configuring your own cluster?)
Question 
● Can we make big data smaller? 
● What are the benefits of smaller data?
● Benefits of having “small data”: 
○ Reduce storage costs 
○ Reduce computational costs 
○ No more “Out Of Memory” Errors
Ideas for making data small 
● Reduce the number of observations 
○ Only keep observations that are “important” 
○ Remove redundant observations 
○ Randomly sample 
● Reduce the number of features 
○ removing non-useful features 
○ combining features 
○ something clever
Random sampling of observations
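Random sampling can be sketched in a few lines of NumPy. The data matrix, its size, and the sample size `keep` below are stand-ins chosen for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
X = rng.normal(size=(10_000, 20))       # stand-in data: m = 10,000 observations, D = 20 features

keep = 1_000                            # hypothetical sample size
idx = rng.choice(X.shape[0], size=keep, replace=False)  # row indices, sampled without replacement
X_sample = X[idx]

print(X_sample.shape)                   # (1000, 20)
```

Sampling without replacement keeps every retained observation distinct, so the sample has exactly `keep` different rows.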
Our focus: reducing features
[Figure: dimensionality reduction maps the (m x D) data matrix to an (m x d) matrix]
Note: d << D
Note: we want to preserve the distances between observations as
best as possible.
Exercise 
Given the following set of 2d observations, how can we 
represent each observation in 1d while still preserving the 
distances between points as best as possible?
Exercise 
Projecting observations onto x-axis
Exercise 
Projecting observations onto y-axis
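The two projections in the exercise can be compared numerically. This is a minimal sketch on made-up 2D points that spread widely along x and narrowly along y; the point cloud and the error measure (mean absolute change in pairwise distance) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in 2D observations: wide spread along x, narrow spread along y.
pts = np.column_stack([rng.normal(0, 5.0, 200), rng.normal(0, 0.5, 200)])

def pairwise_dists(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

orig = pairwise_dists(pts)
on_x = pairwise_dists(pts[:, [0]])   # keep only the x coordinate
on_y = pairwise_dists(pts[:, [1]])   # keep only the y coordinate

# Mean absolute change in pairwise distance caused by each projection.
err_x = np.abs(orig - on_x).mean()
err_y = np.abs(orig - on_y).mean()
print(f"x-axis error: {err_x:.3f}, y-axis error: {err_y:.3f}")
```

For data shaped like this, dropping the y coordinate changes the distances far less than dropping the x coordinate, which is the point of the exercise.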
Will it still work if ... 
Are we getting good 
results because the x-axis 
is aligned with the spread 
of the observations?
Unaligned data 
Can we find a better 
coordinate system 
that is aligned with 
the data?
Aligned coordinate system 
Direction which aligns with 
the spread of observations 
● dimensionality reduction easy 
with aligned coordinate 
system
Computing Spread 
● Spread = variance of projected observations 
● How to determine observations in new coordinate 
system?
Linear Algebra
● Given directions v1 and v2 and an observation p:
○ a1 = (v1 · p) / ||v1||
○ a2 = (v2 · p) / ||v2||
● p = (p1, p2) originally
● p = (a1, a2) in new coordinate system
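The change of coordinates and the resulting spread can be checked with a short sketch. The observations here are synthetic, stretched along the diagonal direction (1, 1); the directions `v1` and `v2` are picked by hand for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in 2D observations stretched along the diagonal direction (1, 1).
t = rng.normal(0, 3.0, 500)          # position along the diagonal
noise = rng.normal(0, 0.3, 500)      # small offset perpendicular to it
pts = np.column_stack([t - noise, t + noise]) / np.sqrt(2)

v1 = np.array([1.0, 1.0])    # direction along the spread (deliberately not unit length)
v2 = np.array([-1.0, 1.0])   # perpendicular direction

a1 = pts @ v1 / np.linalg.norm(v1)   # a1 = (v1 . p) / ||v1|| for every observation p
a2 = pts @ v2 / np.linalg.norm(v2)   # a2 = (v2 . p) / ||v2||

print("spread along v1:", a1.var())
print("spread along v2:", a2.var())
```

Because v1 and v2 are perpendicular, (a1, a2) describes each point without changing its length, and almost all of the variance ends up in a1.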
Observations 
● Finding an aligned coordinate system makes 
it easy for us to do dimensionality reduction. 
○ represent observations in new coordinate system 
then remove features (axes). 
● The direction parallel to spread of data 
maximizes interpoint distances.
New tool (PCA) 
● How do we find aligned coordinate system? 
Is there a tool already developed for this?
● Principal Component Analysis 
○ Given set of observations, finds an aligned 
coordinate system. 
○ First direction of coordinate system will contain the 
most spread, followed by the second, so forth. 
○ O(m^3) runtime
PCA (scikit-learn)
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>>
>>> X = np.random.rand(200, 50)         # stand-in for your data matrix of size (m x D)
>>> d = 5                               # number of dimensions to keep, d << D
>>> pca = PCA(n_components=d).fit(X)    # fits new coordinate system to data
>>> X_small = pca.transform(X)          # transforms data to new coordinate system and removes dimensions
>>> X_small.shape                       # gives us matrix of size (m x d)
(200, 5)
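To see roughly what happens behind a PCA call, the aligned coordinate system can be computed by hand with a singular value decomposition. This is a sketch under assumptions, not scikit-learn's implementation: `pca_project` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the d directions of largest variance."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center each feature
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:d]                             # shape (d, D)
    return Xc @ directions.T, directions, mean      # coordinates have shape (m, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)) * np.linspace(5.0, 0.1, 10)  # stand-in data with unequal spread
Z, directions, mean = pca_project(X, d=3)

print(Z.shape)            # (300, 3)
print(Z.var(axis=0))      # variances, largest first
```

The first column of `Z` carries the most spread, the second the next most, and so on, which matches the ordering described for PCA.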
3D to 2D
[Figure: 3D data with directions v1, v2, v3; the projected data space keeps only v1 and v2]
Image Data 
● What if our data is images? How do we 
represent an image as an observation? 
(100x100) matrix
Images and vectors 
(100x100) matrix 
10k-dimensional 
vector
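Turning an image into an observation is a reshape. The sketch below uses random arrays as stand-in grayscale images and assumes row-major flattening, which is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((100, 100))        # stand-in grayscale image
vec = image.reshape(-1)               # row-major flattening: 10,000 values
restored = vec.reshape(100, 100)      # the inverse operation

# A collection of images becomes an (m x D) data matrix with D = 10,000.
images = rng.random((5, 100, 100))    # five stand-in images
X = images.reshape(len(images), -1)
print(vec.shape, X.shape)             # (10000,) (5, 10000)
```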
Image Data
[Figure: each (100x100) image matrix is flattened into a 10k-dimensional vector (r1, r2, ..., r10000); the new coordinate system has directions v1, v2, ..., v10k]
Image Data
● Original image = a1*v1 + a2*v2 + … + a10k*v10k
● Reconstruct with 20 directions = a1*v1 + … + a20*v20
● Reconstruct with 90 directions = a1*v1 + … + a90*v90
Compression!
Each image can now be represented by 90 weights!
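Reconstruction from the first k directions can be sketched as follows. The "images" are synthetic low-rank vectors rather than real pictures, `reconstruct` is a hypothetical helper, and the mean image is subtracted before projecting and added back afterwards (an assumption that follows the usual PCA convention).

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 60, 400                                   # stand-in: 60 "images" of 20x20 pixels
X = rng.normal(size=(m, 8)) @ rng.normal(size=(8, D)) + 0.05 * rng.normal(size=(m, D))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)   # rows of Vt are the directions v1, v2, ...

def reconstruct(x, k):
    """Approximate x using only the first k directions."""
    weights = Vt[:k] @ (x - mean)        # a1, ..., ak
    return mean + weights @ Vt[:k]       # a1*v1 + ... + ak*vk, plus the mean image

x = X[0]
errors = {k: np.linalg.norm(x - reconstruct(x, k)) for k in (2, 8, 40)}
for k, err in errors.items():
    print(f"k={k:2d}  reconstruction error={err:.4f}")
```

The error can only shrink as k grows, because each extra direction adds one more term to the sum.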
Image Data 
= a1 + …+ a90 
reconstruct with 90 directions 
● original image representation: 10k values 
● compression requires: 
○ 90 direction vectors = (90 x 10k values) 
○ 1 image = 90 weights (for the direction vectors)
● For 200 images: 
○ original representation: 200*10k 
○ compression: (90x10k) + 200*90
● It makes sense to use this compression technique when we have more
than 90 images to compress.
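The break-even point follows from the two storage formulas above. A small sketch of the arithmetic, with hypothetical helper names:

```python
def original_cost(n_images, D=10_000):
    """Values stored when every image keeps all D pixel values."""
    return n_images * D

def compressed_cost(n_images, d=90, D=10_000):
    """Values stored for d direction vectors plus d weights per image."""
    return d * D + n_images * d

for n in (50, 90, 91, 200):
    print(n, original_cost(n), compressed_cost(n))
```

With 90 images the compressed form is still larger (908,100 values against 900,000); from 91 images onward it is smaller, and at 200 images it needs 918,000 values instead of 2,000,000.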
Keep in mind
● O(m^3) runtime
● Need to keep around d directions of length D to
perform projection.
● Requires reading all of the data into memory.
● What if data is non-linear?
Random Projections
● Generate a (d x D) matrix, P, where
elements are drawn from a normal
distribution ~ N(0, 1/d) (mean 0, variance 1/d)
● To compute projected observation:
○ o_new = P.dot(o)
○ shapes: (d) = (d x D) * (D)
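A minimal NumPy sketch of this projection follows. The sizes `D` and `d` and the observation `o` are stand-ins; note that NumPy's `normal` takes a standard deviation, so a variance of 1/d means a scale of sqrt(1/d).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 2_000, 500                                   # stand-in sizes
P = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, D))  # variance 1/d, so the scale is sqrt(1/d)

o = rng.normal(size=D)                              # one stand-in observation
o_new = P.dot(o)                                    # projected observation, length d

ratio = np.linalg.norm(o_new) / np.linalg.norm(o)
print(o_new.shape)
print(ratio)                                        # close to 1 on average
```

The 1/d variance is what keeps lengths, and therefore distances, roughly unchanged after projection.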
Intuition 
● Randomly determine coordinate system.
● Keep d directions
Safe value for d? 
Using this technique, what is a “safe” value for d? 
[Figure: random projection reduces the (m x D) data matrix to an (m x d) matrix]
Note: d << D
Safe value for d?
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
johnson_lindenstrauss_min_dim(n_samples, eps=0.1):
● input:
○ n_samples -- the number of observations you have
○ eps -- the amount of error (distortion of distances) you’re willing to tolerate
● output:
○ safe number of features that you can project down to
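Mathematical Guarantees
(1 - eps) * ||u - v||^2 <= ||P u - P v||^2 <= (1 + eps) * ||u - v||^2
● ||u - v|| = original distance between observations u and v
● ||P u - P v|| = projected distance
● holds with high probability for all pairs when
d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)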
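The bound can be evaluated directly to get a feel for the numbers. `jl_min_dim` is a hand-rolled, hypothetical helper that rounds up to the next integer; it is not the scikit-learn function.

```python
import math

def jl_min_dim(m, eps=0.1):
    """Smallest integer d allowed by d >= 4 log(m) / (eps^2/2 - eps^3/3)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    return math.ceil(4 * math.log(m) / (eps ** 2 / 2 - eps ** 3 / 3))

for m in (1_000, 1_000_000):
    for eps in (0.5, 0.1):
        print(f"m={m:>9,}  eps={eps}  d >= {jl_min_dim(m, eps)}")
```

The bound depends only on the number of observations m and the tolerance eps, not on the original dimension D. It grows slowly with m but quickly as eps shrinks, so for small data sets the "safe" d can be larger than D itself.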
Practical usage 
● High probability that projection will be good, 
but there’s still a chance that it will not be! 
○ Create multiple projections and test guarantees with 
sampled observations.
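One way to act on this advice is sketched below: draw a few candidate projection matrices, measure the distortion on a sample of observation pairs, and keep the best candidate. The sizes, the tolerance `eps`, the number of candidates, and the helper `worst_distortion` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, d, eps = 200, 1_000, 600, 0.5            # stand-in sizes and tolerance
X = rng.normal(size=(m, D))                    # stand-in observations

# Sample pairs of distinct observations and record their original squared distances.
i = rng.integers(0, m, size=300)
j = (i + rng.integers(1, m, size=300)) % m     # offset by 1..m-1, so j != i
orig = ((X[i] - X[j]) ** 2).sum(axis=1)

def worst_distortion(P):
    """Largest relative change in squared distance over the sampled pairs."""
    Z = X @ P.T
    proj = ((Z[i] - Z[j]) ** 2).sum(axis=1)
    return np.abs(proj / orig - 1.0).max()

# Draw several candidate projections and keep the one with the smallest distortion.
candidates = [rng.normal(0.0, np.sqrt(1.0 / d), size=(d, D)) for _ in range(3)]
scores = [worst_distortion(P) for P in candidates]
best = candidates[int(np.argmin(scores))]
print("worst distortion per candidate:", np.round(scores, 3))
print("best candidate within eps:", min(scores) <= eps)
```

Checking sampled pairs rather than all pairs keeps the test cheap, at the cost of only estimating the worst case.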
Comparison
● PCA
○ finds aligned coordinate system which maximizes spread.
○ O(m^3) runtime + requires all points to be read into memory.
○ O(dD) space to store aligned coordinate system for projection.
● Random Projection
○ finds random coordinate system
○ O(dD) runtime and space to construct projection matrix
○ Guaranteed with high probability to preserve distances
Thank you 
Tony Tran 
tony@sfdatajobs.com 
@quicksorter
References
● http://www.cs.princeton.edu/~cdecoro/eigenfaces/
● http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
● http://scikit-learn.org/stable/modules/random_projection.html
● http://blog.yhathq.com/posts/sparse-random-projections.html

Making BIG DATA smaller

  • 2. About Me ● I am from SF ○ SF Bay Area Machine Learning Meetup ○ www.sfdatajobs.com ● Background: ○ BS/MS CompSci (focus on ML/Vision) ○ 4 years as Data Engineer in Ad Tech ○ Currently Consulting
  • 3. What does “data” mean? RAW DATA (structured & unstructured) -> Clean & Transform -> Extract Features -> DATA (observations and features). DATA as a matrix: m observations (rows) x D features (columns).
  • 4. What is “Big Data?” Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications -- Wikipedia
  • 5. What is “Big Data?” Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications -- Wikipedia To me, Big Data is when: ● you run out of disk space ● you run into “Out Of Memory” errors ● your S3 bill triggers a call from your credit card company ● you are willing to go through the pain of setting up a Hadoop/Spark/etc. cluster (have you tried configuring your own cluster?)
  • 6. Question ● Can we make big data smaller? ● What are the benefits of smaller data?
  • 7. Question ● Can we make big data smaller? ● Benefits of having “small data”: ○ Reduce storage costs ○ Reduce computational costs ○ No more “Out Of Memory” Errors
  • 8. Ideas for making data small ● Reduce the number of observations ○ Only keep observations that are “important” ○ Remove redundant observations ○ Randomly sample ● Reduce the number of features ○ removing non-useful features ○ combining features ○ something clever
  • 10. Ideas for making data small ● Reduce the number of observations ○ Only keep rows that are “important” ○ Remove redundant rows ○ Randomly sample ● Reduce the number of features ○ removing non-useful features ○ combining features ○ something clever
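Two of the observation-reducing ideas above, removing redundant rows and random sampling, can be sketched with NumPy. The toy matrix, the fixed seed, and the 50% sampling rate are assumptions chosen for illustration; they are not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) for repeatability

# Toy data matrix: 1000 observations, 3 integer features with many repeats.
X = rng.integers(0, 5, size=(1000, 3))

# Remove redundant observations: keep one copy of each distinct row.
X_unique = np.unique(X, axis=0)

# Randomly sample: keep half of the remaining rows, without replacement.
n_keep = X_unique.shape[0] // 2
keep_idx = rng.choice(X_unique.shape[0], size=n_keep, replace=False)
X_small = X_unique[keep_idx]

print(X.shape, X_unique.shape, X_small.shape)
```

Whether a random sample is "good enough" depends on what is computed downstream; the sketch only shows the mechanics.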
  • 11. Our Focus: reducing features. Dimensionality reduction: (m x D) data matrix -> (m x d) data matrix. Note: d << D. Note: we want to preserve the distances between observations as best as possible.
  • 12. Exercise Given the following set of 2d observations, how can we represent each observation in 1d while still preserving the distances between points as best as possible?
  • 17. Will it still work if ... Are we getting good results because the x-axis is aligned with the spread of the observations?
  • 18. Unaligned data Can we find a better coordinate system that is aligned with the data?
  • 19. Aligned coordinate system Direction which aligns with the spread of observations ● Dimensionality reduction is easy with an aligned coordinate system
  • 20. Computing Spread ● Spread = variance of projected observations ● How to determine observations in new coordinate system?
  • 21. Linear Algebra Given directions v1 and v2 and a point p: a1 = (v1 · p) / ||v1||, a2 = (v2 · p) / ||v2||. p = (p1, p2) originally; p = (a1, a2) in the new coordinate system.
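The two ideas above (coordinates in a new system, and spread as the variance of projected observations) can be sketched together in NumPy. The toy data, the directions `v1` and `v2`, and the helper name `coordinate` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D observations spread mostly along the 45-degree diagonal.
t = rng.normal(0.0, 3.0, size=500)       # large spread along the diagonal
s = rng.normal(0.0, 0.3, size=500)       # small spread across it
X = np.column_stack([t - s, t + s]) / np.sqrt(2)

# Candidate coordinate system (deliberately not unit length).
v1 = np.array([2.0, 2.0])    # along the diagonal
v2 = np.array([-1.0, 1.0])   # perpendicular to it

def coordinate(points, v):
    """a = (v . p) / ||v|| for every row p of points."""
    return points @ v / np.linalg.norm(v)

a1 = coordinate(X, v1)
a2 = coordinate(X, v2)

# Spread = variance of the projected observations.
print("spread along v1:", a1.var())
print("spread along v2:", a2.var())
```

Because `v1` and `v2` are perpendicular, `(a1, a2)` describes the same point as `(p1, p2)`; dropping `a2` keeps the direction with the most spread.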
  • 22. Observations ● Finding an aligned coordinate system makes it easy for us to do dimensionality reduction. ○ represent observations in new coordinate system then remove features (axes). ● The direction parallel to spread of data maximizes interpoint distances.
  • 23. New tool (PCA) ● How do we find an aligned coordinate system? Is there a tool already developed for this?
  • 24. New tool (PCA) ● How do we find an aligned coordinate system? ● Principal Component Analysis ○ Given a set of observations, finds an aligned coordinate system. ○ The first direction of the coordinate system contains the most spread, followed by the second, and so forth. ○ O(m^3) runtime
  • 25. PCA (scikit-learn)
      >>> X = …  # data matrix of size (m x D)
      >>> from sklearn.decomposition import PCA
      >>> pca = PCA(n_components=d)
      >>> pca.fit(X)        # fits new coordinate system to data
      >>> pca.transform(X)  # transforms data to new coordinate system and removes dimensions
      ...                   # gives us a matrix of size (m x d)
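A runnable version of the PCA slide, as a minimal sketch on synthetic data. The sizes `m`, `D`, `d` and the way the data is generated are assumptions for illustration; only `PCA`, `fit_transform`, and `explained_variance_ratio_` are scikit-learn names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
m, D, d = 200, 10, 2                      # assumed sizes for the sketch

# Synthetic data: 2 strong underlying directions plus a little noise.
latent = rng.normal(size=(m, d)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(d, D))
X = latent @ mixing + 0.1 * rng.normal(size=(m, D))

pca = PCA(n_components=d)
X_small = pca.fit_transform(X)            # fit + transform in one call

print(X_small.shape)                      # (m, d)
print(pca.explained_variance_ratio_)      # share of spread per direction
```

The explained variance ratios come out in decreasing order, which matches the "first direction contains the most spread" property.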
  • 26. Our Focus: reducing features. Dimensionality reduction: (m x D) data matrix -> (m x d) data matrix. Note: d << D. Note: we want to preserve the distances between observations as best as possible.
  • 27. 3D to 2D Of the directions v1, v2, v3, keep only v1 and v2 to get the projected data space.
  • 28. Image Data ● What if our data is images? How do we represent an image as an observation? (100x100) matrix
  • 29. Images and vectors (100x100) matrix -> 10k-dimensional vector
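Turning an image into an observation is a reshape. The sketch below uses random pixels as a stand-in for real images (an assumption) and shows a single image becoming a 10k-dimensional vector and a stack of images becoming an (m x D) data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a 100x100 grayscale image (random pixels, an assumption).
image = rng.random((100, 100))

# Flatten row by row into a single 10,000-dimensional observation.
vector = image.reshape(-1)

# A stack of images becomes an (m x D) data matrix with D = 10,000.
images = rng.random((5, 100, 100))
X = images.reshape(len(images), -1)

# The flattening is lossless: reshape recovers the original image.
restored = vector.reshape(100, 100)
print(vector.shape, X.shape, np.array_equal(restored, image))
```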
  • 30. Image Data [Figure: each (100x100) matrix becomes a 10k-dimensional vector, drawn against axes r1, r2, ..., r10000]
  • 31. Image Data [Figure: each (100x100) matrix becomes a 10k-dimensional vector, drawn against axes r1, r2, ..., r10000]
  • 32. Image Data [Figure: each (100x100) matrix becomes a 10k-dimensional vector; axes r1, r2, ..., r10000 and directions v1, v2, ..., v10k]
  • 33. Image Data [Figure: directions v1, v2, ..., v10k shown alongside axes r1, r2, ..., r10000]
  • 35. Image Data Original Image = a1*v1 + a2*v2 + … + a10k*v10k (each weight a_i multiplies a direction image v_i)
  • 36. Image Data Original Image = a1*v1 + a2*v2 + … + a10k*v10k. Reconstruct with 20 directions: Image ≈ a1*v1 + … + a20*v20
  • 37. Image Data Original Image = a1*v1 + a2*v2 + … + a10k*v10k. Reconstruct with 20 directions: Image ≈ a1*v1 + … + a20*v20. Reconstruct with 90 directions: Image ≈ a1*v1 + … + a90*v90
  • 38. Image Data Original Image = a1*v1 + a2*v2 + … + a10k*v10k. Reconstruct with 90 directions: Image ≈ a1*v1 + … + a90*v90. Compression! Each image can now be represented by 90 weights!
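Reconstruction from a limited number of directions can be sketched as follows. This uses NumPy's SVD in place of scikit-learn's PCA, and synthetic 20x20 "images" built from 30 hidden patterns; both choices, and the helper name `reconstruct`, are assumptions made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
m, side = 200, 20                 # assumed: 200 images of 20x20 pixels
D = side * side

# Synthetic "images": mixtures of 30 hidden patterns plus noise.
patterns = rng.normal(size=(30, D))
X = rng.normal(size=(m, 30)) @ patterns + 0.05 * rng.normal(size=(m, D))

mean = X.mean(axis=0)
# Rows of Vt are the directions, ordered by decreasing spread.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def reconstruct(x, k):
    """Approximate x using only the first k directions."""
    V = Vt[:k]                    # (k x D) direction vectors
    a = V @ (x - mean)            # k weights a1..ak
    return mean + a @ V           # weighted sum of directions

x = X[0]
for k in (5, 20, 90):
    err = np.linalg.norm(x - reconstruct(x, k))
    print(k, "directions -> reconstruction error", round(err, 3))
```

The error can only shrink as more directions are kept, which is the effect the 20-direction and 90-direction reconstructions illustrate.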
  • 39. Image Data Reconstruct with 90 directions: Image ≈ a1*v1 + … + a90*v90 ● original image representation: 10k values ● compression requires: ○ 90 direction vectors = (90 x 10k) values ○ 1 image = 90 weights (for the direction vectors)
  • 40. Image Data Reconstruct with 90 directions: Image ≈ a1*v1 + … + a90*v90 ● For 200 images: ○ original representation: 200*10k values ○ compression: (90 x 10k) + 200*90 values
  • 41. Image Data Reconstruct with 90 directions: Image ≈ a1*v1 + … + a90*v90 ● For 200 images: ○ original representation: 200*10k values ○ compression: (90 x 10k) + 200*90 values ● It makes sense to use this compression technique when we have more than 90 images to compress.
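The storage arithmetic above can be checked directly. The function names are hypothetical; the numbers (10k values per image, 90 directions) come from the slides.

```python
def original_size(n_images, D=10_000):
    """Values needed to store n_images raw images of D pixels each."""
    return n_images * D

def compressed_size(n_images, d=90, D=10_000):
    """d shared direction vectors of length D, plus d weights per image."""
    return d * D + n_images * d

for n in (50, 90, 91, 200):
    print(n, original_size(n), compressed_size(n))

# Smallest number of images for which compression saves space.
break_even = next(n for n in range(1, 1_000)
                  if compressed_size(n) < original_size(n))
print("compression pays off from", break_even, "images")
```

For 200 images this is 2,000,000 values uncompressed against 918,000 compressed, and the break-even point lands just above 90 images, as the slide states.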
  • 42. Keep in mind ● O(m^3) runtime ● Need to keep around d directions of length D to perform projection. ● Requires reading all of the data into memory. ● What if data is non-linear?
  • 43. Random Projections ● Generate a (d x D) matrix, P, whose elements are drawn from a normal distribution N(0, 1/d) ● To compute the projected observation: ○ o_new = P.dot(o), where o has length D, P is (d x D), and o_new has length d
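A minimal NumPy sketch of the random projection described above. The sizes `D` and `d` are assumptions. Note that N(0, 1/d) is read here as variance 1/d, so the standard deviation passed to NumPy is 1/sqrt(d).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 2_000, 500                 # assumed sizes for the sketch

# Entries drawn from N(0, 1/d): variance 1/d, so std dev is 1/sqrt(d).
P = rng.normal(loc=0.0, scale=1.0 / np.sqrt(d), size=(d, D))

o = rng.normal(size=D)            # one observation of length D
o_new = P.dot(o)                  # projected observation of length d

# With this scaling, lengths are preserved on average.
print(o_new.shape)
print(np.linalg.norm(o), np.linalg.norm(o_new))
```

The 1/d variance is what keeps the projected length close to the original length; with unit-variance entries the projected vector would be about sqrt(d) times too long.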
  • 44. Intuition ● Randomly determine coordinate system.
  • 47. Intuition ● Randomly determine coordinate system. ● Keep d directions
  • 48. Safe value for d? Using this technique, what is a “safe” value for d? Random projection: (m x D) data matrix -> (m x d) data matrix. Note: d << D.
  • 49. Safe value for d? >>> from sklearn.random_projection import johnson_lindenstrauss_min_dim ● signature: johnson_lindenstrauss_min_dim(n_samples, eps=0.1) ● input: ○ n_samples -- the number of observations you have ○ eps -- the amount of error you’re willing to tolerate ● output: ○ safe number of features that you can project down to
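  • 50. Mathematical Guarantees For any two observations u and v: (1 - eps) * ||u - v||^2 <= ||P u - P v||^2 <= (1 + eps) * ||u - v||^2, where ||u - v|| is the original distance and ||P u - P v|| is the projected distance. This holds with high probability when d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)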
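The bound on d can be evaluated by hand and compared with scikit-learn's `johnson_lindenstrauss_min_dim`. The helper `safe_d` and the choice of one million observations are assumptions for this sketch.

```python
import math
from sklearn.random_projection import johnson_lindenstrauss_min_dim

def safe_d(m, eps=0.1):
    """Bound from the slide: d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)."""
    return 4 * math.log(m) / (eps ** 2 / 2 - eps ** 3 / 3)

m = 1_000_000                     # assumed number of observations
for eps in (0.5, 0.2, 0.1):
    d_sklearn = johnson_lindenstrauss_min_dim(m, eps=eps)
    print(eps, int(safe_d(m, eps)), d_sklearn)
```

The bound depends only on the number of observations m and the tolerance eps, not on the original dimensionality D, and it grows quickly as eps shrinks.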
  • 51. Practical usage ● High probability that projection will be good, but there’s still a chance that it will not be! ○ Create multiple projections and test guarantees with sampled observations.
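Testing a projection against sampled observations might look like the sketch below. The sizes, the tolerance `eps`, and the helper `worst_distortion` are assumptions; `d = 1000` sits above the bound from the previous slide for these values (about 634).

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, d, eps = 300, 2_000, 1_000, 0.3   # assumed sizes and tolerance

X = rng.normal(size=(m, D))             # toy observations
P = rng.normal(scale=1.0 / np.sqrt(d), size=(d, D))
X_new = X @ P.T                         # project every observation

def worst_distortion(A, B, n_pairs=500):
    """Largest relative change in squared distance over sampled pairs."""
    i = rng.integers(0, len(A), size=n_pairs)
    j = rng.integers(0, len(A), size=n_pairs)
    keep = i != j                       # skip pairs of identical points
    orig = ((A[i[keep]] - A[j[keep]]) ** 2).sum(axis=1)
    proj = ((B[i[keep]] - B[j[keep]]) ** 2).sum(axis=1)
    return np.abs(proj / orig - 1.0).max()

distortion = worst_distortion(X, X_new)
print("worst sampled distortion:", distortion, "ok:", distortion <= eps)
```

If the check fails, one option consistent with the slide is to draw a new matrix P and test again.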
  • 52. Comparison ● PCA ○ finds an aligned coordinate system which maximizes spread. ○ O(m^3) runtime + requires all points to be read into memory. ○ O(dD) space to store the aligned coordinate system for projection. ● Random Projection ○ finds a random coordinate system ○ O(dD) runtime and space to construct the projection matrix ○ Guaranteed with high probability to work
  • 53. Thank you Tony Tran tony@sfdatajobs.com @quicksorter
  • 54. References ● http://www.cs.princeton.edu/~cdecoro/eigenfaces/ ● http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA ● http://scikit-learn.org/stable/modules/random_projection.html ● http://blog.yhathq.com/posts/sparse-random-projections.html