Making BIGDATA 
smaller 
Tony Tran
About Me 
● I am from SF 
○ SF Bay Area Machine Learning Meetup 
○ www.sfdatajobs.com 
● Background: 
○ BS/MS CompSci (focus on ML/Vision) 
○ 4 years as Data Engineer in Ad Tech 
○ Currently Consulting
What does “data” mean? 
RAW DATA (structured & unstructured) → Clean & Transform → Extract Features → DATA (observations and features) 
DATA is a matrix: m observations (rows) x D features (columns)
What is “Big Data?” 
Big data is an all-encompassing term for any collection of data sets so large 
and complex that it becomes difficult to process them using traditional data 
processing applications -- Wikipedia
What is “Big Data?” 
Big data is an all-encompassing term for any collection of data sets so large 
and complex that it becomes difficult to process them using traditional data 
processing applications -- Wikipedia 
To me, Big Data is when: 
● you run out of disk space 
● you run into “Out Of Memory” errors 
● your S3 bill triggers your credit card company to call you 
● you are willing to go through the pain of setting up a hadoop/spark/etc 
cluster (have you tried configuring your own cluster?)
Question 
● Can we make big data smaller? 
● What are the benefits of smaller data?
Question 
● Can we make big data smaller? 
● Benefits of having “small data”: 
○ Reduce storage costs 
○ Reduce computational costs 
○ No more “Out Of Memory” Errors
Ideas for making data small 
● Reduce the number of observations 
○ Only keep observations that are “important” 
○ Remove redundant observations 
○ Randomly sample 
● Reduce the number of features 
○ removing non-useful features 
○ combining features 
○ something clever
Random sampling of observations
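A minimal sketch of random observation sampling with numpy; the data matrix, its size, and the 10% sampling rate are made-up values for illustration.

import numpy as np

# X is an (m x D) data matrix; here we fabricate one just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))

# Keep a random 10% of the observations (rows), without replacement.
sample_size = int(0.1 * X.shape[0])
idx = rng.choice(X.shape[0], size=sample_size, replace=False)
X_sampled = X[idx]

print(X.shape, "->", X_sampled.shape)   # (100000, 50) -> (10000, 50)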
Ideas for making data small 
● Reduce the number of observations 
○ Only keep rows that are “important” 
○ Remove redundant rows 
○ Randomly sample 
● Reduce the number of features 
○ removing non-useful features 
○ combining features 
○ something clever
Our Focus: reducing feat. 
Dimensionality reduction maps the (m x D) data matrix to an (m x d) matrix. 
Note: d << D 
Note: we want to preserve the distances between observations as best as possible.
Exercise 
Given the following set of 2d observations, how can we 
represent each observation in 1d while still preserving the 
distances between points as best as possible?
Exercise
Exercise
Exercise 
Projecting observations onto x-axis
Exercise 
Projecting observations onto y-axis
Will it still work if ... 
Are we getting good 
results because the x-axis 
is aligned with the spread 
of the observations?
Unaligned data 
Can we find a better 
coordinate system 
that is aligned with 
the data?
Aligned coordinate system 
Direction which aligns with the spread of the observations 
● dimensionality reduction is easy with an aligned coordinate system
Computing Spread 
● Spread = variance of the projected observations 
● How do we express observations in the new coordinate system?
Linear Algebra 
Given new coordinate axes v1 and v2, a point p gets one coordinate per axis: 
a1 = (v1 / ||v1||) · p 
a2 = (v2 / ||v2||) · p 
p = (p1, p2) originally 
p = (a1, a2) in the new coordinate system
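A small numpy sketch of this change of coordinates; the axes v1, v2 and the point p are made-up example values, not from the slides.

import numpy as np

# New (orthogonal) coordinate axes and a point; example values only.
v1 = np.array([2.0, 1.0])
v2 = np.array([-1.0, 2.0])
p = np.array([3.0, 4.0])

# Coordinate along each axis = dot product with that axis's unit vector.
a1 = np.dot(v1 / np.linalg.norm(v1), p)
a2 = np.dot(v2 / np.linalg.norm(v2), p)

print((a1, a2))  # p expressed in the new coordinate system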
Observations 
● Finding an aligned coordinate system makes 
it easy for us to do dimensionality reduction. 
○ represent observations in the new coordinate system, 
then remove features (axes). 
● The direction parallel to the spread of the data 
maximizes the projected interpoint distances.
New tool (PCA) 
● How do we find an aligned coordinate system? 
Is there a tool already developed for this?
New tool (PCA) 
● How do we find an aligned coordinate system? 
● Principal Component Analysis 
○ Given a set of observations, finds an aligned 
coordinate system. 
○ The first direction of the coordinate system contains the 
most spread, followed by the second, and so forth. 
○ O(m³) runtime
PCA (scikit-learn) 
>>> from sklearn.decomposition import PCA 
>>> X = …  # data matrix of size (m x D) 
>>> 
>>> pca = PCA(n_components=d) 
>>> pca.fit(X)                  # fits the new coordinate system to the data 
>>> X_small = pca.transform(X)  # projects the data onto the new coordinate 
...                             # system and removes dimensions 
>>> X_small.shape               # gives us a matrix of size (m x d)
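A self-contained version of the same call sequence, using synthetic data so the shapes are concrete; the sizes (m=500, D=100, d=10) are made up for illustration.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic (m x D) data matrix; sizes chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

pca = PCA(n_components=10)
X_small = pca.fit_transform(X)       # fit the coordinate system and project

print(X.shape, "->", X_small.shape)  # (500, 100) -> (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of the spread kept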
Our Focus: reducing feat. 
Dimensionality reduction maps the (m x D) data matrix to an (m x d) matrix. 
Note: d << D 
Note: we want to preserve the distances between observations as best as possible.
3D to 2D 
Fit directions v1, v2, v3 to the 3D data; keep only v1 and v2 to get the projected 2D data space.
Image Data 
● What if our data is images? How do we 
represent an image as an observation? 
(100x100) matrix
Images and vectors 
A (100x100) pixel matrix becomes a 10k-dimensional vector.
Image Data 
Stack the pixels of the (100x100) matrix, r1, r2, ..., r10000, into a single 10k-dimensional vector. 
Image Data 
PCA on a collection of such vectors gives an aligned coordinate system with directions v1, v2, ..., v10k in pixel space.
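A minimal sketch of the flattening step with numpy; the image array and the counts are made up, and reshape does all the work.

import numpy as np

# A fake (100 x 100) grayscale image, just for illustration.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(100, 100))

x = img.reshape(-1)              # 10k-dimensional vector (r1, ..., r10000)
print(img.shape, "->", x.shape)  # (100, 100) -> (10000,)

# m images stacked row-wise give the (m x 10000) data matrix for PCA.
imgs = [rng.integers(0, 256, size=(100, 100)) for _ in range(200)]
X = np.stack([im.reshape(-1) for im in imgs])
print(X.shape)                   # (200, 10000)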
Image Data
Image Data 
Original image = a1·v1 + a2·v2 + … + a10k·v10k
Image Data 
Original image = a1·v1 + a2·v2 + … + a10k·v10k 
Reconstructed with 20 directions: ≈ a1·v1 + … + a20·v20
Image Data 
Original image = a1·v1 + a2·v2 + … + a10k·v10k 
Reconstructed with 20 directions: ≈ a1·v1 + … + a20·v20 
Reconstructed with 90 directions: ≈ a1·v1 + … + a90·v90
Image Data 
Original image = a1·v1 + a2·v2 + … + a10k·v10k 
Reconstructed with 90 directions: ≈ a1·v1 + … + a90·v90 
Compression! 
Each image can now be represented by 90 weights!
Image Data 
Reconstructed with 90 directions: ≈ a1·v1 + … + a90·v90 
● original image representation: 10k values 
● compression requires: 
○ 90 direction vectors = (90 x 10k) values 
○ 1 image = 90 weights (one per direction vector)
Image Data 
Reconstructed with 90 directions: ≈ a1·v1 + … + a90·v90 
● For 200 images: 
○ original representation: 200 * 10k = 2,000,000 values 
○ compression: (90 x 10k) + 200 * 90 = 918,000 values
Image Data 
Reconstructed with 90 directions: ≈ a1·v1 + … + a90·v90 
● For 200 images: 
○ original representation: 200 * 10k = 2,000,000 values 
○ compression: (90 x 10k) + 200 * 90 = 918,000 values 
It makes sense to use this compression technique when 
we have more than about 90 images to compress, the 
break-even point where storing the 90 direction vectors 
pays for itself. See the sketch below.
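A hedged sketch of this compression with scikit-learn's PCA; the random "images" and the choice of 90 components stand in for the example in the slides.

import numpy as np
from sklearn.decomposition import PCA

# 200 fake (100 x 100) images flattened into a (200 x 10000) matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100 * 100))

pca = PCA(n_components=90)
weights = pca.fit_transform(X)        # (200 x 90): 90 weights per image
directions = pca.components_          # (90 x 10000): the direction vectors

# Storage: direction vectors + per-image weights vs. the raw pixels.
compressed = directions.size + weights.size   # 900000 + 18000 = 918000
original = X.size                              # 2000000
print(compressed, "values instead of", original)

# Approximate reconstruction of every image from its 90 weights.
X_reconstructed = pca.inverse_transform(weights)   # (200 x 10000)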
Keep in mind 
● O(m³) runtime 
● Need to keep the d directions of length D around to 
perform the projection. 
● Requires being able to read all the data into memory. 
● What if the data is non-linear?
Random Projections 
● Generate a (d x D) matrix, P, whose 
elements are drawn from a normal 
distribution N(0, 1/d) 
● To compute a projected observation: 
○ o_new = P.dot(o) 
(d x 1) projection = (d x D) matrix P times (D x 1) observation
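A minimal numpy sketch of this projection; the dimensions and data are made up. scikit-learn's GaussianRandomProjection does the same thing behind a nicer interface.

import numpy as np

D, d = 10_000, 1_000                 # original and reduced dimensions (made up)
rng = np.random.default_rng(0)

# Projection matrix with entries ~ N(0, 1/d), i.e. standard deviation sqrt(1/d).
P = rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(d, D))

o = rng.normal(size=D)               # a single D-dimensional observation
o_new = P.dot(o)                     # its d-dimensional projection

# For a whole (m x D) data matrix X, project all rows at once:
X = rng.normal(size=(500, D))
X_new = X.dot(P.T)                   # (m x d)
print(o_new.shape, X_new.shape)      # (1000,) (500, 1000)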
Intuition 
● Randomly determine coordinate system.
Intuition 
● Randomly determine coordinate system.
Intuition 
● Randomly determine coordinate system.
Intuition 
● Randomly determine coordinate system. 
● Keep d directions
Safe value for d? 
Using this technique, what is a “safe” value for d? 
Random projection maps the (m x D) data matrix to an (m x d) matrix, with d << D.
Safe value for d? 
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim 
johnson_lindenstrauss_min_dim(n_samples, eps=0.1) 
● input: 
○ n_samples -- the number of observations you have 
○ eps -- the amount of distortion you’re willing to tolerate 
● output: 
○ a safe number of features (d) that you can project down to
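A quick usage example; the 200-observation count reuses the earlier image example, and the printed value is whatever the JL bound on the next slide works out to (a few thousand for eps=0.1).

from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Safe target dimension for 200 observations at 10% allowed distortion.
d_safe = johnson_lindenstrauss_min_dim(n_samples=200, eps=0.1)
print(d_safe)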
Mathematical Guarantees 
For any pair of observations u, v, the projection P preserves distances up to a factor eps: 
(1 - eps) * ||u - v||^2  <=  ||P·u - P·v||^2  <=  (1 + eps) * ||u - v||^2 
provided d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)
Practical usage 
● There is a high probability that the projection will be good, 
but there’s still a chance that it will not be! 
○ Create multiple projections and test the guarantees with 
sampled observations.
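A hedged sketch of that check: build one projection, then measure pairwise-distance distortion on a sample of observations; the data, sizes, and sample count are all made up.

import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10_000))     # fake (m x D) data

# Fit a random projection and project the data.
proj = GaussianRandomProjection(n_components=1_000, random_state=0)
X_new = proj.fit_transform(X)

# Compare pairwise distances before and after on a random sample of rows.
sample = rng.choice(len(X), size=50, replace=False)
d_orig = euclidean_distances(X[sample])
d_proj = euclidean_distances(X_new[sample])

mask = d_orig > 0
distortion = np.abs(d_proj[mask] - d_orig[mask]) / d_orig[mask]
print("max distortion:", distortion.max())  # re-draw P if this is too large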
Comparison 
● PCA 
○ finds an aligned coordinate system which maximizes spread. 
○ O(m³) runtime + requires all points to be read into memory. 
○ O(dD) space to store the aligned coordinate system for projection. 
● Random Projection 
○ finds a random coordinate system 
○ O(dD) runtime and space to construct the projection matrix 
○ guaranteed with high probability to work
Thank you 
Tony Tran 
tony@sfdatajobs.com 
@quicksorter
References 
● http://www.cs.princeton.edu/~cdecoro/eigenfaces/ 
● http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA 
● http://scikit-learn.org/stable/modules/random_projection.html 
● http://blog.yhathq.com/posts/sparse-random-projections.html
