1. Tensor Models and Other Dreams...
Andres Mendez-Vazquez
January 26, 2018
1 / 64
2. Outline
1 Introduction
The Dream of Tensors
A Short Story on Compression
A Short History
What the Heck are Tensors?
2 The Tensor Models for Data Science
Decomposition for Compression
CANDECOMP/PARAFAC Decomposition
The Dream of Compression and BIG DATA
Tensorizing Neural Networks
Hardware Support for the Dream
3 Conclusions
The Dream Will Follow....
2 / 64
4. Tensors are this way...
As words defining an important moment in life
Without you
All the stars we steal from the night sky
Will never be enough
Never be enough
These hands could hold the world
but it’ll
Never be enough...
- Benj Pasek / Justin Paul, “Never Enough”, The Greatest Showman
4 / 64
5. Tensors are like such words...
They are generalizations that embody our dreams...
In Data Sciences...
5 / 64
8. Then, we have an Opportunity or a Terrible Problem
How do you represent them in an easy way to handle them?
After all, we want to
Search them
Compare them
Rank them
What about using vectors?
(Figure: a document represented as a word-count vector, one counter x_1, x_2, ..., x_d per word 1, word 2, ..., word d)
8 / 64
13. The Matrix at the Center of Everything!!!
The Vector/Matrix Representation
They are basically an N × d matrix like this:

A = \begin{pmatrix}
(x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d \\
\vdots & & \vdots & & \vdots \\
(x_i)_1 & \cdots & (x_i)_j & \cdots & (x_i)_d \\
\vdots & & \vdots & & \vdots \\
(x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d
\end{pmatrix}
A is a matrix with...
N is the number of documents (in the thousands)...
d is the number of words in the dictionary (in the tens of thousands).....
9 / 64
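To make the construction concrete, here is a minimal Python sketch that builds such an N × d count matrix (the tiny corpus and the variable names are purely illustrative):

import numpy as np

# Hypothetical toy corpus; in practice N and d are in the thousands.
docs = [
    "tensors generalize matrices",
    "matrices store word counts",
    "word counts make vectors",
]

# Build the dictionary: one column per distinct word.
vocab = sorted({w for doc in docs for w in doc.split()})
index = {w: j for j, w in enumerate(vocab)}

# A[i, j] = number of times word j appears in document i.
A = np.zeros((len(docs), len(vocab)), dtype=np.int64)
for i, doc in enumerate(docs):
    for w in doc.split():
        A[i, index[w]] += 1

print(vocab)
print(A)

With thousands of documents and tens of thousands of words, this matrix is exactly where the storage problem on the next slide comes from.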
15. A Small Problem
The matrix alone consumes... so much...
Assume 2 bytes per memory cell (matrix entry).
If we have N = 10^6 documents and d = 50,000 words,
We have
2 × N × d = 10^{11} bytes = 100 Gigabytes
10 / 64
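A quick sanity check of that arithmetic (assuming the slide's 2 bytes per entry):

bytes_total = 2 * 10**6 * 50_000   # 2 bytes x N x d
print(bytes_total / 10**9, "GB")   # 100.0 GB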
18. We have a trick!!!
Something Notable
The Matrix is Highly SPARSE
12 / 64
19. Therefore
If you are smart enough
You start representing the matrix using sparse storage techniques (see the sketch below)
(Figure: a 5×5 sparse matrix, where only the numeric elements are stored and the empty elements are skipped)
13 / 64
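As a minimal sketch of the idea, using SciPy's compressed sparse row (CSR) format (the matrix and its nonzero pattern are made up):

import numpy as np
from scipy.sparse import csr_matrix

# A mostly-empty count matrix, as word-count matrices tend to be.
dense = np.zeros((5, 5))
dense[0, 2] = 3
dense[1, 0] = 1
dense[4, 3] = 7

sparse = csr_matrix(dense)  # compressed sparse row format

# Only the nonzero values, their column indices, and row pointers are kept.
print(sparse.data)     # [3. 1. 7.]
print(sparse.indices)  # columns of the nonzeros
print(sparse.indptr)   # row boundaries
print(f"dense entries: {dense.size}, stored nonzeros: {sparse.nnz}")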
20. Then
If you are quite smart....
You discover that only a few of the singular values carry most of the information...
Every matrix has a Singular Value Decomposition
A = U Σ V^T
The columns of U are an orthonormal basis for the column space.
The columns of V are an orthonormal basis for the row space.
Σ is diagonal, and the entries on its diagonal, σ_i = Σ_{ii}, are positive
real numbers, called the singular values of A.
14 / 64
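A minimal NumPy sketch of the truncation idea; the random matrix stands in for the document-word matrix and the cutoff k = 3 is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))   # stand-in for the N x d count matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3                                 # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rank-k approximation error (Eckart-Young: best possible in Frobenius norm).
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))

# Each document is now described by k numbers instead of d.
docs_compressed = U[:, :k] * s[:k]
print(docs_compressed.shape)          # (100, 3)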
24. How much compression can we get?
The Matrix Sparse Representation
It achieves 90% compression: we go from 100 Gigabytes to 10 Gigabytes
Using the Singular Value Decomposition
From 50,000 dimensions/words we go to 300 dimensions
Making it possible to go from 100 Gigabytes to
2 × N × 300 = 0.6 Gigabytes
15 / 64
27. IMAGINE!!!!
We have a crazy moment!!!
All the stars we steal from the night sky
Will never be enough
Never be enough
Towers of gold are still too little
These hands could hold the world
but it’ll
Never be enough
Never be enough
For me
16 / 64
28. Then
You get ambitious!!! You add a new dimension representing feelings!!!
(Figure: the document-word matrix extended with a third axis, the feeling dimensionality)
17 / 64
30. They have a somewhat short history!!!
Foremost
They are abstract entities invariant under coordinate transformations.
They were first mentioned by Woldemar Voigt in 1898
A German physicist, who taught at the Georg August University of
Göttingen.
He mentioned the tensors in a study about the physical properties of
crystals.
But Before That
The great Riemann introduced the concept of the manifold...
the beginning of the dream...
Through a quadratic line element used to study its properties...

ds^2 = g_{ij}\, dx^i dx^j
19 / 64
35. Then
Gregorio Ricci-Curbastro and Tullio Levi-Civita
They wrote a paper in the Mathematische Annalen, Vol. 54 (1901),
entitled “Méthodes de calcul différentiel absolu”
A Monster Came Around
20 / 64
37. “Every Genius has stood on the Shoulders of Giants” -
Newton
Einstein adopted the concepts in the paper
And the Theory of General Relativity was born
He renamed the entire field from “calcul absolu” to
TENSOR CALCULUS
21 / 64
41. We define
A Coordinate System
We define vectors in terms of a basis

v = v_x e_1 + v_y e_2 = \begin{pmatrix} v_x \\ v_y \end{pmatrix} \in \mathbb{R}^2

\|v\| = \left( v_x^2 + v_y^2 \right)^{1/2}

Note: This is important, a vector is always the same object no
matter the coordinate system
24 / 64
42. Therefore
Imagine representing v in a new (primed) basis, expressed in terms of the old basis

e'_1 · v = v'_x = e'_1 · (v_x e_1) + e'_1 · (v_y e_2)
e'_2 · v = v'_y = e'_2 · (v_x e_1) + e'_2 · (v_y e_2)

Where
e'_i · e_j = Projection of e'_i onto e_j
25 / 64
44. Using a Little bit of Notation
We need a notation that is more compact
Let the indices i, j represent the numbers 1, 2 corresponding to the
coordinates x, y
Write the components of v as v_i and v'_i in the two coordinate systems
Then define

a_{ij} = e'_i · e_j

Note: This defines the “ROTATION”
In fact, the a_{ij} are individually just the cosines of the angles
between one axis and another
26 / 64
46. Therefore
We can rewrite the entire transformation
v'_i = \sum_{j=1}^{2} a_{ij} v_j

We will agree that whenever an index appears twice, we have a sum

v'_i = a_{ij} v_j
27 / 64
48. We have then...
We can do the following
\begin{pmatrix} v'_1 \\ v'_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}

Then, we compress our notation more

v' = a v
28 / 64
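A small NumPy check of v' = a v, assuming the new basis is the old one rotated by an arbitrary angle θ, so that the entries a_ij = e'_i · e_j are exactly the cosines mentioned above:

import numpy as np

theta = np.pi / 6                       # arbitrary rotation of the axes
a = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # a_ij = e'_i . e_j

v = np.array([2.0, 1.0])                # components in the old basis
v_prime = a @ v                         # components in the new basis

print(v_prime)
print(np.linalg.norm(v), np.linalg.norm(v_prime))  # the length is unchanged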
50. Then, we can redefine our dot product
The Basis of Projecting into other vectors
v · w = v_i w_i, while in the new coordinates v'_i w'_i = a_{ij} a_{ik} v_j w_k

Using the Kronecker Delta

\delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}

Therefore, we have

a_{ij} a_{ik} = \delta_{jk}
29 / 64
53. Proving the Invariance of the dot product
Therefore
v'_i w'_i = \delta_{jk} v_j w_k = v_j w_j
30 / 64
54. Then, we have
A scalar is a number K
It has the same value in different coordinate systems.
A vector is a set of numbers v_i
They transform according to

v'_i = a_{ij} v_j

A (Second Rank) Tensor is a set of numbers T_{ij}
They transform according to

T'_{ij} = a_{ik} a_{jl} T_{kl}
31 / 64
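Continuing the same sketch, the rotation matrix above also lets us verify the second-rank rule T'_{ij} = a_{ik} a_{jl} T_{kl} and the invariance of the scalar v · w (all values are illustrative):

import numpy as np

theta = np.pi / 6
a = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

v = np.array([2.0, 1.0])
w = np.array([-1.0, 3.0])
T = np.array([[1.0, 2.0],
              [0.5, 4.0]])

v_p, w_p = a @ v, a @ w
T_p = np.einsum('ik,jl,kl->ij', a, a, T)   # T'_ij = a_ik a_jl T_kl

print(np.dot(v, w), np.dot(v_p, w_p))      # the scalar v.w is unchanged
print(np.allclose(a @ T @ a.T, T_p))       # same rule written with matrices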
57. Then you can go higher
For example, tensors of rank 3
32 / 64
59. Once, we have an idea of Tensor
Do we have decompositions similar to the ones in the SVD?
We have them......!!!
A Little Bit of History
Tensor decompositions originated with Hitchcock in 1927
An American mathematician and physicist known for his formulation of
the transportation problem in 1941.
A multiway model is attributed to Cattell in 1944
A British and American psychologist, known for his psychometric
research into intrapersonal psychological structure.
But it was not until Ledyard R. Tucker
“Some mathematical notes on three-mode factor analysis,”
Psychometrika, 31 (1966), pp. 279–311.
34 / 64
63. The Dream has been expanding beyond Physics
In the last ten years
1 Signal Processing
2 Numerical Linear Algebra
3 Computer Vision
4 Data Mining
5 Graph analysis
6 Neurosciences
7 etc
And we are going further
The Dream of Representation is at full speed when dealing with BIG
DATA!!!
35 / 64
71. Decomposition of Tensors
Hitchcock proposed such a decomposition first... then the deluge
Name                                          Proposed by
Polyadic form of a tensor                     Hitchcock, 1927
Three-mode factor analysis                    Tucker, 1966
PARAFAC (parallel factors)                    Harshman, 1970
CANDECOMP or CAND (canonical decomposition)   Carroll and Chang, 1970
Topographic components model                  Möcks, 1988
CP (CANDECOMP/PARAFAC)                        Kiers, 2000
36 / 64
73. Look at the most modern one, from 17 years ago...
The CP decomposition factorizes a tensor into a sum of component
rank-one tensors (outer products of vectors!!!)
X \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r, \quad X \in \mathbb{R}^{I \times J \times K}

Where
R is a positive integer
a_r \in \mathbb{R}^{I}, \; b_r \in \mathbb{R}^{J}, \; c_r \in \mathbb{R}^{K}
38 / 64
75. Then, Point Wise
We have the following
x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}
Graphically
39 / 64
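A quick NumPy sketch of this pointwise formula; the factor matrices A, B, C and the rank R are random stand-ins, just to show the bookkeeping:

import numpy as np

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I, R))   # columns a_r
B = rng.standard_normal((J, R))   # columns b_r
C = rng.standard_normal((K, R))   # columns c_r

# x_ijk = sum_r A[i, r] * B[j, r] * C[k, r]
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Same thing, one rank-one term at a time: a_r o b_r o c_r
X_alt = sum(np.multiply.outer(np.outer(A[:, r], B[:, r]), C[:, r])
            for r in range(R))
print(np.allclose(X, X_alt))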
77. Therefore
The rank of a tensor X, rank(X)
It is defined as the smallest number of rank-one tensors that generate X
as their sum!!!
Problem!!!
The problem is NP-hard
But that has not stopped us because
We can use many of the methods in optimization to try to figure out
the magical number R!!!
From Approximation Techniques...
To Branch and Bound...
Even Naive techniques...
40 / 64
80. Why so much effort?
A Big Difference with SVD
It is never unique unless we have orthogonality between the columns or
rows in the matrix.
We have then
That Tensors are way more general and less prone to problems!!!
41 / 64
82. Now
We introduce a little more notation

X \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r = \llbracket A, B, C \rrbracket

CP decomposes the tensor using the following optimization (a sketch of the usual solver follows below)

\min_{\hat{X}} \left\| X - \hat{X} \right\| \quad \text{s.t.} \quad \hat{X} = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r = \llbracket \lambda; A, B, C \rrbracket
42 / 64
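The slides do not fix an algorithm for this optimization; the usual workhorse is alternating least squares (CP-ALS): freeze two factor matrices, solve for the third in closed form, and cycle. A minimal NumPy sketch for a 3-way tensor, ignoring the λ normalization (the rank, the iteration count, and the synthetic test tensor are assumptions):

import numpy as np

def unfold(X, mode):
    """Mode-n unfolding X_(n), columns ordered as in Kolda & Bader."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def khatri_rao(U, V):
    """Column-wise Kronecker product: column r is kron(U[:, r], V[:, r])."""
    R = U.shape[1]
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, R)

def cp_als(X, R, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    for _ in range(n_iter):
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

# Recover an exactly rank-2 tensor as a sanity check.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (4, 5, 6))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, R=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))

Libraries such as TensorLy ship a ready-made version of this routine, with better initialization and stopping rules.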
85. Here is why...
Here, a direct numerical simulation
It can easily produce 100 GB to 1000 GB per DAY
The data comes from (circa 2016)
S3D, a massively parallel compressible reacting flow solver
developed at Sandia National Laboratories...
For example, data came from
1 An autoignitive premixture of air and ethanol in Homogeneous Charge
Compression Ignition (HCCI)
    Each time step requires 111 MB of storage, and the entire dataset is 70 GB.
2 A temporally-evolving planar slot jet flame with DME (dimethyl
ether) as the fuel
    Each time step requires 32 GB of storage, so the entire dataset is 520 GB
44 / 64
89. Even in Machines like
A Cray XC30 supercomputer
5,576 dual-socket 12-core Intel “Ivy Bridge” (2.4 GHz) compute
nodes.
The peak flop rate of each core is 19.2 GFLOPS.
Each node has 64 GB of memory.
These machines will go down
Because the data representation is not efficient...
45 / 64
92. Furthermore...
We have that, for 550 Gigabytes, compressions such as
1 5 times (down to roughly 100 Gigs)
2 16 times (down to roughly 34 Gigs)
3 55 times (down to roughly 10 Gigs)
4 etc
Improving running times dramatically... from 3 seconds to 70 seconds
when processing 15 TB of data...
47 / 64
94. We have a huge problem in Deep Neural Networks
Modern Architectures
They are consuming from 89% to 100% of the memory on the host GPUs and
machines
Depending on where the calculations are done!!!
49 / 64
95. Problem with such Architectures
Recent studies show
The weight matrix of the fully-connected layer is highly redundant.
If you reduce the number of parameters, you could achieve
A similar predictive power
Possibly making them less prone to over-fitting or under-fitting
50 / 64
97. Thus
In the Paper
Novikov, A., Podoprikhin, D., Osokin, A. and Vetrov, D.P., 2015.
Tensorizing neural networks. In Advances in Neural Information
Processing Systems (pp. 442-450).
They Proposed the TT-Representation
Where, for a d-dimensional array (tensor) A
If for each dimension k = 1, ..., d and each possible value of the k-th
dimension index j_k = 1, ..., n_k
There exists a matrix G_k[j_k] such that all the elements of A can be
computed as a product of such matrices.
51 / 64
100. Then
The TT-Representation
A(j_1, j_2, \ldots, j_d) = G_1[j_1]\, G_2[j_2] \cdots G_d[j_d]

All matrices G_k[j_k] related to the same dimension k are restricted to
be of the same size r_{k-1} \times r_k.
52 / 64
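A tiny NumPy sketch of evaluating a single element A(j_1, ..., j_d) from its TT cores; the mode sizes, TT-ranks, and random cores are all illustrative, and the boundary ranks are 1 so the matrix product collapses to a scalar:

import numpy as np

rng = np.random.default_rng(0)
n = [4, 3, 5, 2]          # mode sizes n_1 ... n_d
r = [1, 2, 3, 2, 1]       # TT-ranks r_0 ... r_d (r_0 = r_d = 1)

# One core per dimension: G[k][j_k] is an r_{k-1} x r_k matrix.
G = [rng.standard_normal((n[k], r[k], r[k + 1])) for k in range(len(n))]

def tt_element(G, j):
    """A(j_1, ..., j_d) = G_1[j_1] G_2[j_2] ... G_d[j_d]."""
    out = np.eye(1)
    for k, jk in enumerate(j):
        out = out @ G[k][jk]
    return out.item()     # a 1x1 matrix, i.e., a scalar

print(tt_element(G, (1, 0, 4, 1)))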
101. Here is a problem: we do not have a unique representation
We then go for the lowest rank
A(j_1, j_2, \ldots, j_d) = \sum_{\alpha_0,\ldots,\alpha_d} G_1[j_1](\alpha_0, \alpha_1) \cdots G_d[j_d](\alpha_{d-1}, \alpha_d)

Where
G_k[j_k](\alpha_{k-1}, \alpha_k) represents the element of the matrix G_k[j_k] at position
(\alpha_{k-1}, \alpha_k)
53 / 64
103. With Memory Usage
For the full representation

\prod_{k=1}^{d} n_k

and for the TT-Representation

\sum_{k=1}^{d} n_k r_{k-1} r_k
54 / 64
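For a concrete (made-up) choice of mode sizes and TT-ranks, the gap between the two counts is easy to check:

import numpy as np

n = [4, 8, 4, 8, 4, 8]        # mode sizes n_k (illustrative)
r = [1, 3, 3, 3, 3, 3, 1]     # TT-ranks r_0 ... r_d

full = np.prod(n)                                          # prod_k n_k entries
tt = sum(nk * r[k] * r[k + 1] for k, nk in enumerate(n))   # sum_k n_k r_{k-1} r_k

print(full, tt)   # 32768 entries vs. a few hundred TT parameters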
105. Then
They propose to store the weight matrix W of each fully connected layer in a TT-Representation
Then, for the usual fully connected mapping used in forward and back-propagation

y = Wx + b

With W \in \mathbb{R}^{M \times N} and b \in \mathbb{R}^{M}

In the TT-Representation this becomes

Y(i_1, i_2, \ldots, i_d) = \sum_{j_1,\ldots,j_d} G_1[i_1, j_1] \cdots G_d[i_d, j_d]\, X(j_1, j_2, \ldots, j_d) + B(i_1, i_2, \ldots, i_d)
55 / 64
108. This has the following complexity
The previous representation allows us to handle a larger number of
parameters
Without too much overhead...
With the following complexities
Operation           Time                       Memory
FC forward pass     O(MN)                      O(MN)
TT forward pass     O(d r^2 m max{M, N})       O(d r^2 max{M, N})
FC backward pass    O(MN)                      O(MN)
TT backward pass    O(d r^2 m max{M, N})       O(d r^3 max{M, N})
56 / 64
110. Applications for this
Better manage
The amount of memory being used in the devices
Increase the size of the Deep Networks
Although I have some thoughts about this...
Implement CNN networks on mobile devices
Kim, Yong-Deok, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu
Yang, and Dongjun Shin. "Compression of deep convolutional neural
networks for fast and low power mobile applications." arXiv preprint
arXiv:1511.06530 (2015).
57 / 64
114. Given that
Something Notable
Sparse tensors appear in many large-scale applications with
multidimensional and sparse data.
What support do we have for such situations?
Liu, Bangtian, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri
Dehnavi. "A Unified Optimization Approach for Sparse Tensor
Operations on GPUs." arXiv preprint arXiv:1705.09905 (2017).
59 / 64
116. They pointed out different resources that you have around
Shared memory systems
The Tensor Toolbox [21], [4] and the N-way Toolbox [22] are two widely
used MATLAB toolboxes
The Cyclops Tensor Framework (CTF) is a C++ library which
provides automatic parallelization for sparse tensor operations.
etc
Distributed memory systems
Gigatensor handles tera-scale tensors using the MapReduce
framework.
Hypertensor is a sparse tensor library for SpMTTKRP on
distributed-memory environments.
etc
60 / 64
122. And the Grail
GPU
Li proposes a parallel algorithm and implementation on GPUs, via
parallelizing certain operations over tensor fibers.
TensorFlow... actually supports a certain version of tensor
representation...
Something Notable
Efforts to solve more problems are on the way
The future looks promising
61 / 64
127. As Always
We need people able to dream these new ways of doing stuff...
Therefore, a few pieces of advice...
Learn more than a simple framework...
Learn the mathematics
And more importantly
Learn how to Model the Reality using such
Mathematical Tools...
63 / 64