1. Tensor Models and Other Dreams...
Andres Mendez-Vazquez
January 26, 2018
1 / 64
2. Outline
1 Introduction
The Dream of Tensors
A Short Story on Compression
A Short History
What the Heck are Tensors?
2 The Tensor Models for Data Science
Decomposition for Compression
CANDECOMP/PARAFAC Decomposition
The Dream of Compression and BIG DATA
Tensorizing Neural Networks
Hardware Support for the Dream
3 Conclusions
The Dream Will Follow....
2 / 64
4. Tensors are this way...
As words defining an important moment in life
Without you
All the stars we steal from the night sky
Will never be enough
Never be enough
These hands could hold the world
but it’ll
Never be enough...
- Benj Pasek / Justin Paul, “Never Enough”, The Greatest Showman
4 / 64
5. Tensors are like such words...
They are generalizations that embody our dreams...
In Data Sciences...
5 / 64
8. Then, we have an Opportunity or a Terrible Problem
How do you represent them in an easy way to handle them?
After all, we want to
Search them
Compare them
Rank them
What about using vectors?
(Figure: a document represented as a word-count vector, one counter x_1, x_2, ..., x_d per word 1, word 2, ..., word d)
8 / 64
13. The Matrix at the Center of Everything!!!
The Vector/Matrix Representation
They are basically an N × d matrix like this:

A = \begin{pmatrix}
(x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d \\
\vdots & & \vdots & & \vdots \\
(x_i)_1 & \cdots & (x_i)_j & \cdots & (x_i)_d \\
\vdots & & \vdots & & \vdots \\
(x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d
\end{pmatrix}
A is a matrix with...
N is the number of documents (in the thousands)...
d is the number of words in the dictionary (in the tens of thousands).....
9 / 64
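To make the construction concrete, here is a minimal Python sketch that builds such an N × d count matrix (the tiny corpus and the variable names are purely illustrative):

import numpy as np

# Hypothetical toy corpus; in practice N and d are in the thousands.
docs = [
    "tensors generalize matrices",
    "matrices store word counts",
    "word counts make vectors",
]

# Build the dictionary: one column per distinct word.
vocab = sorted({w for doc in docs for w in doc.split()})
index = {w: j for j, w in enumerate(vocab)}

# A[i, j] = number of times word j appears in document i.
A = np.zeros((len(docs), len(vocab)), dtype=np.int64)
for i, doc in enumerate(docs):
    for w in doc.split():
        A[i, index[w]] += 1

print(vocab)
print(A)

With thousands of documents and tens of thousands of words, this matrix is exactly where the storage problem on the next slide comes from.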
15. A Small Problem
The matrix alone consumes... so much...
Assume 2 bytes per memory cell (matrix entry).
If we have N = 10^6 documents and d = 50,000 words,
We have
2 × N × d = 10^{11} bytes = 100 Gigabytes
10 / 64
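A quick sanity check of that arithmetic (assuming the slide's 2 bytes per entry):

bytes_total = 2 * 10**6 * 50_000   # 2 bytes x N x d
print(bytes_total / 10**9, "GB")   # 100.0 GB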
18. We have a trick!!!
Something Notable
The Matrix is Highly SPARSE
12 / 64
19. Therefore
If you are smart enough
You start representing the matrix using sparse storage techniques (see the sketch below)
(Figure: a 5×5 sparse matrix, where only the numeric elements are stored and the empty elements are skipped)
13 / 64
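As a minimal sketch of the idea, using SciPy's compressed sparse row (CSR) format (the matrix and its nonzero pattern are made up):

import numpy as np
from scipy.sparse import csr_matrix

# A mostly-empty count matrix, as word-count matrices tend to be.
dense = np.zeros((5, 5))
dense[0, 2] = 3
dense[1, 0] = 1
dense[4, 3] = 7

sparse = csr_matrix(dense)  # compressed sparse row format

# Only the nonzero values, their column indices, and row pointers are kept.
print(sparse.data)     # [3. 1. 7.]
print(sparse.indices)  # columns of the nonzeros
print(sparse.indptr)   # row boundaries
print(f"dense entries: {dense.size}, stored nonzeros: {sparse.nnz}")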
20. Then
If you are quite smart....
You discover that only a few of the singular values carry most of the information...
Every matrix has a Singular Value Decomposition
A = U Σ V^T
The columns of U are an orthonormal basis for the column space.
The columns of V are an orthonormal basis for the row space.
Σ is diagonal, and the entries on its diagonal, σ_i = Σ_{ii}, are positive
real numbers, called the singular values of A.
14 / 64
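A minimal NumPy sketch of the truncation idea; the random matrix stands in for the document-word matrix and the cutoff k = 3 is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))   # stand-in for the N x d count matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3                                 # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rank-k approximation error (Eckart-Young: best possible in Frobenius norm).
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))

# Each document is now described by k numbers instead of d.
docs_compressed = U[:, :k] * s[:k]
print(docs_compressed.shape)          # (100, 3)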
24. How much compression can we get?
The Matrix Sparse Representation
It achieves 90% compression: we go from 100 Gigabytes to 10 Gigabytes
Using the Singular Value Decomposition
From 50,000 dimensions/words we go to 300 dimensions
Making it possible to go from 100 Gigabytes to
2 × N × 300 = 0.6 Gigabytes
15 / 64
27. IMAGINE!!!!
We have a crazy moment!!!
All the stars we steal from the night sky
Will never be enough
Never be enough
Towers of gold are still too little
These hands could hold the world
but it’ll
Never be enough
Never be enough
For me
16 / 64
28. Then
You get ambitious!!! You add a new dimension representing feelings!!!
(Figure: the document-word matrix extended with a third axis, the feeling dimensionality)
17 / 64
30. They have a somewhat short history!!!
Foremost
They are abstract entities invariant under coordinate transformations.
They were first mentioned by Woldemar Voigt in 1898
A German physicist, who taught at the Georg August University of
Göttingen.
He mentioned the tensors in a study about the physical properties of
crystals.
But Before That
The great Riemann introduced the concept of the manifold...
the beginning of the dream...
Through a quadratic line element used to study its properties...

ds^2 = g_{ij}\, dx^i dx^j
19 / 64
35. Then
Gregorio Ricci-Curbastro and Tullio Levi-Civita
They wrote a paper in the Mathematische Annalen, Vol. 54 (1901),
entitled “Méthodes de calcul différentiel absolu”
A Monster Came Around
20 / 64
37. “Every Genius has stood on the Shoulders of Giants” -
Newton
Einstein adopted the concepts in the paper
And the Theory of General Relativity was born
He renamed the entire field from “calcul absolu” to
TENSOR CALCULUS
21 / 64
41. We define
A Coordinate System
We define vectors in terms of a basis

v = v_x e_1 + v_y e_2 = \begin{pmatrix} v_x \\ v_y \end{pmatrix} \in \mathbb{R}^2

\|v\| = \left( v_x^2 + v_y^2 \right)^{1/2}

Note: This is important, a vector is always the same object no
matter the coordinate system
24 / 64
42. Therefore
Imagine representing v in a new (primed) basis, expressed in terms of the old basis

e'_1 · v = v'_x = e'_1 · (v_x e_1) + e'_1 · (v_y e_2)
e'_2 · v = v'_y = e'_2 · (v_x e_1) + e'_2 · (v_y e_2)

Where
e'_i · e_j = Projection of e'_i onto e_j
25 / 64
44. Using a Little bit of Notation
We need a notation that is more compact
Let the indices i, j represent the numbers 1, 2 corresponding to the
coordinates x, y
Write the components of v as v_i and v'_i in the two coordinate systems
Then define

a_{ij} = e'_i · e_j

Note: This defines the “ROTATION”
In fact, the a_{ij} are individually just the cosines of the angles
between one axis and another
26 / 64
46. Therefore
We can rewrite the entire transformation
v'_i = \sum_{j=1}^{2} a_{ij} v_j

We will agree that whenever an index appears twice, we have a sum

v'_i = a_{ij} v_j
27 / 64
48. We have then...
We can do the following
\begin{pmatrix} v'_1 \\ v'_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}

Then, we compress our notation more

v' = a v
28 / 64
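A small NumPy check of v' = a v, assuming the new basis is the old one rotated by an arbitrary angle θ, so that the entries a_ij = e'_i · e_j are exactly the cosines mentioned above:

import numpy as np

theta = np.pi / 6                       # arbitrary rotation of the axes
a = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # a_ij = e'_i . e_j

v = np.array([2.0, 1.0])                # components in the old basis
v_prime = a @ v                         # components in the new basis

print(v_prime)
print(np.linalg.norm(v), np.linalg.norm(v_prime))  # the length is unchanged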
50. Then, we can redefine our dot product
The Basis of Projecting into other vectors
v · w = v_i w_i, while in the new coordinates v'_i w'_i = a_{ij} a_{ik} v_j w_k

Using the Kronecker Delta

\delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}

Therefore, we have

a_{ij} a_{ik} = \delta_{jk}
29 / 64
53. Proving the Invariance of the dot product
Therefore
v'_i w'_i = \delta_{jk} v_j w_k = v_j w_j
30 / 64
54. Then, we have
A scalar is a number K
It has the same value in different coordinate systems.
A vector is a set of numbers v_i
They transform according to

v'_i = a_{ij} v_j

A (Second Rank) Tensor is a set of numbers T_{ij}
They transform according to

T'_{ij} = a_{ik} a_{jl} T_{kl}
31 / 64
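Continuing the same sketch, the rotation matrix above also lets us verify the second-rank rule T'_{ij} = a_{ik} a_{jl} T_{kl} and the invariance of the scalar v · w (all values are illustrative):

import numpy as np

theta = np.pi / 6
a = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

v = np.array([2.0, 1.0])
w = np.array([-1.0, 3.0])
T = np.array([[1.0, 2.0],
              [0.5, 4.0]])

v_p, w_p = a @ v, a @ w
T_p = np.einsum('ik,jl,kl->ij', a, a, T)   # T'_ij = a_ik a_jl T_kl

print(np.dot(v, w), np.dot(v_p, w_p))      # the scalar v.w is unchanged
print(np.allclose(a @ T @ a.T, T_p))       # same rule written with matrices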
57. Then you can go higher
For example, tensors of rank 3
32 / 64
59. Once, we have an idea of Tensor
Do we have decompositions similar to the ones in the SVD?
We have them......!!!
A Little Bit of History
Tensor decompositions originated with Hitchcock in 1927
An American mathematician and physicist known for his formulation of
the transportation problem in 1941.
A multiway model is attributed to Cattell in 1944
A British and American psychologist, known for his psychometric
research into intrapersonal psychological structure.
But it was not until Ledyard R. Tucker
“Some mathematical notes on three-mode factor analysis,”
Psychometrika, 31 (1966), pp. 279–311.
34 / 64
63. The Dream has been expanding beyond Physics
In the last ten years
1 Signal Processing
2 Numerical Linear Algebra
3 Computer Vision
4 Data Mining
5 Graph analysis
6 Neurosciences
7 etc
And we are going further
The Dream of Representation is at full speed when dealing with BIG
DATA!!!
35 / 64
71. Decomposition of Tensors
Hitchcock proposed such a decomposition first... then the deluge
Name                                          Proposed by
Polyadic form of a tensor                     Hitchcock, 1927
Three-mode factor analysis                    Tucker, 1966
PARAFAC (parallel factors)                    Harshman, 1970
CANDECOMP or CAND (canonical decomposition)   Carroll and Chang, 1970
Topographic components model                  Möcks, 1988
CP (CANDECOMP/PARAFAC)                        Kiers, 2000
36 / 64
73. Look at the most modern one, from 17 years ago...
The CP decomposition factorizes a tensor into a sum of component
rank-one tensors (outer products of vectors!!!)
X \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r, \quad X \in \mathbb{R}^{I \times J \times K}

Where
R is a positive integer
a_r \in \mathbb{R}^{I}, \; b_r \in \mathbb{R}^{J}, \; c_r \in \mathbb{R}^{K}
38 / 64
75. Then, Point Wise
We have the following
x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}
Graphically
39 / 64
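A quick NumPy sketch of this pointwise formula; the factor matrices A, B, C and the rank R are random stand-ins, just to show the bookkeeping:

import numpy as np

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I, R))   # columns a_r
B = rng.standard_normal((J, R))   # columns b_r
C = rng.standard_normal((K, R))   # columns c_r

# x_ijk = sum_r A[i, r] * B[j, r] * C[k, r]
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Same thing, one rank-one term at a time: a_r o b_r o c_r
X_alt = sum(np.multiply.outer(np.outer(A[:, r], B[:, r]), C[:, r])
            for r in range(R))
print(np.allclose(X, X_alt))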
77. Therefore
The rank of a tensor X, rank(X)
It is defined as the smallest number of rank-one tensors that generate X
as their sum!!!
Problem!!!
The problem is NP-hard
But that has not stopped us because
We can use many of the methods in optimization to try to figure out
the magical number R!!!
From Approximation Techniques...
To Branch and Bound...
Even Naive techniques...
40 / 64
80. Why so much effort?
A Big Difference with SVD
It is never unique unless we have orthogonality between the columns or
rows in the matrix.
We have then
That Tensors are way more general and less prone to problems!!!
41 / 64
82. Now
We introduce a little more notation

X \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r = \llbracket A, B, C \rrbracket

CP decomposes the tensor using the following optimization (a sketch of the usual solver follows below)

\min_{\hat{X}} \left\| X - \hat{X} \right\| \quad \text{s.t.} \quad \hat{X} = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r = \llbracket \lambda; A, B, C \rrbracket
42 / 64
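The slides do not fix an algorithm for this optimization; the usual workhorse is alternating least squares (CP-ALS): freeze two factor matrices, solve for the third in closed form, and cycle. A minimal NumPy sketch for a 3-way tensor, ignoring the λ normalization (the rank, the iteration count, and the synthetic test tensor are assumptions):

import numpy as np

def unfold(X, mode):
    """Mode-n unfolding X_(n), columns ordered as in Kolda & Bader."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def khatri_rao(U, V):
    """Column-wise Kronecker product: column r is kron(U[:, r], V[:, r])."""
    R = U.shape[1]
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, R)

def cp_als(X, R, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    for _ in range(n_iter):
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

# Recover an exactly rank-2 tensor as a sanity check.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (4, 5, 6))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, R=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))

Libraries such as TensorLy ship a ready-made version of this routine, with better initialization and stopping rules.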
85. Here is why...
Here, a direct numerical simulation
It can easily produce 100 GB to 1000 GB per DAY
The data comes from (circa 2016)
S3D, a massively parallel compressible reacting flow solver
developed at Sandia National Laboratories...
For example, data came from
1 An autoignitive premixture of air and ethanol in Homogeneous Charge
Compression Ignition (HCCI)
    Each time step requires 111 MB of storage, and the entire dataset is 70 GB.
2 A temporally-evolving planar slot jet flame with DME (dimethyl
ether) as the fuel
    Each time step requires 32 GB of storage, so the entire dataset is 520 GB
44 / 64
89. Even in Machines like
A Cray XC30 supercomputer
5,576 dual-socket 12-core Intel “Ivy Bridge” (2.4 GHz) compute
nodes.
The peak flop rate of each core is 19.2 GFLOPS.
Each node has 64 GB of memory.
These machines will go down
Because the data representation is not efficient...
45 / 64
92. Furthermore...
We have that, for 550 Gigabytes, compressions such as
1 5 times (down to roughly 100 Gigs)
2 16 times (down to roughly 34 Gigs)
3 55 times (down to roughly 10 Gigs)
4 etc
Improving running times dramatically... from 3 seconds to 70 seconds
when processing 15 TB of data...
47 / 64
94. We have a huge problem in Deep Neural Networks
Modern Architectures
They are consuming from 89% to 100% of the memory on the host GPUs and
machines
Depending on where the calculations are done!!!
49 / 64
95. Problem with such Architectures
Recent studies show
The weight matrix of the fully-connected layer is highly redundant.
If you reduce the number of parameters, you could achieve
A similar predictive power
Possibly making them less prone to over-fitting or under-fitting
50 / 64
97. Thus
In the Paper
Novikov, A., Podoprikhin, D., Osokin, A. and Vetrov, D.P., 2015.
Tensorizing neural networks. In Advances in Neural Information
Processing Systems (pp. 442-450).
They Proposed the TT-Representation
Where, for a d-dimensional array (tensor) A
If for each dimension k = 1, ..., d and each possible value of the k-th
dimension index j_k = 1, ..., n_k
There exists a matrix G_k[j_k] such that all the elements of A can be
computed as a product of such matrices.
51 / 64
100. Then
The TT-Representation
A(j_1, j_2, \ldots, j_d) = G_1[j_1]\, G_2[j_2] \cdots G_d[j_d]

All matrices G_k[j_k] related to the same dimension k are restricted to
be of the same size r_{k-1} \times r_k.
52 / 64
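A tiny NumPy sketch of evaluating a single element A(j_1, ..., j_d) from its TT cores; the mode sizes, TT-ranks, and random cores are all illustrative, and the boundary ranks are 1 so the matrix product collapses to a scalar:

import numpy as np

rng = np.random.default_rng(0)
n = [4, 3, 5, 2]          # mode sizes n_1 ... n_d
r = [1, 2, 3, 2, 1]       # TT-ranks r_0 ... r_d (r_0 = r_d = 1)

# One core per dimension: G[k][j_k] is an r_{k-1} x r_k matrix.
G = [rng.standard_normal((n[k], r[k], r[k + 1])) for k in range(len(n))]

def tt_element(G, j):
    """A(j_1, ..., j_d) = G_1[j_1] G_2[j_2] ... G_d[j_d]."""
    out = np.eye(1)
    for k, jk in enumerate(j):
        out = out @ G[k][jk]
    return out.item()     # a 1x1 matrix, i.e., a scalar

print(tt_element(G, (1, 0, 4, 1)))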
101. Here is a problem: we do not have a unique representation
We then go for the lowest rank
A(j_1, j_2, \ldots, j_d) = \sum_{\alpha_0,\ldots,\alpha_d} G_1[j_1](\alpha_0, \alpha_1) \cdots G_d[j_d](\alpha_{d-1}, \alpha_d)

Where
G_k[j_k](\alpha_{k-1}, \alpha_k) represents the element of the matrix G_k[j_k] at position
(\alpha_{k-1}, \alpha_k)
53 / 64
103. With Memory Usage
For the full representation

\prod_{k=1}^{d} n_k

and for the TT-Representation

\sum_{k=1}^{d} n_k r_{k-1} r_k
54 / 64
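For a concrete (made-up) choice of mode sizes and TT-ranks, the gap between the two counts is easy to check:

import numpy as np

n = [4, 8, 4, 8, 4, 8]        # mode sizes n_k (illustrative)
r = [1, 3, 3, 3, 3, 3, 1]     # TT-ranks r_0 ... r_d

full = np.prod(n)                                          # prod_k n_k entries
tt = sum(nk * r[k] * r[k + 1] for k, nk in enumerate(n))   # sum_k n_k r_{k-1} r_k

print(full, tt)   # 32768 entries vs. a few hundred TT parameters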
105. Then
They propose to store the weight matrix W of each fully connected layer in a TT-Representation
Then, for the usual fully connected mapping used in forward and back-propagation

y = Wx + b

With W \in \mathbb{R}^{M \times N} and b \in \mathbb{R}^{M}

In the TT-Representation this becomes

Y(i_1, i_2, \ldots, i_d) = \sum_{j_1,\ldots,j_d} G_1[i_1, j_1] \cdots G_d[i_d, j_d]\, X(j_1, j_2, \ldots, j_d) + B(i_1, i_2, \ldots, i_d)
55 / 64
108. This has the following complexity
The previous representation allows us to handle a larger number of
parameters
Without too much overhead...
With the following complexities
Operation           Time                       Memory
FC forward pass     O(MN)                      O(MN)
TT forward pass     O(d r^2 m max{M, N})       O(d r^2 max{M, N})
FC backward pass    O(MN)                      O(MN)
TT backward pass    O(d r^2 m max{M, N})       O(d r^3 max{M, N})
56 / 64
110. Applications for this
Better manage
The amount of memory being used in the devices
Increase the size of the Deep Networks
Although I have some thoughts about this...
Implement CNN networks on mobile devices
Kim, Yong-Deok, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu
Yang, and Dongjun Shin. "Compression of deep convolutional neural
networks for fast and low power mobile applications." arXiv preprint
arXiv:1511.06530 (2015).
57 / 64
114. Given that
Something Notable
Sparse tensors appear in many large-scale applications with
multidimensional and sparse data.
What support do we have for such situations?
Liu, Bangtian, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri
Dehnavi. "A Unified Optimization Approach for Sparse Tensor
Operations on GPUs." arXiv preprint arXiv:1705.09905 (2017).
59 / 64
116. They pointed out different resources that you have around
Shared memory systems
The Tensor Toolbox [21], [4] and the N-way Toolbox [22] are two widely
used MATLAB toolboxes
The Cyclops Tensor Framework (CTF) is a C++ library which
provides automatic parallelization for sparse tensor operations.
etc
Distributed memory systems
Gigatensor handles tera-scale tensors using the MapReduce
framework.
Hypertensor is a sparse tensor library for SpMTTKRP on
distributed-memory environments.
etc
60 / 64
122. And the Grail
GPU
Li proposes a parallel algorithm and implementation on GPUs, via
parallelizing certain operations over tensor fibers.
TensorFlow... actually supports a certain version of tensor
representation...
Something Notable
Efforts to solve more problems are on the way
The future looks promising
61 / 64
127. As Always
We need people able to dream these new ways of doing stuff...
Therefore, a few pieces of advice...
Learn more than a simple framework...
Learn the mathematics
And more importantly
Learn how to Model the Reality using such
Mathematical Tools...
63 / 64