- 1. Effective Numerical Computation in NumPy and SciPy Kimikazu Kato PyCon JP 2014 September 13, 2014 1 / 35
- 2. About Myself Kimikazu Kato Chief Scientists at Silver Egg Technology Co., Ltd. Ph.D in Computer Science Background in Mathematics, Numerical Computation, Algorithms, etc. <2 year experience in Python >10 year experience in numerical computation Now designing algorithms for recommendation system, and doing research about machine learning and data analysis. 2 / 35
- 3. This talk... is about effective usage of NumPy/SciPy is NOT exhaustive introduction of capabilities, but shows some case studies based on my experience and interest 3 / 35
- 4. Table of Contents Introduction Basics about NumPy Broadcasting Indexing Sparse matrix Usage of scipy.sparse Internal structure Case studies Conclusion 4 / 35
- 5. Numerical Computation Differential equations Simulations Signal processing Machine Learning etc... Why Numerical Computation in Python? Productivity Easy to write Easy to debug Connectivity with visualization tools Matplotlib IPython Connectivity with web system Many frameworks (Django, Pyramid, Flask, Bottle, etc.) 5 / 35
- 6. But Python is Very Slow! Code in C #include <stdio.h> int main() { int i; double s=0; for (i=1; i<=100000000; i++) s+=i; printf("%.0fn",s); } Code in Python s=0. for i in xrange(1,100000001): s+=i print s Both of the codes compute the sum of integers from 1 to 100,000,000. Result of benchmark in a certain environment: Above: 0.109 sec (compiled with -O3 option) Below: 8.657 sec (80+ times slower!!) 6 / 35
- 7. Better code import numpy as np a=np.arange(1,100000001) print a.sum() Now it takes 0.188 sec. (Measured by "time" command in Linux, loading time included) Still slower than C, but sufficiently fast as a script language. 7 / 35
- 8. Lessons Python is very slow when written badly Translate C (or Java, C# etc.) code into Python is often a bad idea. Python-friendly rewriting sometimes result in drastic performance improvement 8 / 35
- 9. Basic rules for better performance Avoid for-sentence as far as possible Utilize libraries' capabilities instead Forget about the cost of copying memory Typical C programmer might care about it, but ... 9 / 35
- 10. Basic techniques for NumPy Broadcasting Indexing 10 / 35
- 11. Broadcasting >>> import numpy as np >>> a=np.array([0,1,2]) >>> a*3 array([0, 3, 6]) >>> b=np.array([1,4,9]) >>> np.sqrt(b) array([ 1., 2., 3.]) A function which is applied to each element when applied to an array is called a universal function. 11 / 35
- 12. Broadcasting (2D) >>> import numpy as np >>> a=np.arange(9).reshape((3,3)) >>> b=np.array([1,2,3]) >>> a array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) >>> b array([1, 2, 3]) >>> a*b array([[ 0, 2, 6], [ 3, 8, 15], [ 6, 14, 24]]) 12 / 35
- 13. Indexing >>> import numpy as np >>> a=np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> indices=np.arange(0,10,2) >>> indices array([0, 2, 4, 6, 8]) >>> a[indices]=0 >>> a array([0, 1, 0, 3, 0, 5, 0, 7, 0, 9]) >>> b=np.arange(100,600,100) >>> b array([100, 200, 300, 400, 500]) >>> a[indices]=b >>> a array([100, 1, 200, 3, 300, 5, 400, 7, 500, 9]) 13 / 35
- 14. Refernces Gabriele Lanaro, "Python High Performance Programming," Packt Publishing, 2013. Stéfan van der Walt, Numpy Medkit 14 / 35
- 15. Sparse matrix Defined as a matrix in which most elements are zero Compressed data structure is used to express it, so that it will be... Space effective Time effective 15 / 35
- 16. scipy.sparse The class scipy.sparse has mainly three types as expressions of a sparse matrix. (There are other types but not mentioned here) lil_matrix : convenient to set data; setting a[i,j] is fast csr_matrix : convenient for computation, fast to retrieve a row csc_matrix : convenient for computation, fast to retrieve a column Usually, set the data into lil_matrix, and then, convert it to csc_matrix or csr_matrix. For csr_matrix, and csc_matrix, calcutaion of matrices of the same type is fast, but you should avoid calculation of different types. 16 / 35
- 17. Use case >>> from scipy.sparse import lil_matrix, csr_matrix >>> a=lil_matrix((3,3)) >>> a[0,0]=1.; a[0,2]=2. >>> a=a.tocsr() >>> print a (0, 0) 1.0 (0, 2) 2.0 >>> a.todense() matrix([[ 1., 0., 2.], [ 0., 0., 0.], [ 0., 0., 0.]]) >>> b=lil_matrix((3,3)) >>> b[1,1]=3.; b[2,0]=4.; b[2,2]=5. >>> b=b.tocsr() >>> b.todense() matrix([[ 0., 0., 0.], [ 0., 3., 0.], [ 4., 0., 5.]]) >>> c=a.dot(b) >>> c.todense() matrix([[ 8., 0., 10.], [ 0., 0., 0.], [ 0., 0., 0.]]) >>> d=a+b >>> d.todense() matrix([[ 1., 0., 2.], [ 0., 3., 0.], [ 4., 0., 5.]]) 17 / 35
- 18. Internal structure: csr_matrix >>> from scipy.sparse import lil_matrix, csr_matrix >>> a=lil_matrix((3,3)) >>> a[0,1]=1.; a[0,2]=2.; a[1,2]=3.; a[2,0]=4.; a[2,1]=5. >>> b=a.tocsr() >>> b.todense() matrix([[ 0., 1., 2.], [ 0., 0., 3.], [ 4., 5., 0.]]) >>> b.indices array([1, 2, 2, 0, 1], dtype=int32) >>> b.data array([ 1., 2., 3., 4., 5.]) >>> b.indptr array([0, 2, 3, 5], dtype=int32) 18 / 35
- 19. Internal structure: csc_matrix >>> from scipy.sparse import lil_matrix, csr_matrix >>> a=lil_matrix((3,3)) >>> a[0,1]=1.; a[0,2]=2.; a[1,2]=3.; a[2,0]=4.; a[2,1]=5. >>> b=a.tocsc() >>> b.todense() matrix([[ 0., 1., 2.], [ 0., 0., 3.], [ 4., 5., 0.]]) >>> b.indices array([2, 0, 2, 0, 1], dtype=int32) >>> b.data array([ 4., 1., 5., 2., 3.]) >>> b.indptr array([0, 1, 3, 5], dtype=int32) 19 / 35
- 20. Merit of knowing the internal structure Setting csr_matrix or csc_matrix with its internal structure is much faster than setting lil_matrix with indices. See the benchmark of setting ý ý ý 20 / 35
- 21. from scipy.sparse import lil_matrix, csr_matrix import numpy as np from timeit import timeit def set_lil(n): a=lil_matrix((n,n)) for i in xrange(n): a[i,i]=2. if i+1n: a[i,i+1]=1. return a def set_csr(n): data=np.empty(2*n-1) indices=np.empty(2*n-1,dtype=np.int32) indptr=np.empty(n+1,dtype=np.int32) # to be fair, for-sentence is intentionally used # (using indexing technique is faster) for i in xrange(n): indices[2*i]=i data[2*i]=2. if in-1: indices[2*i+1]=i+1 data[2*i+1]=1. indptr[i]=2*i indptr[n]=2*n-1 a=csr_matrix((data,indices,indptr),shape=(n,n)) return a print lil:,timeit(set_lil(10000), number=10,setup=from __main__ import set_lil) print csr:,timeit(set_csr(10000), number=10,setup=from __main__ import set_csr) 21 / 35
- 22. Result: lil: 11.6730761528 csr: 0.0562081336975 Remark When you deal with already sorted data, setting csr_matrix or csc_matrix with data, indices, indptr is much faster than setting lil_matrix But the code tend to be more complicated if you use the internal structure of csr_matrix or csc_matrix 22 / 35
- 23. Case Studies 23 / 35
- 24. Case 1: Norms If 2 is dense: norm=np.dot(v,v) Ï2 Ï % 2% Expressed as product of matrices. (dot means matrix product, but you don't have to take transpose explicitly.) When is sparse, suppose that is expressed as matrix: 2 2 g * norm=v.multiply(v).sum() (multiply() is element-wise product) This is because taking transpose of a sparse matrix changes the type. 24 / 35
- 25. Frobenius norm: norm=a.multiply(a).sum() ÏÏ'SP % % 25 / 35
- 26. Case 2: Applying a function to all of the elements of a sparse matrix A universal function can be applied to a dense matrix: import numpy as np a=np.arange(9).reshape((3,3)) a array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) np.tanh(a) array([[ 0. , 0.76159416, 0.96402758], [ 0.99505475, 0.9993293 , 0.9999092 ], [ 0.99998771, 0.99999834, 0.99999977]]) This is convenient and fast. However, we cannot do the same thing for a sparse matrix. 26 / 35
- 27. from scipy.sparse import lil_matrix a=lil_matrix((3,3)) a[0,0]=1. a[1,0]=2. b=a.tocsr() np.tanh(b) 3x3 sparse matrix of type 'type 'numpy.float64'' with 2 stored elements in Compressed Sparse Row format This is because, for an arbitrary function, its application to a sparse matrix is not necessarily sparse. However, if a universal function satisfies , the density is preserved. Then, how can we compute it? 27 / 35
- 28. Use the internal structure!! The positions of the non-zero elements are not changed after application of the function. Keep indices and indptr, and just change data. Solution: b = csr_matrix((np.tanh(a.data), a.indices, a.indptr), shape=a.shape) 28 / 35
- 29. Case 3: Formula which appears in a paper In the algorithm for recommendation system [1], the following formula appears: øø * g where is dense matrix, and D is a diagonal matrix defined from a given array as: % ý * Here, (which corresponds to the number of users or items) is big and (which means the number of latent factors) is small. [1] Hu et al. Collaborative Filtering for Implicit Feedback Datasets, ICDM, 2008. * 29 / 35
- 30. Solution 1: There is a special class dia_matrix to deal with a diagonal sparse matrix. import scipy.sparse as sparse import numpy as np def f(a,d): a: 2d array of shape (n,f), d: 1d array of length n dd=sparse.diags([d],[0]) return np.dot(a.T,dd.dot(a)) 30 / 35
- 31. Solution 2: Pack csr_matrix with data,indices,indptr data=d indices=[0,1,..,n] indptr=[0,1,...,n+1] def g(a,d): n,f=a.shape data=d indices=np.arange(n) indptr=np.arange(n+1) dd=sparse.csr_matrix((data,indices,indptr),shape=(n,n)) return np.dot(a.T,dd.dot(a)) 31 / 35
- 32. Solution 3: û ) û ) g g û ) û ) This is equivalent to the broadcasting! def h(a,d): return np.dot(a.T*d,a) ü ü ü * * û *) ý * ü ü g ü * * * * û *) * 32 / 35
- 33. Benchmark def datagen(n,f): np.random.seed(0) a=np.random.random((n,f)) d=np.random.random(n) return a,d from timeit import timeit print dia_matrix :,timeit(f(a,d),number=10, setup=from __main__ import f,datagen; a,d=datagen(1000000,10)) print csr_matrix :,timeit(g(a,d),number=10, setup=from __main__ import g,datagen; a,d=datagen(1000000,10)) print broadcasting :,timeit(h(a,d),number=10, setup=from __main__ import h,datagen; a,d=datagen(1000000,10)) Result: dia_matrix : 1.60458707809 csr_matrix : 1.32580018044 broadcasting : 1.30032682419 33 / 35
- 34. Conclusion Try not to use for-sentence, but use libraries' capabilities instead. Knowledge about the internal structure of the sparse matrix is useful to extract further performance. Mathematical derivation is important. The key is to find a mathematically equivalent and Python-friendly formula. Computational speed does not necessarily matter. Finding a better code in a short time is valuable. Otherwise, you shouldn't pursue too much. 34 / 35
- 35. Acknowledgment I would like to thank (@shima__shima) who gave me useful advice in Twitter. 35 / 35