Crushing the Head of the Snake by Robert Brewer PyData SV 2014

Big Data brings with it particular challenges in any language, mostly in performance. This talk will explain how to get immediate speedups in your Python code by exploiting both timeless programming techniques and fixes specific to Python. We will cover:

I. Amongst Our Weaponry
   1. How to time and profile Python
   2. Extracting loop invariants: constants, lookup tables, even methods!
   3. Caching: memoization and heavier things
II. Gunfight at the O.K. Corral in Morse Code
   1. Python functions vs. C functions
   2. Vector operations: NumPy
   3. Reducing calls: loops, generators, recursion
III. The Semaphore Version of Wuthering Heights
   1. Using select instead of Queue
   2. Serialization overhead
   3. Parallelizing work



  1. Crushing the Head of the Snake
     Robert Brewer, Chief Architect, Crunch.io
  2. How to Time

     >>> from timeit import Timer
     >>> range(5)
     [0, 1, 2, 3, 4]
     >>> t = Timer("range(a)", "a = 1000000")
     >>> t.timeit(1)
     0.028472900390625
     >>> t.timeit(100)
     1.8600409030914307
     >>> t.timeit(1000)
     18.056041955947876

  3. Comparing Algorithms

     >>> Timer("range(1000)").timeit(1000000)  # the default number is 1,000,000
     >>> Timer("range(1000)").timeit()
     11.392634868621826
     >>> Timer("xrange(1000)").timeit()
     0.20040297508239746
     >>> Timer("list(xrange(1000))").timeit()
     12.207480907440186

  4. Caveat: Overhead

     >>> Timer().timeit(1000000)
     0.029289960861206055

  5. Caveat: Wall Time, Not CPU Time

     >>> Timer("xrange(1000)").timeit()
     0.20040297508239746
     >>> Timer("xrange(1000)").repeat(3)
     [0.20735883712768555, 0.1968221664428711, 0.18882489204406738]

     Take the minimum of the repeats.
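The same take-the-minimum pattern in modern Python 3 syntax (where `range` is already lazy, so this sketch times `list(range(...))` instead) might look like:

```python
from timeit import Timer

# repeat() returns one total per run; the minimum is the least-noisy
# estimate, since background load can only make a run slower.
timer = Timer("list(range(1000))")
results = timer.repeat(repeat=3, number=10000)
best = min(results)
print("best of 3:", best, "seconds for 10000 runs")
```

The statement string and iteration counts here are illustrative; on any machine the minimum of several repeats is a steadier number than a single `timeit()` call.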
  6. How to Profile

     >>> import mod
     >>> import cProfile
     >>> cProfile.run("mod.b()", sort="cumulative")

  7. How to Profile

     >>> import mod
     >>> import cProfile
     >>> cProfile.run("mod.b()", sort="cumulative")
     (make changes to the module)
     >>> reload(mod)
     >>> cProfile.run("mod.b()", sort="cumulative")

  8. How to Profile

     >>> cProfile.run("for i in xrange(3000): range(i).sort()", sort="cumulative")
     6002 function calls in 0.093 seconds

     Ordered by: cumulative time

     ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
          1    0.019    0.019    0.093    0.093  <string>:1(<module>)
       3000    0.052    0.000    0.052    0.000  {list.sort}
       3000    0.022    0.000    0.022    0.000  {range}
          1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

  9. How to Profile

     6002 function calls in 0.093 seconds

     ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
       3000    0.052    0.000    0.052    0.000  {list.sort}
       3000    0.022    0.000    0.022    0.000  {range}
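The same workflow can be driven from code rather than the interactive prompt; this sketch uses `cProfile.Profile` with `pstats` (both in the standard library, and unchanged in Python 3), with a stand-in workload since `mod.b()` is not shown in the talk:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for mod.b(): sort a few thousand lists, as on the slide.
    for i in range(3000):
        sorted(range(i))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the same "sorted by cumulative time" report into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Capturing the report in a string (instead of printing straight to stdout) makes it easy to log or diff profiles between runs.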
  10. Example: Standard Deviation

      >>> import numpy
      >>> n = 100
      >>> a = numpy.array(xrange(n), dtype=float)
      >>> a.std(ddof=1)
      29.011491975882016

  11. Example: Standard Deviation

      >>> n = 4000000000
      >>> a = numpy.array(xrange(n), dtype=float)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      ValueError: setting an array element with a sequence.

  12. Example: Standard Deviation

      >>> n = 4000000000
      >>> arr = numpy.zeros(n, dtype=float)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      MemoryError

  13. Example: Standard Deviation

  14. Example: Standard Deviation

      Given an array A broken into n parts a1...an, with per-part total
      Ti = Σj aij, per-part mean mi, global mean g, and local variance sum
      V(ai) = Σj (aij - mi)²:

      stddev(A) = sqrt( Σi [ V(ai) + 2·Ti·(mi - g) + |ai|·(g² - mi²) ] / (|A| - ddof) )
  15. Example: Standard Deviation

      def run():
          points = 400000  # add four more zeros for the full 4-billion run
          segments = 100
          part_len = points / segments
          partitions = []
          for p in range(segments):
              part = range(part_len * p, part_len * (p + 1))
              partitions.append(part)
          return stddev(partitions, ddof=1)

  16. Example: Standard Deviation

      def stddev(partitions, ddof=0):
          final = 0.0
          for part in partitions:
              m = total(part) / length(part)
              # Find the mean of the entire group.
              gtotal = total([total(p) for p in partitions])
              glength = total([length(p) for p in partitions])
              g = gtotal / glength
              adj = ((2 * total(part) * (m - g)) +
                     ((g ** 2 - m ** 2) * length(part)))
              final += varsum(part) + adj
          return math.sqrt(final / (glength - ddof))

  17. Example: Standard Deviation

      2052106 function calls in 71.025 seconds

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000   71.023   71.023  stddev.py:39(run)
           1    0.006    0.006   71.013   71.013  stddev.py:22(stddev)
      410400   63.406    0.000   70.490    0.000  stddev.py:4(total)
         100    0.341    0.003   69.178    0.692  stddev.py:15(varsum)
      410601    7.076    0.000    7.076    0.000  {range}
      410200    0.151    0.000    0.174    0.000  stddev.py:11(length)
      820700    0.042    0.000    0.042    0.000  {len}
         100    0.000    0.000    0.000    0.000  {list.append}
           1    0.000    0.000    0.000    0.000  {math.sqrt}

  18. Example: Standard Deviation

      400,000 points in 71.025 seconds. Assuming no other effects of scale,
      it will take 197.3 hours (over 8 days) to calculate our
      4-billion-row array.

  19. Example: Standard Deviation

      Can we calculate our 4-billion-row array in 1 minute?
      That's 400,000 points in 6 ms.
      All we need is an 11,837.5x speedup.

  20. Optimization

  21. Example: Standard Deviation

      (the profile from slide 17 again: 2052106 function calls in
      71.025 seconds, dominated by total and varsum)
  22. Amongst Our Weaponry

      Extracting loop invariants

  23. Extracting Loop Invariants

      def varsum(arr):
          vs = 0
          for j in range(len(arr)):
              mean = (total(arr) / length(arr))
              vs += (arr[j] - mean) ** 2
          return vs

  24. Extracting Loop Invariants

      def varsum(arr):
          vs = 0
          mean = (total(arr) / length(arr))
          for j in range(len(arr)):
              vs += (arr[j] - mean) ** 2
          return vs
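A self-contained Python 3 sketch of the same hoisting move (the function names here are illustrative, not taken from the talk's stddev.py):

```python
def varsum_unhoisted(arr):
    vs = 0.0
    for j in range(len(arr)):
        # The mean is recomputed on every iteration: an O(n) cost per step.
        mean = sum(arr) / len(arr)
        vs += (arr[j] - mean) ** 2
    return vs

def varsum_hoisted(arr):
    # The mean does not change inside the loop, so compute it once.
    mean = sum(arr) / len(arr)
    return sum((v - mean) ** 2 for v in arr)

data = list(range(10))
print(varsum_unhoisted(data), varsum_hoisted(data))  # both 82.5
```

The results are identical; only the unhoisted version pays O(n²) where O(n) suffices.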
  25. Extracting Loop Invariants

      52606 function calls in 1.944 seconds (36x)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    1.942    1.942  stddev1.py:41(run)
           1    0.006    0.006    1.932    1.932  stddev1.py:23(stddev)
       10500    1.673    0.000    1.859    0.000  stddev1.py:4(total)
       10701    0.196    0.000    0.196    0.000  {range}
         100    0.062    0.001    0.081    0.001  stddev1.py:15(varsum)
       10300    0.003    0.000    0.003    0.000  stddev1.py:11(length)
       20900    0.001    0.000    0.001    0.000  {len}
         100    0.000    0.000    0.000    0.000  {list.append}
           1    0.000    0.000    0.000    0.000  {math.sqrt}

      Still 5.4 hours at scale.

  26. Extracting Loop Invariants

      def stddev(partitions, ddof=0):
          final = 0.0
          for part in partitions:
              m = total(part) / length(part)
              # Find the mean of the entire group.
              gtotal = total([total(p) for p in partitions])
              glength = total([length(p) for p in partitions])
              g = gtotal / glength
              adj = ((2 * total(part) * (m - g)) +
                     ((g ** 2 - m ** 2) * length(part)))
              final += varsum(part) + adj
          return math.sqrt(final / (glength - ddof))

  27. Extracting Loop Invariants

      def stddev(partitions, ddof=0):
          final = 0.0
          # Find the mean of the entire group.
          gtotal = total([total(p) for p in partitions])
          glength = total([length(p) for p in partitions])
          g = gtotal / glength
          for part in partitions:
              m = total(part) / length(part)
              adj = ((2 * total(part) * (m - g)) +
                     ((g ** 2 - m ** 2) * length(part)))
              final += varsum(part) + adj
          return math.sqrt(final / (glength - ddof))

  28. Extracting Loop Invariants

      2512 function calls in 0.142 seconds (13x)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    0.140    0.140  stddev1.py:42(run)
           1    0.000    0.000    0.136    0.136  stddev1.py:23(stddev)
         100    0.063    0.001    0.082    0.001  stddev1.py:15(varsum)
         402    0.064    0.000    0.071    0.000  stddev1.py:4(total)
         603    0.013    0.000    0.013    0.000  {range}
         400    0.000    0.000    0.000    0.000  stddev1.py:11(length)
         902    0.000    0.000    0.000    0.000  {len}
         100    0.000    0.000    0.000    0.000  {list.append}
           1    0.000    0.000    0.000    0.000  {math.sqrt}

      Still 23 minutes at scale.

  29. Amongst Our Weaponry

      Use builtin Python functions whenever possible

  30. Use Python Builtins

      def total(arr):
          s = 0
          for j in range(len(arr)):
              s += arr[j]
          return s

  31. Use Python Builtins

      def total(arr):
          s = 0
          for j in range(len(arr)):
              s += arr[j]
          return s

      def total(arr):
          return sum(arr)
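A small Python 3 check that the builtin really is a drop-in replacement; the timing comparison is left to `timeit`, since absolute numbers vary by machine (the `sum` loop runs in C, so it is typically several times faster than the hand-rolled version):

```python
from timeit import Timer

def total_loop(arr):
    # Hand-rolled accumulation: one interpreted loop iteration per element.
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s

data = list(range(1000))
assert total_loop(data) == sum(data) == 499500  # identical results

loop_time = min(Timer("total_loop(data)", globals=globals()).repeat(3, 1000))
builtin_time = min(Timer("sum(data)", globals=globals()).repeat(3, 1000))
print(loop_time, builtin_time)
```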
  32. Use Python Builtins

      2110 function calls in 0.096 seconds (1.47x)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    0.093    0.093  stddev1.py:39(run)
           1    0.000    0.000    0.083    0.083  stddev1.py:20(stddev)
         100    0.065    0.001    0.070    0.001  stddev1.py:12(varsum)
         402    0.000    0.000    0.015    0.000  stddev1.py:4(total)
         402    0.015    0.000    0.015    0.000  {sum}
         201    0.012    0.000    0.012    0.000  {range}
         400    0.000    0.000    0.000    0.000  stddev1.py:8(length)
         500    0.000    0.000    0.000    0.000  {len}
         100    0.000    0.000    0.000    0.000  {list.append}
           1    0.000    0.000    0.000    0.000  {math.sqrt}

      Still 16 minutes at scale.

  33. Use Python Builtins

      (same profile as slide 32; note that varsum now dominates the time)

  34. Use Python Builtins

      def varsum(arr):
          vs = 0
          mean = (total(arr) / length(arr))
          for j in range(len(arr)):
              vs += (arr[j] - mean) ** 2
          return vs

  35. Use Python Builtins

      def varsum(arr):
          mean = (total(arr) / length(arr))
          return sum((v - mean) ** 2 for v in arr)

  36. Use Python Builtins

      402110 function calls in 0.122 seconds (1.27x slower!)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    0.120    0.120  stddev.py:36(run)
           1    0.000    0.000    0.115    0.115  stddev.py:17(stddev)
         502    0.044    0.000    0.114    0.000  {sum}
         100    0.000    0.000    0.106    0.001  stddev.py:12(varsum)
      400100    0.070    0.000    0.070    0.000  stddev.py:14(genexpr)
         402    0.000    0.000    0.011    0.000  stddev.py:4(total)
         ...
  37. Amongst Our Weaponry

      Reduce function calls

  38. Reduce Function Calls

      >>> Timer("sum(a)", "a = range(10)").repeat(3)
      [0.15801000595092773, 0.1406857967376709, 0.14577603340148926]
      >>> Timer("total(a)",
      ...       "a = range(10); total = lambda x: sum(x)").repeat(3)
      [0.2066800594329834, 0.1998300552368164, 0.21536493301391602]

      About 0.000000059 seconds of overhead per call.

  39. Reduce Function Calls

      def variances_squared(arr):
          mean = (total(arr) / length(arr))
          for v in arr:
              yield (v - mean) ** 2

  40. Reduce Function Calls

      def varsum(arr):
          mean = (total(arr) / length(arr))
          return sum((v - mean) ** 2 for v in arr)

      def varsum(arr):
          mean = (total(arr) / length(arr))
          return sum([(v - mean) ** 2 for v in arr])
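The overhead measurement on slide 38, rewritten as a Python 3 sketch (absolute numbers vary by machine; the point is the per-call delta between a direct builtin call and the same work behind one extra Python-level call):

```python
from timeit import Timer

# Direct builtin call vs. the identical work wrapped in a lambda.
direct = min(Timer("sum(a)", "a = list(range(10))").repeat(3))
wrapped = min(Timer("total(a)",
                    "a = list(range(10)); total = lambda x: sum(x)").repeat(3))

# Each default timeit() run is 1,000,000 executions, so the wrapper's
# per-call overhead is roughly:
per_call = (wrapped - direct) / 1_000_000
print(per_call)
```

Tens of nanoseconds per call sounds negligible until, as in the genexpr profile above, the call happens 400,000 times.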
  41. Reduce Function Calls

      2010 function calls in 0.082 seconds (1.17x)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    0.080    0.080  stddev.py:36(run)
           1    0.000    0.000    0.071    0.071  stddev.py:17(stddev)
         100    0.050    0.001    0.056    0.001  stddev.py:12(varsum)
         502    0.020    0.000    0.020    0.000  {sum}
         402    0.000    0.000    0.016    0.000  stddev.py:4(total)
         101    0.009    0.000    0.009    0.000  {range}
         400    0.000    0.000    0.000    0.000  stddev.py:8(length)
         400    0.000    0.000    0.000    0.000  {len}
         100    0.000    0.000    0.000    0.000  {list.append}
           1    0.000    0.000    0.000    0.000  {math.sqrt}

      Still 13+ minutes at scale.
  42. Amongst Our Weaponry

      Vector operations with NumPy

  43. Vector Operations

      part = numpy.array(xrange(...), dtype=float)

      def total(arr):
          return arr.sum()

      def varsum(arr):
          return ((arr - arr.mean()) ** 2).sum()
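A runnable Python 3 miniature of those vectorized replacements, checked against NumPy's own `std` (the array size of 100 mirrors slide 10; everything else is illustrative):

```python
import math
import numpy as np

part = np.arange(100, dtype=float)

def total(arr):
    # One C-level reduction instead of a Python-level loop.
    return arr.sum()

def varsum(arr):
    # Elementwise subtract, square, and sum all happen in C.
    return ((arr - arr.mean()) ** 2).sum()

std = math.sqrt(varsum(part) / (len(part) - 1))
print(std)  # matches part.std(ddof=1): 29.011491975882016
```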
  44. Vector Operations

      3408 function calls in 0.057 seconds (1.43x)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    0.057    0.057  stddev1.py:37(run)
         200    0.051    0.000    0.051    0.000  {numpy...array}
           1    0.001    0.001    0.006    0.006  stddev1.py:18(stddev)
         500    0.003    0.000    0.003    0.000  {numpy.ufunc.reduce}
         100    0.001    0.000    0.003    0.000  stddev1.py:14(varsum)
         400    0.000    0.000    0.003    0.000  {numpy.ndarray.sum}
         300    0.000    0.000    0.002    0.000  stddev1.py:6(total)
         100    0.000    0.000    0.001    0.000  {numpy.ndarray.mean}
         ...

      Still 9.5 minutes at scale.

  45. Vector Operations

      (same profile as slide 44; nearly all of the time is now spent
      building the arrays, not doing the math)

  46. Vector Operations

      3408 function calls in 0.006 seconds (13.6x)
      (the same profile with array construction excluded)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.001    0.001    0.006    0.006  stddev1.py:18(stddev)
         500    0.003    0.000    0.003    0.000  {numpy.ufunc.reduce}
         100    0.001    0.000    0.003    0.000  stddev1.py:14(varsum)
         400    0.000    0.000    0.003    0.000  {numpy.ndarray.sum}
         300    0.000    0.000    0.002    0.000  stddev1.py:6(total)
         100    0.000    0.000    0.001    0.000  {numpy.ndarray.mean}
         ...

      At that rate, the full run should take almost exactly 1 minute.

  47. Vector Operations

      Let's try 4 billion! Bump up that N...

  48. Vector Operations

      MemoryError

      Oh, yeah...
  49. Amongst Our Weaponry

      Parallelization with multiprocessing

  50. Parallelization

      from multiprocessing import Pool

      def run():
          results = Pool().map(run_one, range(segments))
          result = stddev(results)
          return result

  51. Parallelization

      def run_one(i):
          p = numpy.memmap('stddev.%d' % i, dtype=float,
                           mode='r', shape=(part_len,))
          T, L = p.sum(), float(len(p))
          m = T / L
          V = ((p - m) ** 2).sum()
          return T, L, V
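A small, self-contained version of the memmap round trip in Python 3 (the file path and partition size are illustrative; in the talk the partition files are written ahead of time and each worker only reads its own):

```python
import os
import tempfile
import numpy as np

part_len = 1000
path = os.path.join(tempfile.mkdtemp(), "stddev.0")

# Write one partition to disk...
writer = np.memmap(path, dtype=float, mode="w+", shape=(part_len,))
writer[:] = np.arange(part_len, dtype=float)
writer.flush()
del writer

# ...then read it back the way run_one() does and compute (T, L, V).
p = np.memmap(path, dtype=float, mode="r", shape=(part_len,))
T, L = p.sum(), float(len(p))
V = ((p - T / L) ** 2).sum()
print(T, L)  # 499500.0 1000.0
```

Because a memmap pages data in from disk on demand, the full 4-billion-element array never has to fit in one process's memory.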
  52. Parallelization

      def stddev(TLVs, ddof=0):
          final = 0.0
          totals = [T for T, L, V in TLVs]
          lengths = [L for T, L, V in TLVs]
          glength = sum(lengths)
          g = sum(totals) / glength
          for T, L, V in TLVs:
              m = T / L
              adj = ((2 * T * (m - g)) +
                     ((g ** 2 - m ** 2) * L))
              final += V + adj
          return math.sqrt(final / (glength - ddof))
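A quick Python 3 check that combining per-partition (total, length, varsum) triples this way really reproduces the direct standard deviation (the test data and partition sizes below are arbitrary):

```python
import math

def stddev_combined(tlvs, ddof=0):
    # Combine per-partition (T, L, V) triples, as in the slide's stddev().
    glength = sum(L for T, L, V in tlvs)
    g = sum(T for T, L, V in tlvs) / glength
    final = 0.0
    for T, L, V in tlvs:
        m = T / L
        final += V + (2 * T * (m - g)) + ((g ** 2 - m ** 2) * L)
    return math.sqrt(final / (glength - ddof))

data = [float(x * x % 97) for x in range(400)]
parts = [data[i:i + 100] for i in range(0, 400, 100)]
tlvs = [(sum(p), float(len(p)),
         sum((v - sum(p) / len(p)) ** 2 for v in p)) for p in parts]

mean = sum(data) / len(data)
direct = math.sqrt(sum((v - mean) ** 2 for v in data) / (len(data) - 1))
print(abs(stddev_combined(tlvs, ddof=1) - direct) < 1e-9)  # True
```

The per-partition adjustment works because 2T(m - g) + L(g² - m²) expands to L(m - g)², the classic between-group term when pooling variances.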
  53. Parallelization

      3734 function calls in 0.024 seconds (6x slower)

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000    0.024    0.024  stddev.py:47(run)
           4    0.000    0.000    0.011    0.003  threading.py:234(wait)
          22    0.011    0.000    0.011    0.000  {thread.lock.acquire}
           1    0.000    0.000    0.011    0.011  pool.py:222(map)
           1    0.000    0.000    0.008    0.008  pool.py:113(__init__)
           4    0.001    0.000    0.005    0.001  process.py:116(start)
           1    0.003    0.003    0.005    0.005  stddev.py:11(stddev)
           4    0.000    0.000    0.004    0.001  forking.py:115(init)
           4    0.003    0.001    0.003    0.001  {posix.fork}
           ...

  54. Parallelization

      Could that waiting be insignificant when we scale up to 4 billion?
      Let's try it!

  55. Parallelization

      3766 function calls in 67.811 seconds

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000   67.811   67.811  stddev.py:47(run)
           4    0.000    0.000   67.747   16.930  threading.py:234(wait)
          22   67.747    3.079   67.747    3.079  {thread.lock.acquire}
           1    0.000    0.000   67.747   67.747  pool.py:222(map)
           1    0.000    0.000    0.062    0.060  pool.py:113(__init__)
           4    0.000    0.000    0.058    0.014  process.py:116(start)
           4    0.057    0.014    0.057    0.014  {posix.fork}
           1    0.003    0.003    0.005    0.005  stddev.py:11(stddev)
           2    0.002    0.001    0.002    0.001  {sum}

      SO CLOSE! 1.13 minutes.

  56. Parallelization

      def run_one(i):
          if i == 50:
              cProfile.runctx(..., "prf.50")

      >>> import pstats
      >>> s = pstats.Stats("prf.50")
      >>> s.sort_stats("cumulative")
      <pstats.Stats instance at 0x2bddcb0>
      >>> _.print_stats()

  57. Parallelization

      57 function calls in 2.804 seconds

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.431    0.431    2.791    2.791  stddev.py:43(run_one)
           2    0.000    0.000    2.360    1.180  numpy.ndarray.sum
           2    2.360    1.180    2.360    1.180  numpy.ufunc.reduce
           1    0.000    0.000    0.000    0.000  memmap.py:195(__new__)

  58. Parallelization

      def run_one(i):
          p = numpy.memmap('stddev.%d' % i, dtype=float,
                           mode='r', shape=(part_len,))
          T, L = p.sum(), float(len(p))
          m = T / L
          V = ((p - m) ** 2).sum()
          return T, L, V

      200 seconds of worker time / 4 cores = 50 seconds

  59. Parallelization? Serialization!

      67.8 seconds for 4 billion rows, but about 50 of those are
      loading data! Only 17.8 seconds go to the actual math.
  60. Serialization

      import bloscpack as bp

      bargs = bp.args.DEFAULT_BLOSC_ARGS
      bargs['clevel'] = 6
      bp.pack_ndarray_file(part, fname, blosc_args=bargs)

      part = bp.unpack_ndarray_file(fname)

  61. Serialization

      Let's try it!

  62. I Crush Your Head!

  63. I Crush Your Head!

      1153 function calls in 26.166 seconds

      ncalls  tottime  percall  cumtime  percall  filename:lineno(func)
           1    0.000    0.000   26.166   26.166  stddev_bp.py:56(run)
           4    0.000    0.000   26.134     6.53  threading.py:234(wait)
          22   26.134    1.188   26.134    1.188  {thread.lock.acquire}
           1    0.000    0.000   26.133   26.133  pool.py:222(map)
           1    0.000    0.000   26.133   26.133  pool.py:521(get)
           1    0.000    0.000   26.133   26.133  pool.py:513(wait)
           1    0.003    0.003    0.030    0.030  __init__.py:227(Pool)
           1    0.000    0.000    0.021    0.021  pool.py:113(__init__)

  64. I Crush Your Head!

      With some time-tested general programming techniques:
        • Extract loop invariants
        • Use language builtins
        • Reduce function calls

  65. I Crush Your Head!

      And some Python libraries for architectural improvements:
        • Use NumPy for vector ops
        • Use multiprocessing for parallelization
        • Use bloscpack for compression

  66. I Crush Your Head!

      We sped up our calculation so that it runs in 0.003% of the time:
      27,317 times faster, or 4.4 orders of magnitude.

  67. Crushing the Head of the Snake

      Any questions?

      @aminusfu
      bob@crunch.io
