Big Data brings with it particular challenges in any language, mostly in performance. This talk will explain how to get immediate speedups in your Python code by exploiting both timeless programming techniques and fixes specific to Python. We will cover:

I. Amongst Our Weaponry
1. How to Time and Profile Python
2. Extracting Loop Invariants: constants, lookup tables, even methods!
3. Caching: memoization and heavier things

II. Gunfight at the O.K. Corral in Morse Code
1. Python functions vs C functions
2. Vector operations: NumPy
3. Reducing calls: loops, generators, recursion

III. The Semaphore Version of Wuthering Heights
1. Using select instead of Queue
2. Serialization overhead
3. Parallelizing work
5. Caveat: Wall time not CPU time
>>> from timeit import Timer
>>> Timer("xrange(1000)").timeit()
0.20040297508239746
>>> Timer("xrange(1000)").repeat(3)
[0.20735883712768555,
0.1968221664428711,
0.18882489204406738]
Take the minimum of the repeats.
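The time-then-take-the-minimum pattern can be wrapped in a small helper; a minimal Python 3 sketch (the `best_of` name is mine, and `list(range(...))` stands in for the deck's Python 2 `xrange`):

```python
from timeit import Timer

def best_of(stmt, setup="pass", repeats=3, number=10000):
    """Run the statement `repeats` times and keep the minimum,
    which best approximates the true cost (least interference)."""
    return min(Timer(stmt, setup).repeat(repeats, number))

t = best_of("list(range(1000))")
print("%.6f seconds for 10000 runs" % t)
```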
6. How to Profile
>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")
7. How to Profile
>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")
(make changes to module)
>>> reload(mod)
>>> cProfile.run("mod.b()", sort="cumulative")
8. How to Profile
>>> cProfile.run("for i in xrange(3000): range(i).sort()",
...              sort="cumulative")
         6002 function calls in 0.093 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(func)
        1    0.019    0.019    0.093    0.093 <string>:1(<module>)
     3000    0.052    0.000    0.052    0.000 {list.sort}
     3000    0.022    0.000    0.022    0.000 {range}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
9. How to Profile
         6002 function calls in 0.093 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(func)
     3000    0.052    0.000    0.052    0.000 {list.sort}
     3000    0.022    0.000    0.022    0.000 {range}
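The same report can be produced programmatically, which is handy when you want to capture or filter it rather than print to stdout; a minimal Python 3 sketch (the `work` function and the top-5 cutoff are mine):

```python
import cProfile
import io
import pstats

def work():
    # A stand-in workload to profile.
    return sorted(range(1000))

pr = cProfile.Profile()
pr.enable()
for i in range(100):
    work()
pr.disable()

# Render the stats into a string instead of stdout.
buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf)
stats.sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```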
10. Example: Standard Deviation
>>> import numpy
>>> n = 100
>>> a = numpy.array(xrange(n), dtype=float)
>>> a.std(ddof=1)
29.011491975882016
11. Example: Standard Deviation
>>> n = 4000000000
>>> a = numpy.array(xrange(n), dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
12. Example: Standard Deviation
>>> n = 4000000000
>>> arr = numpy.zeros(n, dtype=float)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
14. Example: Standard Deviation
Given array A broken into n parts a1...an,
with per-part variance sum V(ai) = Σj (aij − āi)²:

σ = √( Σi=1..n [ V(ai) + 2(Σj aij)(āi − Ā) + |ai|(Ā² − āi²) ] / (|A| − ddof) )

where āi is the mean of part ai and Ā is the mean of the whole array A.
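This partitioned formula can be sanity-checked numerically against a direct single-pass computation; a minimal Python 3 sketch (the name `partitioned_std` and the test data are mine, not from the talk):

```python
import math

def partitioned_std(parts, ddof=0):
    # Per part: T = sum, L = length, V = sum of squared deviations.
    TLV = []
    for p in parts:
        T, L = float(sum(p)), float(len(p))
        m = T / L
        TLV.append((T, L, sum((x - m) ** 2 for x in p)))
    glength = sum(L for T, L, V in TLV)
    g = sum(T for T, L, V in TLV) / glength   # grand mean
    final = 0.0
    for T, L, V in TLV:
        m = T / L
        # Adjust each local variance sum toward the grand mean.
        final += V + 2 * T * (m - g) + (g ** 2 - m ** 2) * L
    return math.sqrt(final / (glength - ddof))

data = list(range(100))
parts = [data[i:i + 25] for i in range(0, 100, 25)]
result = partitioned_std(parts, ddof=1)
```

The result matches the `a.std(ddof=1)` value from the NumPy example above for the same 0..99 input.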
15. Example: Standard Deviation
def run():
    points = 400000  # (0000): four more zeros for the 4-billion-row run
    segments = 100
    part_len = points / segments
    partitions = []
    for p in range(segments):
        part = range(part_len * p,
                     part_len * (p + 1))
        partitions.append(part)
    return stddev(partitions, ddof=1)
16. Example: Standard Deviation
def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)
        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength
        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj
    return math.sqrt(final / (glength - ddof))
18. Example: Standard Deviation
400000 in 71.025 seconds
Assuming no other effects of scale,
it will take 197.3 hours (over 8 days)
to calculate our 4 billion-row array.
19. Example: Standard Deviation
Can we calculate
our 4 billion-row array in
1 minute?
That’s 400,000 in 6ms.
All we need is an 11,837.5x speedup.
23. Extracting Loop Invariants
def varsum(arr):
    vs = 0
    for j in range(len(arr)):
        mean = (total(arr) / length(arr))
        vs += (arr[j] - mean) ** 2
    return vs
24. Extracting Loop Invariants
def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs
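The payoff of hoisting the mean is easy to measure; a minimal Python 3 sketch (using builtin `sum`/`len` in place of the talk's `total`/`length` helpers, with a small array so the comparison runs quickly):

```python
from timeit import Timer

setup = """
data = list(range(200))

def varsum_slow(arr):
    vs = 0
    for j in range(len(arr)):
        mean = sum(arr) / len(arr)   # recomputed every iteration
        vs += (arr[j] - mean) ** 2
    return vs

def varsum_fast(arr):
    vs = 0
    mean = sum(arr) / len(arr)       # hoisted: computed once
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs
"""
slow = min(Timer("varsum_slow(data)", setup).repeat(3, 50))
fast = min(Timer("varsum_fast(data)", setup).repeat(3, 50))
print("slow %.4fs  fast %.4fs" % (slow, fast))
```

The slow version turns an O(n) loop into O(n²), so the gap widens with array size.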
26. Extracting Loop Invariants
def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)
        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength
        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj
    return math.sqrt(final / (glength - ddof))
27. Extracting Loop Invariants
def stddev(partitions, ddof=0):
    final = 0.0
    # Find the mean of the entire group.
    gtotal = total([total(p) for p in partitions])
    glength = total([length(p) for p in partitions])
    g = gtotal / glength
    for part in partitions:
        m = total(part) / length(part)
        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj
    return math.sqrt(final / (glength - ddof))
39. Reduce Function Calls
>>> Timer("sum(a)", "a = range(10)").repeat(3)
[0.15801000595092773,
 0.1406857967376709,
 0.14577603340148926]
>>> Timer("total(a)",
...       "a = range(10); total = lambda x: sum(x)"
...       ).repeat(3)
[0.2066800594329834,
 0.1998300552368164,
 0.21536493301391602]

≈ 0.000000059 seconds per call
40. Reduce Function Calls
def variances_squared(arr):
    mean = (total(arr) / length(arr))
    for v in arr:
        yield (v - mean) ** 2
41. Reduce Function Calls
def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2
               for v in arr)

def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum([(v - mean) ** 2
                for v in arr])
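Both versions compute the same value; the generator expression streams values into `sum` one at a time, while the list comprehension materializes the whole intermediate list first, which matters for large arrays. A Python 3 sketch with builtin `sum`/`len` standing in for the talk's `total`/`length` helpers:

```python
def varsum_gen(arr):
    # Generator expression: no temporary list is built.
    mean = sum(arr) / len(arr)
    return sum((v - mean) ** 2 for v in arr)

def varsum_list(arr):
    # List comprehension: allocates the full list before summing.
    mean = sum(arr) / len(arr)
    return sum([(v - mean) ** 2 for v in arr])

data = list(range(1000))
```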
52. Parallelization
def run_one(i):
    p = numpy.memmap(
        'stddev.%d' % i, dtype=float,
        mode='r', shape=(part_len,))
    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V
53. Parallelization
def stddev(TLVs, ddof=0):
    final = 0.0
    totals = [T for T, L, V in TLVs]
    lengths = [L for T, L, V in TLVs]
    glength = sum(lengths)
    g = sum(totals) / glength
    for T, L, V in TLVs:
        m = T / L
        adj = ((2 * T * (m - g)) + ((g ** 2 - m ** 2) * L))
        final += V + adj
    return math.sqrt(final / (glength - ddof))
59. Parallelization
def run_one(i):
    p = numpy.memmap(
        'stddev.%d' % i, dtype=float,
        mode='r', shape=(part_len,))
    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V
200 seconds / 4 cores = 50 seconds
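The per-part workers can be fanned out across cores with `multiprocessing`; a minimal in-memory Python 3 sketch (plain list partitions instead of the talk's memmap files, and `Pool(4)` mirroring the 4 cores above):

```python
import math
from multiprocessing import Pool

def run_one(part):
    # One worker pass: return (total, length, sum of squared deviations).
    T, L = float(sum(part)), float(len(part))
    m = T / L
    V = sum((x - m) ** 2 for x in part)
    return T, L, V

def stddev(TLVs, ddof=0):
    # Combine the per-part (T, L, V) triples into one stddev.
    glength = sum(L for T, L, V in TLVs)
    g = sum(T for T, L, V in TLVs) / glength
    final = 0.0
    for T, L, V in TLVs:
        m = T / L
        final += V + (2 * T * (m - g)) + ((g ** 2 - m ** 2) * L)
    return math.sqrt(final / (glength - ddof))

if __name__ == "__main__":
    data = list(range(400))
    parts = [data[i:i + 100] for i in range(0, 400, 100)]
    with Pool(4) as pool:
        print(stddev(pool.map(run_one, parts), ddof=1))
```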
65. I Crush Your Head!
With some time-tested general
programming techniques:
Extract loop invariants
Use language builtins
Reduce function calls
66. I Crush Your Head!
And some Python libraries
for architectural improvements:
Use NumPy for vector ops
Use multiprocessing for parallelization
Use bloscpack for compression
67. I Crush Your Head!
We sped up our calculation
so that it runs in:
0.003% of the time
or 27317 times faster
4.4 orders of magnitude
68. Crushing the Head of the Snake
Any questions?
@aminusfu
bob@crunch.io