Processing biggish data on commodity hardware: simple Python patterns

SciPy 2013 talk on simple Python patterns for processing large datasets efficiently in Python.

The talk focuses on the patterns and concepts rather than on the implementations. The implementations can be found in the joblib and scikit-learn codebases.


  1. Processing biggish data on commodity hardware: simple Python patterns. Gaël Varoquaux, INRIA/Parietal – Neurospin. Disclaimer: I'm French, I have opinions. We're in Texas, I hope y'all have left your guns outside. Yeah, I know, Texas is bigger than France.
  2. "Big data": petabytes, distributed storage, computing clusters. Mere mortals: gigabytes, Python programming, off-the-shelf computers (~16 CPUs, 32 GB RAM).
  3. My tools: Python, what else? Plus NumPy and SciPy. The ndarray is underused by the data community.
  4. My tools: Python, what else? Patterns in this presentation: scikit-learn (machine learning in Python) and joblib (using Python functions as pipeline jobs).
  5. Design philosophy: (1) fail gracefully: easy to debug, robust to errors; (2) don't solve hard problems: the original problem can be bent; (3) dependencies suck: distribution is an age-old problem; (4) performance matters: waiting kills productivity.
  6. Processing big data: speed-ups in Hadoop, CPUs... Execution pipelines: dataflow programming, parallel computing. Data access: storing, caching.
  7. Processing big data (continued): pipelines can get messy, and databases are tedious.
  8. Five simple Python patterns for efficient data crunching: (1) on-the-fly data reduction, (2) on-line algorithms, (3) parallel processing patterns, (4) caching, (5) fast I/O.
  9. Big how? Two scenarios: many observations (samples), e.g. Twitter; many descriptors per observation (features), e.g. brain scans.
  10. Pattern 1: on-the-fly data reduction.
  11. On-the-fly data reduction: big data is often I/O bound. Layer memory access: CPU caches, RAM, local disks, distant storage. Less data also means less work.
  12. Dropping data, the number one technique for handling large datasets. In a loop: (1) take a random fraction of the data, (2) run the algorithm on that fraction, (3) aggregate results across sub-samplings. Looks like bagging (bootstrap aggregation). Performance tip: run the loop in parallel. This exploits redundancy across observations, and works great when the number of samples is large.
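
     A minimal sketch of this sub-sampling loop, assuming the quantity of interest is just a column-wise mean (the data shape, fraction and number of repetitions are placeholders):

         import numpy as np

         def estimate_on_fraction(X, fraction=0.1, seed=0):
             # Take a random fraction of the rows and run the "algorithm" on it
             rng = np.random.RandomState(seed)
             idx = rng.choice(X.shape[0], size=int(fraction * X.shape[0]),
                              replace=False)
             return X[idx].mean(axis=0)

         X = np.random.normal(size=(100000, 50))   # stand-in for a large dataset
         # Aggregate the results across sub-samplings (this loop is
         # embarrassingly parallel, see pattern 3)
         estimates = [estimate_on_fraction(X, seed=s) for s in range(10)]
         result = np.mean(estimates, axis=0)
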
  13. Dimension reduction: often, individual features have a low SNR. Random projections (which average features): sklearn.random_projection builds random linear combinations of the features. Fast, sub-optimal clustering of features: sklearn.cluster.WardAgglomeration; on images, a super-pixel strategy. Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer, which is stateless and can therefore be used in parallel.
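
     A sketch of two of these reducers in scikit-learn; the data shapes, n_components and n_features values are arbitrary examples:

         import numpy as np
         from sklearn.random_projection import SparseRandomProjection
         from sklearn.feature_extraction.text import HashingVectorizer

         # Random linear combinations of the features
         X = np.random.normal(size=(1000, 10000))
         X_small = SparseRandomProjection(n_components=300).fit_transform(X)

         # Stateless hashing for variable-size observations (e.g. text):
         # no fitted vocabulary, so chunks of documents can be hashed in parallel
         vectorizer = HashingVectorizer(n_features=2 ** 18)
         X_text = vectorizer.transform(["big data on commodity hardware",
                                        "the ndarray is underused"])
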
  14. An example: randomized SVD (sklearn.utils.extmath.randomized_svd), one random projection plus power iterations:

         X = np.random.normal(size=(50000, 200))
         %timeit lapack = linalg.svd(X, full_matrices=False)
         1 loops, best of 3: 6.09 s per loop
         %timeit arpack = splinalg.svds(X, 10)
         1 loops, best of 3: 2.49 s per loop
         %timeit randomized = randomized_svd(X, 10)
         1 loops, best of 3: 303 ms per loop
         linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
         0.0022360679774997738
         linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
         0.0022121161221386925

  15. Pattern 2: on-line algorithms. Process the data one sample at a time.
  16. On-line algorithms: compute the mean of a gazillion numbers. Hard?
  17. On-line algorithms: compute the mean of a gazillion numbers. Hard? No: just do a running mean.
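
     A minimal sketch of the running mean over a stream of chunks (the chunk source here is simulated):

         import numpy as np

         def running_mean(chunks):
             # Keep only two scalars in memory, whatever the size of the stream
             total, count = 0.0, 0
             for chunk in chunks:
                 total += chunk.sum()
                 count += chunk.size
             return total / count

         # A gazillion numbers, served 10 000 at a time
         stream = (np.random.normal(size=10000) for _ in range(1000))
         print(running_mean(stream))   # close to 0
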
  18. Convergence: statistics and speed. If the data are i.i.d., the estimate converges to the expectation. Mini-batch: process observations in small bunches, a trade-off between memory usage and vectorization. Example: K-Means clustering:

         X = np.random.normal(size=(10000, 200))
         scipy.cluster.vq.kmeans(X, 10, iter=2)                             # 11.33 s
         sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)    #  0.62 s

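     Beyond the one-shot fit above, MiniBatchKMeans can also be fed batch by batch through partial_fit; the batch size and source below are made up:

         import numpy as np
         from sklearn.cluster import MiniBatchKMeans

         km = MiniBatchKMeans(n_clusters=10)
         for _ in range(100):
             # Pretend each mini-batch is read off disk or a socket
             batch = np.random.normal(size=(1000, 200))
             km.partial_fit(batch)
         print(km.cluster_centers_.shape)   # (10, 200)
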
  19. Pattern 3: parallel processing patterns.
  20. Parallel processing patterns: focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks.
  21. Parallel processing (continued): workers compete for data access; the memory bus is a bottleneck. On grids: distributed storage.
  22. Parallel processing (continued): choose the right grain of parallelism. Too fine ⇒ overhead; too coarse ⇒ memory shortage. Scale by the relevant cache pool.
  23. Queues, the magic behind joblib.Parallel. Queues are high-performance and concurrent-friendly. Difficulty: a callback on result arrival implies multiple threads in the caller and a risk of deadlocks. The dispatch queue should fill up "slowly" ⇒ pre_dispatch in joblib ⇒ back-and-forth communication, which leaves the door open to race conditions.
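
     A sketch of the embarrassingly parallel for-loop with joblib.Parallel; the worker function is a placeholder, and pre_dispatch is spelled out with its default value only to make the knob visible:

         import numpy as np
         from joblib import Parallel, delayed

         def process_chunk(seed):
             # Placeholder for real per-chunk work
             rng = np.random.RandomState(seed)
             return rng.normal(size=100000).mean()

         # pre_dispatch keeps the dispatch queue from filling up too fast
         results = Parallel(n_jobs=2, pre_dispatch='2 * n_jobs')(
             delayed(process_chunk)(seed) for seed in range(50))
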
  24. What happens where: grand-central dispatch? joblib design: the caller, the dispatch queue, and the collect queue live in the same process. Benefit: robustness. Grand-central-dispatch design: the dispatch queue has a process of its own. Benefit: resource management in nested for loops.
  25. Pattern 4: caching. For reproducible science: avoid manually chained scripts (make-like usage). For performance: avoiding re-computation is the crux of optimization.
  26. The joblib approach: the memoize pattern.

         mem = joblib.Memory(cachedir='.')
         g = mem.cache(f)
         b = g(a)   # computes b using f
         c = g(a)   # retrieves the result from the store

     Challenges in the context of big data: a and b are big. Design goals: a and b can be arbitrary Python objects; no dependencies; drop-in, framework-less code for caching.
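
     A self-contained version of the snippet above, with a made-up costly function and cache directory (newer joblib versions spell the cachedir argument `location`):

         import numpy as np
         import joblib

         mem = joblib.Memory(cachedir='/tmp/joblib_cache')

         def costly(x):
             # Stand-in for a long computation
             return np.vander(x, 100).sum(axis=1)

         g = mem.cache(costly)
         a = np.random.normal(size=10000)
         b = g(a)   # computed, and the result is persisted to disk
         c = g(a)   # same input: the result is read back from the store
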
  27. Efficient input-argument hashing (joblib.hash): compute the md5 of the input arguments. Implementation: (1) create an md5 hash object; (2) subclass the standard-library pickler, a state machine that walks the object graph; (3) walk the object graph: for ndarrays, pass the data pointer to the md5 algorithm (its "update" method), and pickle the rest; (4) update the md5 with the pickle. md5 is in the Python standard library.
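
     A toy illustration of the idea, not the actual joblib.hash code (which subclasses the pickler so that arrays nested anywhere in the object graph are handled):

         import hashlib
         import pickle
         import numpy as np

         def toy_hash(obj):
             md5 = hashlib.md5()
             if isinstance(obj, np.ndarray):
                 # Feed the raw data buffer straight to md5, no pickling of the data
                 md5.update(np.ascontiguousarray(obj).data)
                 md5.update(pickle.dumps((obj.dtype, obj.shape)))
             else:
                 md5.update(pickle.dumps(obj))
             return md5.hexdigest()

         print(toy_hash(np.arange(10)))
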
  28. Fast, disk-based, concurrent store (joblib.dump): persisting arbitrary objects. Once again, subclass the pickler: use .npy for large numpy arrays (np.save) and pickle for the rest ⇒ multiple files. Store concurrency issues; strategy: atomic operations + try/except. Renaming a directory is atomic, and the directory layout is kept consistent with the remove operation. Good performance, usable on shared disks (clusters).
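
     A usage sketch of the resulting store, with made-up data and path; joblib.dump and joblib.load are the entry points:

         import numpy as np
         import joblib

         model = {'weights': np.random.normal(size=(5000, 200)), 'name': 'example'}
         # As described above, large arrays are stored as separate .npy files
         joblib.dump(model, '/tmp/example_model.pkl')
         restored = joblib.load('/tmp/example_model.pkl')
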
  29. Pattern 5: fast I/O. Fast read-outs, for out-of-core computing.
  30. Making I/O fast. Fast compression: the CPU may be faster than disk access; chunk data to match access patterns (pytables). Standard library: zlib.compress with buffers (bypassing the gzip module to work on-line and in-memory). Avoiding copies: zlib.compress needs C-contiguous buffers, so store the raw buffer plus meta-information (strides, class...): use __reduce__, and rebuild with np.core.multiarray._reconstruct (not available in pytables).
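
     A sketch of the idea: compress the array's raw buffer in memory with zlib and keep aside the meta-information needed to rebuild it (the real code also handles strides, chunking and non-contiguous arrays):

         import zlib
         import numpy as np

         X = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int32)

         # zlib.compress works on a C-contiguous buffer; keep dtype/shape aside
         buf = np.ascontiguousarray(X)
         payload = zlib.compress(buf.data, 3)   # low compression level = fast
         meta = (buf.dtype, buf.shape)

         # Rebuild the array from the compressed payload and the meta-information
         Y = np.frombuffer(zlib.decompress(payload), dtype=meta[0]).reshape(meta[1])
         assert np.array_equal(X, Y)
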
  31. Benchmark against np.save and pytables on NeuroImaging data (the MNI atlas). [Figure: timings, y-axis scale: 1 is np.save.]
  32. @GaelVaroquaux. Summing up: five simple Python patterns for efficient data crunching: (1) on-the-fly data reduction, (2) on-line algorithms, (3) parallel processing patterns, (4) caching, (5) fast I/O.
  33. @GaelVaroquaux. The cost of complexity is underestimated. Know your problem and solve it with simple primitives. Python modules: scikit-learn (machine learning) and joblib (pipeline-ish patterns). Come work with me! Positions available.
