
On the code of data science

Data science calls for rapid experimentation and building intuitions from the data. Yet data science also underpins crucial decisions and operational logic. Writing production-ready, robust statistical analysis without cognitive overhead may seem like a conundrum. I will explore simple, and less simple, practices for fast turnaround and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, which enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.

Published in: Technology

On the code of data science

  1. 1. On the code of data science. Gaël Varoquaux. import data_science; data_science.discover()
  2. 2. Disclaimer: a bit of a Python bias, but the key ideas carry over.
  3. 3. What is data science?
  5. 5. Data science = statistics + code. Larger datasets. Automating insights. Data science is disruptive when it enables real-time or personalized decisions.
  6. 6. Outline: 1 Writing code for data science; 2 Machine learning in Python; 3 Outlook: infrastructure. Goal: make you more productive as a data scientist.
  7. 7. 1 Writing code for data science: productivity & quality
  8. 8. 1 A data-science workflow. Work is based on intuition and experimentation: conjecture, then experiment → interactive & framework-less. Yet it needs consolidation that keeps flexibility.
  9. 9. 1 A design pattern in computational experiments. The MVC pattern, from Wikipedia. Model: manages the data and rules of the application. View: output representation, possibly several views. Controller: accepts input and converts it to commands for the model and view. In photo-editing software: filters (model), canvas (view), tool palettes (controller). In a typical web application: database (model), web pages (view), URLs (controller).
  10. 10. 1 A design pattern in computational experiments. MVC for data science. Model: numerical, data-processing, & experimental logic, as a module with functions. View: results as files (data & plots), as a post-processing script plus CSV & data files. Controller: an imperative API, as a script → for loops. Avoid input as files: not expressive.
  11. 11. 1 A design pattern in computational experiments. A recipe. 3 types of files: modules, command scripts, post-processing scripts; plus CSVs & intermediate data files. Separate computation from analysis / plotting. Code and text (and data) → version control.
  12. 12. 1 A design pattern in computational experiments. Decouple steps. Goals: reuse code, mitigate compute time.
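The recipe above can be sketched in a single file for brevity. This is a minimal illustration, not the author's code: the function names (`fit_scale`, `run_experiment`, `summarize`), the CSV layout, and the toy computation are all assumptions made up for the example. In practice each part would live in its own file: a module, a command script, and a post-processing script.

```python
import csv
import os
import tempfile


# --- "module": pure functions holding the numerical logic (the model) ---
def fit_scale(values, scale):
    """Toy computation standing in for an expensive numerical step."""
    return sum(v * scale for v in values) / len(values)


# --- "command script": a for loop over parameters, results saved to CSV ---
def run_experiment(csv_path, values, scales):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["scale", "score"])
        for scale in scales:
            writer.writerow([scale, fit_scale(values, scale)])


# --- "post-processing script": reads the CSV, never recomputes ---
def summarize(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    best = max(rows, key=lambda row: float(row["score"]))
    return float(best["scale"]), float(best["score"])


results_file = os.path.join(tempfile.mkdtemp(), "results.csv")
run_experiment(results_file, values=[1.0, 2.0, 3.0], scales=[0.5, 1.0, 2.0])
best_scale, best_score = summarize(results_file)
print(best_scale, best_score)
```

Because the analysis step only reads the intermediate CSV, it can be rerun and tweaked many times without paying for the computation again.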
  13. 13. 1 How I work: progressive consolidation. Start with a script, playing to understand the problem.
  14. 14. 1 How I work: progressive consolidation. Identify blocks/operations → move them to a function. Use functions. Obstacle: local scope requires identifying input and output variables. That's a good thing. Solution: interactive debugging / understanding inside a function, with %debug in IPython. Functions are the basic reusable abstraction.
  15. 15. 1 How I work: progressive consolidation. As functions stabilize, move them to a module. Modules enable sharing between experiments → avoid 1000-line scripts + commented-out code. Modules enable testing: fast experiments as tests → gives confidence, hence refactorings.
  16. 16. 1 How I work: progressive consolidation. Clean: delete code & files, you have version control. Attentional load makes it impossible to find or understand things. Where's Waldo?
  17. 17. 1 How I work: progressive consolidation. Why is it hard? Long compute times make us unadventurous. Know your tools: refactoring editor, version control.
  18. 18. 1 Caching to tame computation time. The memoize pattern:
        mem = joblib.Memory(cachedir='.')  # location='.' in joblib >= 0.12
        g = mem.cache(f)
        b = g(a)  # computes b using f
        c = g(a)  # retrieves the result from the cache
      For data science: a & b can be big; a & b can be arbitrary objects; no change in workflow. Results are stored on disk.
  19. 19. 1 Caching to tame computation time. For data science: fits in the experimentation loop; helps decrease re-run times. Caveat: black-boxy, persistence is only implicit.
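To make the memoize pattern less of a black box, here is a minimal standard-library sketch of the idea. It is not joblib's implementation: `disk_cache` and `slow_square` are hypothetical names, and keying the cache on pickled arguments is an assumption chosen for simplicity. joblib.Memory handles much more, for instance large numerical arrays.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cache(func, cache_dir):
    """Return a version of func that stores its results on disk."""
    def cached(*args):
        # Key the cache on the function name and the pickled arguments.
        key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)      # cache hit: no recomputation
        result = func(*args)               # cache miss: compute...
        with open(path, "wb") as f:
            pickle.dump(result, f)         # ...and persist for next time
        return result
    return cached


calls = []


def slow_square(values):
    calls.append(values)                   # records each real computation
    return [v ** 2 for v in values]


g = disk_cache(slow_square, tempfile.mkdtemp())
b = g([1, 2, 3])   # computes b using slow_square
c = g([1, 2, 3])   # retrieves the result from disk
print(b, c, len(calls))
```

The second call returns the same result while `slow_square` runs only once, which is what keeps re-run times low in an experimentation loop.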
  20. 20. Adopting software-engineering best practices
  21. 21. 1 The ladder of code quality. Use linting/code analysis in your editor. Seriously.
  22. 22. 1 The ladder of code quality. Coding conventions, good naming. Version control: use git + GitHub. Code review. Unit testing: if it's not tested, it's broken or soon will be. Make a package: controlled dependencies and compilation. ...
  23. 23. 1 The ladder of code quality. Each rung comes at increasing cost. Avoid premature software engineering.
  24. 24. 1 The ladder of code quality. Over- versus under-engineering. When the goal is generating insights: experimentation to develop intuitions → new ideas. As the path becomes clear: consolidation. Heavy engineering too early freezes bad ideas.
  25. 25. 1 Libraries. The last rung of the ladder, at the highest cost: a library.
  26. 26. 2 Machine learning in Python: scikit-learn
  27. 27. 2 Tradeoffs: experimentation ↔ production. scikit-learn.
  28. 28. 2 My stack for data science. Python, what else? General-purpose language. Interactive. Easy to read / write.
  29. 29. 2 My stack for data science. The scientific Python stack: numpy arrays. Mostly a float**. No annotation / structure. Universal across applications. Easily shared across languages. [Figure: an array of numbers]
  30. 30. 2 My stack for data science. The scientific Python stack: numpy arrays, connecting to pandas (columnar data), scikit-image (images), scipy (numerics, signal processing), ...
  31. 31. 2 Machine learning in a nutshell. Machine learning is about making predictions from data, e.g. learning to distinguish apples from oranges.
  32. 32. 2 Machine learning in a nutshell. "Prediction is very difficult, especially about the future." (Niels Bohr) Learn as much as possible from the data, but not too much.
  33. 33. 2 Machine learning in a nutshell. [Figure: two models fitted to the same (x, y) data] Which model do you prefer?
  34. 34. 2 Machine learning in a nutshell. Minimizing train error ≠ generalization: overfit.
  35. 35. 2 Machine learning in a nutshell. Adapting model complexity to data: regularization.
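The gap between train error and generalization can be shown with a small, fully deterministic sketch. Everything here is an assumption made for illustration: the data (a line plus an alternating +1 / -1 "noise"), the two toy models, and the helper names. A nearest-neighbor model memorizes the training set, while a least-squares line is a simpler model. Regularization itself is not shown.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_nearest_neighbor(xs, ys):
    """Memorizes the training set: predicts the y of the closest x."""
    def predict(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return predict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


# Training data: y = 2x plus a deterministic "noise" of +1 / -1.
x_train = [float(i) for i in range(20)]
y_train = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(x_train)]
# Held-out data: new x values with the noise-free target.
x_test = [x + 0.25 for x in x_train]
y_test = [2 * x for x in x_test]

nn = fit_nearest_neighbor(x_train, y_train)
line = fit_line(x_train, y_train)
print("nearest neighbor:", mse(nn, x_train, y_train), mse(nn, x_test, y_test))
print("line:", mse(line, x_train, y_train), mse(line, x_test, y_test))
```

The memorizing model reaches zero train error yet does worse on held-out points than the line, whose train error is not zero.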
  36. 36. 2 Machine learning without learning the machinery
  37. 37. 2 Machine learning without learning the machinery. A library, not a program: more expressive and flexible; easy to include in an ecosystem. Let's disrupt something new.
  38. 38. 2 Machine learning without learning the machinery. As easy as py:
        from sklearn import svm
        classifier = svm.SVC()
        classifier.fit(X_train, y_train)
        Y_test = classifier.predict(X_test)
  39. 39. 2 Show me your data: the samples × features matrix. Data input: a 2D numerical array, with samples as rows and features as columns. Requires transforming your problem. [Figure: a numerical matrix; axes: samples, features]
  40. 40. 2 Show me your data: the samples × features matrix. With text documents: rows are documents and columns are terms ("the", "Python", "performance", "profiling", "module", "is", "code", "can", "a"), built with sklearn.feature_extraction.text.TfidfVectorizer.
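What "transforming your problem" means for text can be sketched in plain Python. This is a simplified illustration with made-up documents: it only counts terms, whereas TfidfVectorizer additionally reweights the counts by inverse document frequency. The variable names are assumptions.

```python
documents = [
    "the Python performance profiling module",
    "Python code can profile a module",
    "the code is a module",
]

# Vocabulary: one column per distinct term, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
column = {word: j for j, word in enumerate(vocabulary)}

# One row per document, one column per term, entries are term counts.
X = [[0] * len(vocabulary) for _ in documents]
for i, doc in enumerate(documents):
    for word in doc.lower().split():
        X[i][column[word]] += 1

print(vocabulary)
for row in X:
    print(row)
```

Each document, whatever its length, becomes one row of fixed width, which is the 2D numerical array that estimators expect.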
  41. 41. "Big" data: engineering efficient processing pipelines. Many samples, or many features. [Figures: samples × features matrices, one with many samples, one with many features] See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
  42. 42. 2 Many samples: on-line algorithms. estimator.partial_fit(X_train, Y_train) [Figure: the data matrix fed to the estimator in successive chunks of samples]
  43. 43. 2 Many samples: on-line algorithms. Supervised models (predicting): sklearn.naive_bayes..., sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier. Clustering (grouping samples): sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch. Linear decompositions (finding new representations): sklearn.decomposition.IncrementalPCA, sklearn.decomposition.MiniBatchDictionaryLearning, sklearn.decomposition.LatentDirichletAllocation.
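The on-line pattern behind `partial_fit` can be sketched with a toy estimator that only ever sees one chunk of samples at a time. `RunningMean` and `stream_chunks` are hypothetical names invented for this example, and the estimator is not part of scikit-learn; it merely mimics the calling convention.

```python
class RunningMean:
    """Toy on-line estimator: per-feature mean, updated chunk by chunk."""

    def __init__(self):
        self.n_samples_seen_ = 0
        self.mean_ = None

    def partial_fit(self, X_chunk):
        if self.mean_ is None:
            self.mean_ = [0.0] * len(X_chunk[0])
        for row in X_chunk:
            self.n_samples_seen_ += 1
            # Incremental update: shift the mean towards the new sample.
            self.mean_ = [m + (x - m) / self.n_samples_seen_
                          for m, x in zip(self.mean_, row)]
        return self


def stream_chunks(X, chunk_size):
    """Stand-in for reading a large dataset piece by piece."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size]


X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
estimator = RunningMean()
for chunk in stream_chunks(X, chunk_size=2):
    estimator.partial_fit(chunk)   # only one chunk is needed at a time

print(estimator.mean_, estimator.n_samples_seen_)
```

The estimator's state stays small and fixed in size, so the number of samples is bounded by time rather than by memory.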
  44. 44. 2 Many features: on-the-fly data reduction → reduce the data as it is loaded: X_small = estimator.transform(X_big, y)
  45. 45. 2 Many features: on-the-fly data reduction. Random projections (will average features): sklearn.random_projection, random linear combinations of the features. Fast clustering of features: sklearn.cluster.FeatureAgglomeration; on images: super-pixel strategy. Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless: can be used in parallel.
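Why hashing is stateless can be seen in a few lines of standard-library Python. This sketch is an assumption-laden simplification: `hash_vectorize` is a made-up helper, MD5 is used only because it is deterministic across runs, and scikit-learn's HashingVectorizer uses a different hash function and more options.

```python
import hashlib


def hash_vectorize(text, n_features=16):
    """Map a text of any length to a fixed-size vector of token counts."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_features
        vector[index] += 1   # colliding tokens simply share a column
    return vector


short = hash_vectorize("python code")
long = hash_vectorize("the python performance profiling module is python code")
print(short)
print(long)
```

No vocabulary is fitted or stored: the column of a token depends only on the token itself, so separate workers can vectorize different chunks of data independently and still agree on the columns.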
  46. 46. More gems in scikit-learn. SAG: linear_model.LogisticRegression(solver='sag'). Fast linear model on biggish data.
  47. 47. More gems in scikit-learn. PCA == RandomizedPCA (0.18): heuristic to switch PCA to randomized linear algebra. Fights global warming. Huge speed gains for biggish data.
  48. 48. More gems in scikit-learn. New cross-validation objects (0.18). The old API, for comparison:
        from sklearn.cross_validation import StratifiedKFold
        cv = StratifiedKFold(y, n_folds=2)
        for train, test in cv:
            X_train = X[train]
            y_train = y[train]
  49. 49. More gems in scikit-learn. New cross-validation objects (0.18). The new API:
        from sklearn.model_selection import StratifiedKFold
        cv = StratifiedKFold(n_splits=2)
        for train, test in cv.split(X, y):
            X_train = X[train]
            y_train = y[train]
      The cross-validation object is data-independent → better nested CV.
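Why a data-independent splitter helps nested cross-validation can be sketched with a toy class. `ToyKFold` is a hypothetical stand-in written for this example: it is not stratified, it uses interleaved folds, and it only imitates the `split(X, y)` calling convention of the new objects.

```python
class ToyKFold:
    """Toy splitter: configured without data, applied to data via split()."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def split(self, X, y=None):
        indices = list(range(len(X)))
        for k in range(self.n_splits):
            test = indices[k::self.n_splits]   # every n_splits-th sample
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


cv = ToyKFold(n_splits=2)            # no data needed here

X_outer = list("abcdef")
outer_folds = list(cv.split(X_outer))
# Nested CV: the same object re-splits each outer training set.
inner_folds = [list(cv.split(train)) for train, _ in outer_folds]

print(outer_folds)
print(inner_folds[0])
```

With the old style, where the object was built from `y`, the splitter was tied to one dataset and could not be reused on the smaller training sets of an inner loop. Note that the inner indices are positions within each outer training set.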
  50. 50. More gems in scikit-learn. Outlier detection and isolation forests (0.18).
  51. 51. 3 Outlook: infrastructure. No strings attached.
  52. 52. 3 Dataflow is key to scale. Array computing (CPU). Data parallel. Streaming. Parallel computing. Data + code transfer. Out-of-memory persistence. These patterns can yield horrible code. [Figures: data arrays illustrating each pattern]
  53. 53. 3 Parallel-computing engine: joblib. sklearn.Estimator(n_jobs=2)
  54. 54. 3 Parallel-computing engine: joblib. Under the hood: joblib.
        >>> from math import sqrt
        >>> from joblib import Parallel, delayed
        >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
        ...     for i in range(8))
        [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
      Threading and multiprocessing modes.
  56. 56. 3 Parallel-computing engine: joblib. New: distributed computing backends: Yarn, dask.distributed, IPython.parallel.
        import distributed.joblib
        from joblib import Parallel, parallel_backend
        with parallel_backend('dask.distributed', scheduler_host='HOST:PORT'):
            ...  # normal joblib code
  57. 57. 3 Parallel-computing engine: joblib. Algorithmic plain-Python code with optional bells and whistles.
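For readers without joblib at hand, the same "plain loop, optionally parallel" idea can be sketched with the standard library. This is an analogue of the threading mode only, not joblib's API or implementation; the variable names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt

# Plain sequential version of the computation.
sequential = [sqrt(i ** 2) for i in range(8)]

# Same computation dispatched to 2 worker threads; map() keeps input order.
with ThreadPoolExecutor(max_workers=2) as executor:
    parallel = list(executor.map(sqrt, (i ** 2 for i in range(8))))

print(parallel)
```

The algorithmic code stays a simple loop over inputs; only the line that dispatches the work changes.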
  58. 58. 3 Persistence to disk. Persisting any Python object fast, with little overhead. Streamed compression. I/O from/in open file handles → in S3, HDFS.
  59. 59. 3 joblib.Memory as a storage pool (in development). S3/HDFS/cloud backend: joblib.Memory('uri', backend='s3'), see https://github.com/joblib/joblib/pull/397
  60. 60. 3 joblib.Memory as a storage pool (in development). Cache replacement:
        mem = joblib.Memory('uri', bytes_limit='1G')
        mem.reduce_size()  # remove oldest
  61. 61. 3 joblib.Memory as a storage pool (in development). Out-of-memory computing:
        >>> result = mem.cache(g).call_and_shelve(a)
        >>> result
        MemorizedResult(cachedir="...", func="g", argument_hash="...")
        >>> c = result.get()
  62. 62. 3 joblib as bricks of a compute engine: vision
  63. 63. 3 joblib as bricks of a compute engine: vision. Parallel on a cloud/cluster. Memory + shelving on distributed stores, as out-of-core & common computation. Simple algorithmic code. Using infrastructure without buying into it.
  64. 64. Time to wrap up
  65. 65. Time to wrap up: code, code, code
  66. 66. Scipy-lectures: learning numerical Python. Many problems are better solved by documentation than new code.
  67. 67. Scipy-lectures: learning numerical Python. A comprehensive document: numpy, scipy, ... 1. Getting started with Python for science; 2. Advanced topics; 3. Packages and applications. http://scipy-lectures.org
  68. 68. Scipy-lectures: learning numerical Python. Code examples.
  69. 69. On the code of data science. @GaelVaroquaux. I believe in code.
  70. 70. I believe in code without compromises. Libraries with side-effect-free code, tight APIs, decoupling of analysis and plotting. Software engineering: version control.
  71. 71. Code empowers: it enables automation and opens new applications. Disrupt something new. We use scikit-learn for markers of neuropsychiatric diseases.
  72. 72. Software purity holds back. Interactivity: brittle and costly, but necessary for insights. Refusing syntax shortcuts → verbosity.
  73. 73. The cost is worth the benefit in the long run.
