The document discusses building a cutting-edge data processing environment on a budget. It describes the author's background growing up as a "penniless academic" doing quantum physics research with limited resources, which shaped their vision of computing. The author discusses patterns in data processing workflows and a design philosophy for software that is easy to use, robust, and high quality. The vision is described as machine learning without learning the machinery, with an architecture that separates data from operations while keeping an imperative programming style.
Building a cutting-edge data processing environment on a budget (Gael Varoquaux)
As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery —clusters, NoSQL, visualization, Hadoop, machine learning, ...— can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
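The "simple framework-less code" the abstract alludes to can be sketched with nothing but the standard library: embarrassingly parallel loops cover a surprising share of scientific workloads. A minimal sketch, where `analyze` is a hypothetical stand-in for a real per-chunk computation:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(chunk):
    # Hypothetical stand-in for a real per-chunk computation.
    return sum(x * x for x in chunk)

def process_all(chunks, workers=4):
    # Fan the computation out over local workers; for CPU-bound work
    # one would reach for ProcessPoolExecutor or joblib instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, chunks))
```

The point of the pattern is that swapping the executor (threads, processes, or a joblib backend) leaves the scientific code untouched.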
Succeeding in academia despite doing good software (Gael Varoquaux)
Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Keynote I gave at the SciPyCon Argentina 2014 conference
The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the contrary, software is the new medium of the scientific method. I claim that success in academia can be achieved not despite writing good software, but through it. The key is to choose the right battles, and to win them.
What is the emerging role of software in the scientific workflow? Which are the software challenges that can have impact? How to balance software quality assurance and the quick turn-around random-walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?
I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.
Slides for my keynote at Scipy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as doing Python libraries for science. What are the ingredients that the scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
Slides from the TensorFlow meetup at eBay NYC 06/07/2016 based on my blog https://medium.com/@st553/using-transfer-learning-to-classify-images-with-tensorflow-b0f3142b9366
A simplified way of approaching machine learning and deep learning from the ground up. The case for deep learning and an attempt to develop intuition for how/why it works. Advantages, state-of-the-art, and trends.
Presented at NYU Center for Genomics for NY Deep Learning Meetup
Suggestions:
1) For best quality, download the PDF before viewing.
2) Open at least two windows: one for the YouTube video, one for the screencast (link below), and optionally one for the slides themselves.
3) The YouTube video is embedded on the first page of the slide deck; for the slides themselves, skip to page 2.
Screencast: http://youtu.be/VoL7JKJmr2I
Video recording: http://youtu.be/CJRvb8zxRdE (Thanks to Al Friedrich!)
In this talk, we take Deep Learning to task with real world data puzzles to solve.
Data:
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Squeezing Deep Learning Into Mobile Phones (Anirudh Koul)
A practical talk by Anirudh Koul on how to run deep neural networks on memory- and energy-constrained devices such as smartphones. Highlights some frameworks and best practices.
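One of the standard tricks covered in talks like this is weight quantization: storing 32-bit float weights as 8-bit integers cuts memory roughly 4x at a small accuracy cost. A toy sketch of symmetric linear quantization, not any specific framework's API:

```python
def quantize(weights, bits=8):
    # Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]
    # using a single scale factor (symmetric linear quantization).
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the integer codes.
    return [v * scale for v in q]

weights = [0.5, -1.25, 0.03, 1.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
```

Real mobile frameworks add per-channel scales, zero points, and quantization-aware training on top of this basic idea.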
Presentation given at the Stockholm R useR Group (SRUG) meetup on Dec 6, 2016. Contains a general overview of deep learning, material on using TensorFlow in R, etc.
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-... (Greg Makowski)
This talk covers 4 configurations of deep learning to solve different types of application needs. Strategies for speed-up and real-time scoring are also discussed.
Building Interpretable & Secure AI Systems using PyTorch (geetachauhan)
Slides from my talk at Deep Learning World 2020. The talk covered use cases, special challenges, and solutions for building interpretable and secure AI systems using PyTorch.
- Tools for building Interpretable models
- How to build secure, privacy-preserving AI models with PyTorch
- Use cases and insights from the field
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201... (Andrew Gardner)
Note: these are the slides from a presentation at Lexis Nexis in Alpharetta, GA, on 2014-01-08 as part of the DataScienceATL Meetup. A video of this talk from Dec 2013 is available on vimeo at http://bit.ly/1aJ6xlt
Note: Slideshare mis-converted the images in slides 16-17. Expect a fix in the next couple of days.
---
Deep learning is a hot area of machine learning named one of the "Breakthrough Technologies of 2013" by MIT Technology Review. The basic ideas extend neural network research from past decades and incorporate new discoveries in statistical machine learning and neuroscience. The results are new learning architectures and algorithms that promise disruptive advances in automatic feature engineering, pattern discovery, data modeling and artificial intelligence. Empirical results from real world applications and benchmarking routinely demonstrate state-of-the-art performance across diverse problems including: speech recognition, object detection, image understanding and machine translation. The technology is employed commercially today, notably in many popular Google products such as Street View, Google+ Image Search and Android Voice Recognition.
In this talk, we will present an overview of deep learning for data scientists: what it is, how it works, what it can do, and why it is important. We will review several real world applications and discuss some of the key hurdles to mainstream adoption. We will conclude by discussing our experiences implementing and running deep learning experiments on our own hardware data science appliance.
Jay Yagnik at AI Frontiers: A History Lesson on AI (AI Frontiers)
We have reached a remarkable point in history with the evolution of AI, from applying this technology to incredible use cases in healthcare, to addressing the world's biggest humanitarian and environmental issues. Our ability to learn task-specific functions for vision, language, sequence and control tasks is getting better at a rapid pace. This talk will survey some of the current advances in AI, compare AI to other fields that have historically developed over time, and calibrate where we are in the relative advancement timeline. We will also speculate about the next inflection points and capabilities that AI can offer down the road, and look at how those might intersect with other emergent fields, e.g. Quantum computing.
MLconf - Distributed Deep Learning for Classification and Regression Problems... (Sri Ambati)
Video recording (no audio?): http://new.livestream.com/accounts/7874891/events/3565981/videos/68114143 from 32:00 to 54:30
Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice for highest predictive performance in traditional business analytics. This talk introduces Deep Learning and implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of business-critical problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization, parameter tuning and a fully-featured R interface. World record performance on the classic MNIST dataset, best-in-class accuracy for a high-dimensional eBay text classification problem and other relevant datasets showcase the power of this game-changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core.
Bio:
Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world’s largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes. He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of LHC. Arno has authored dozens of scientific papers and was a sought-after academic conference speaker. He holds a PhD and Masters summa cum laude in Physics from ETH Zurich. Arno was named 2014 Big Data All-Star by Fortune Magazine.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Alex Tellez's slides on Deep Learning Applications, including using auto-encoders, finding better Bordeaux wine, and fighting crime in Chicago, from the 3/11/15 Meetup at H2O.ai HQ and the 3/12/15 Meetup at Mills College.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Deep Learning in the Wild with Arno Candel (Sri Ambati)
"Deep Learning in the Wild" Meetup at H2O, Mountain View
Livestream: http://t.co/o7p2hYcWgy (includes part 2 with Alex Tellez)
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Talk given at PyCon Stockholm 2015
Intro to deep learning, plus taking a pretrained ImageNet network, extracting features, and training an RBM on top = 97% accuracy after one hour (!) of training (in the top 10% of the Kaggle cats vs. dogs competition)
Artificial Intelligence Workshop, Collegio universitario Bertoni, Milano, 20 May 2017.
Audience of the workshop: undergraduate students without neural networks background.
Summary:
- Deep Learning Showcase
- What is deep learning and how it works
- How to start with deep learning
- Live demo: image recognition with Nvidia DIGITS
- Playground
Duration: 2 hours.
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turnaround and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, which enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
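Much of what makes scikit-learn "easy yet production-ready" is its estimator convention: every model is an object with `fit` and `predict`, so models can be swapped and composed without framework machinery. A stripped-down sketch of the pattern — a toy nearest-class-mean classifier, not scikit-learn's actual code:

```python
class NearestMeanClassifier:
    """Toy estimator following the scikit-learn fit/predict convention."""

    def fit(self, X, y):
        # Group rows by class label, then store one mean vector per class.
        groups = {}
        for xi, yi in zip(X, y):
            groups.setdefault(yi, []).append(xi)
        self.means_ = {
            c: [sum(col) / len(col) for col in zip(*rows)]
            for c, rows in groups.items()
        }
        return self  # returning self enables chaining, as in scikit-learn

    def predict(self, X):
        # Assign each sample to the class with the closest mean.
        def dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return [min(self.means_, key=lambda c: dist(x, self.means_[c]))
                for x in X]
```

Because the interface is just two methods, the same calling code works whether the model behind it is this toy or a gradient-boosted ensemble.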
The Art Of Performance Tuning - with presenter notes! (Jonathan Ross)
A somewhat more verbose version of https://www.slideshare.net/JonathanRoss74/the-art-of-performance-tuning.
Presented at JavaOne 2017 [CON4027], this presentation takes a practical, hands-on look at Java performance tuning. It discusses methodology (spoiler: it’s the scientific method) and how to apply it to Java SE systems (on any budget). Exploring concrete examples with tools such as the Oracle Java Mission Control feature of Oracle Java SE Advanced, VisualVM, YourKit, and JMH, the presentation focuses on ways of measuring performance, how to interpret data, ways of eliminating bottlenecks, and even how to avoid future performance regressions.
DARMDN: Deep autoregressive mixture density nets for dynamical system mode... (Balázs Kégl)
Unlike computers, physical engineering systems (such as data center cooling or wireless network control) do not get faster with time. This is arguably one of the main reasons why recent beautiful advances in deep reinforcement learning (RL) stay mostly in the realm of simulated worlds and do not immediately translate to practical success in the real world. In order to make the best use of the small data sets these systems generate, we develop data-driven neural simulators to model the system and apply model-based control to optimize them. In this talk I will present the first step of this research agenda, a new versatile system modelling tool called deep autoregressive mixture density net (DARMDN – pronounced darm-dee-en). We argue that the performance of model-based reinforcement learning is partly limited by the approximation capacity of the currently used conditional density models, and show how DARMDN alleviates these limitations. The model, combined with a random shooting controller, establishes a new state of the art on the popular Acrobot benchmark. Our most interesting and counter-intuitive finding is that the "sincos" Acrobot system, which requires no multimodal posterior predictives, can be solved with a deterministic model, but only if it is trained as a probabilistic model. A deterministic model that is trained to minimize MSE leads to prediction error accumulation.
Better neuroimaging data processing: driven by evidence, open communities, an... (Gael Varoquaux)
My current thoughts about methods validity and design in brain imaging.
Data processing is a significant part of a neuroimaging study, and the choice of corresponding methods and tools is crucial. I will give an opinionated view on the path to building better data processing for neuroimaging. I will take examples from endeavors that I contributed to: defining standards for functional-connectivity analysis, the nilearn neuroimaging tool, and the scikit-learn machine-learning toolbox, an industry standard with a million regular users. I will cover not only the technical process (statistics, signal processing, software engineering) but also the epistemology of methods development. Methods govern our results; they are more than a technical detail.
Balancing Infrastructure with Optimization and Problem Formulation (Alex D. Gaudio)
- How do we currently think about Data Science?
- Why is infrastructure important to our field?
- Two tools we've built on Sailthru's Data Science team to deal with these problems are "Stolos" and "Relay.Mesos".
Computational practices for reproducible science (Gael Varoquaux)
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced, high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn computational work into libraries, and how to ensure the quality of the resulting libraries. I conclude with how those libraries fit into the larger picture of research practice to give better science.
A late upload: these slides were presented on Aug 31, 2019, when I delivered a talk at an AIoT seminar at the University of Lambung Mangkurat, Banjarbaru, as part of the Republic of IoT 2019 event.
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam... (Codemotion)
In this talk Gerbert will give an overview of Artificial Intelligence, outline the current state of the art in research, and explain what it takes to actually do an AI project. Using practical cases and tools, he will give you insight into the phases of an AI project, explain some of the problems you might encounter along the way, and show how you might solve them.
Building frameworks: from concept to completion (Ruben Goncalves)
What are the considerations when building a framework or library? How do they apply to OutSystems components? In this session, we'll do a deep dive into the importance of addressing concepts like code granularity and architecture in order to create useful, future-proof, and coherent frameworks that deliver the best possible developer experience.
A brief overview of Real-Time Analytics at Netflix and the challenges we've faced in designing and deploying production-ready products based on real-time data.
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P... (PyData)
At this workshop, you will build your own messaging insights system - data ingestion from a live data source (Reddit), queueing, deploying a machine learning model, and serving messages with insights to your mobile phone!
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh (PyData)
In the same way that we need to make assertions about how code functions, we need to make assertions about data, and unit testing is a promising framework. In this talk, we'll explore what is unique about unit testing data, and see how Two Sigma's open source library Marbles addresses these unique challenges in several real-world scenarios.
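The core idea — making assertions about data the way we make assertions about code — can be sketched with a plain check function. Marbles builds richer failure annotations on top of `unittest`, but the shape of a data assertion is the same; the record fields below are hypothetical:

```python
def check_records(records):
    # Assertions about the data itself, not about code behavior:
    # collect every violation rather than stopping at the first one.
    errors = []
    for i, row in enumerate(records):
        if row["age"] < 0:
            errors.append(f"row {i}: negative age {row['age']}")
        if not row["email"] or "@" not in row["email"]:
            errors.append(f"row {i}: malformed email {row['email']!r}")
    return errors
```

In a test suite one would assert `check_records(batch) == []`, so a failing run reports exactly which rows broke which expectations.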
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community such as Python, R, Julia, etc.
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
In this talk I will discuss exponential family embeddings, which are methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill-sets have transformed over the last 3 years using our large corpus of jobs. The key takeaway is that these models can enrich analysis of specialized datasets.
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
How many newspapers should be distributed to each store for sale every day? The data science group at The New York Times addresses this optimization problem using custom time series modeling and analytical solutions, while also incorporating qualitative business concerns. I'll describe our modeling and data engineering approaches, written in Python and hosted on Google Cloud Platform.
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
However, the the graph theory jargon can make graph analytics seem more intimidating for self-study than is necessary. In this talk, the audience will be exposed to some of the basic concepts of graph theory (no prerequisite math knowledge needed!) and a few of the Python tools available for graph analysis.
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
Those of us who use TensorFlow often focus on building the model that's most predictive, not the one that's most deployable. So how to put that hard work to work? In this talk, we'll walk through a strategy for taking your machine learning models from Jupyter Notebook into production and beyond.
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
In September 2017, dockless bikeshare joined the transportation options in the District of Columbia. In March 2018, scooter share followed. During the pilot of these technologies, Python has helped District Department of Transportation answer some critical questions. This talk will discuss how Python was used to answer research questions and how it supported the evaluation of this demonstration.
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
There are many stories of developers creating databases that don't operate at scale. The application is good, but the database won't work the realistic volumes of data. It's like a horror movie where they never looked behind the door, ran into the dark forest and night, and discovered the database was the monster killing their application. How can we leverage Python to avoid scaling problems?
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
The recent advances in machine learning and artificial intelligence are amazing! Yet, in order to have real value within a company, data scientists must be able to get their models off of their laptops and deployed within a company’s data pipelines and infrastructure. In this session, I'll demonstrate how one-off experiments can be transformed into scalable ML pipelines with minimal effort.
We will be using Beautiful Soup to Webscrape the IMDB website and create a function that will allow you to create a dictionary object on specific metadata of the IMDB profile for any IMDB ID you pass through as an argument.
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
This talk describes an experimental approach to time series modeling using 1D convolution filter layers in a neural network architecture. This approach was developed at System1 for forecasting marketplace value of online advertising categories.
Extending Pandas with Custom Types - Will AydPyData
Pandas v.0.23 brought to life a new extension interface through which you can extend NumPy's type system. This talk will explain what that means in more detail and provide practical examples of how the new interface can be leveraged to drastically improve your reporting.
Machine learning models are increasingly used to make decisions that affect people’s lives. With this power comes a responsibility to ensure that model predictions are fair. In this talk I’ll introduce several common model fairness metrics, discuss their tradeoffs, and finally demonstrate their use with a case study analyzing anonymized data from one of Civis Analytics’s client engagements.
What's the Science in Data Science? - Skipper SeaboldPyData
The gold standard for validating any scientific assumption is to run an experiment. Data science isn’t any different. Unfortunately, it’s not always possible to design the perfect experiment. In this talk, we’ll take a realistic look at measurement using tools from the social sciences to conduct quasi-experiments with observational data.
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
Forecasting time-series data has applications in many fields, including finance, health, etc. There are potential pitfalls when applying classic statistical and machine learning methods to time-series problems. This talk will give folks the basic toolbox to analyze time-series data and perform forecasting using statistical and machine learning models, as well as interpret and convey the outputs.
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
A historical text may now be unreadable, because its language is unknown, or its script forgotten (or both), or because it was deliberately enciphered. Deciphering needs two steps: Identify the language, then map the unknown script to a familiar one. I’ll present an algorithm to solve a cartoon version of this problem, where the language is known, and the cipher is alphabet rearrangement.
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Building a Cutting-Edge Data Processing Environment on a Budget by Gael Varoquaux
1. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
2. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
3. Growing up as a penniless academic
I did a PhD in
quantum physics
4. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
5. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
6. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
Shaped my vision
of computing as a
means to an end
7. Growing up as a penniless academic
2011
Tenured researcher
in computer science
8. Growing up as a penniless academic
2011
Tenured researcher
in computer science
Today
Growing team with
data science
rock stars
9. 1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
G Varoquaux 6
12. 1 Encoding models of stimuli
Predicting neural response
→ a window into brain representations of stimuli
“feature engineering” a description of the world
G Varoquaux 9
14. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“brain reading”
G Varoquaux 11
15. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“if it’s not open and verifiable by others, it’s not
science, or engineering...” Stodden, 2010
G Varoquaux 11
16. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
17. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
G Varoquaux 11
18. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
Software development challenge
G Varoquaux 11
19. 1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
G Varoquaux 12
20. 1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
“A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.”
Stephen Hawking, A Brief History of Time.
G Varoquaux 12
21. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
G Varoquaux 13
22. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
G Varoquaux 13
23. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
Our mission is to revolutionize brain data processing
on a tight budget
G Varoquaux 13
25. 2 The data processing workflow agile
Interaction...
→ script...
→ module...
↻ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
G Varoquaux 15
26. 2 From statistics to statistical learning
Paradigm shift as the
dimensionality of data
grows
# features,
not only # samples
From parameter
inference to prediction
Statistical learning is
spreading everywhere
x
y
G Varoquaux 16
27. 3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
G Varoquaux 17
28. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
G Varoquaux 18
29. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”
Break down projects by expertise
G Varoquaux 18
31. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
G Varoquaux 19
32. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover 80% of the use cases in one package
G Varoquaux 19
33. 3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
G Varoquaux 20
34. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy perfectly
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on blas/lapack
line-profiler/memory-profiler
scipy-lectures.github.io
Cython not C/C++
G Varoquaux 20
35. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy perfectly
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on blas/lapack
line-profiler/memory-profiler
scipy-lectures.github.io
Cython not C/C++
Hierarchical clustering PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need sparse growable structure for neighborhoods
skip-list in Cython!
O(log n) insert, remove, access
bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
G Varoquaux 20
38. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects pipeline, merging, etc...
G Varoquaux 21
39. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects pipeline, merging, etc...
configuration/run pattern traits, pyre
curry in functional programming functools.partial
Ideas from MVC pattern
G Varoquaux 21
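This API pattern can be sketched in a few lines of plain Python (a hypothetical toy estimator for illustration, not actual scikit-learn code): all parameters at construction, data only in fit, learned attributes suffixed with an underscore.

```python
# Sketch of the scikit-learn-style estimator pattern:
# parameters at construction, data only in fit/predict.
# (MeanShiftPredictor is a hypothetical name, not an sklearn class.)

class MeanShiftPredictor:
    """Predicts the training-target mean, shifted by `offset`."""

    def __init__(self, offset=0.0):      # parameters only, no data
        self.offset = offset

    def fit(self, X, y):                 # data enters here
        self.mean_ = sum(y) / len(y)     # learned attributes end with "_"
        return self                      # enables chaining: est.fit(X, y).predict(X)

    def predict(self, X):
        return [self.mean_ + self.offset for _ in X]

model = MeanShiftPredictor(offset=1.0).fit([[0], [1], [2]], [10, 20, 30])
print(model.predict([[5], [6]]))  # [21.0, 21.0]
```

Because the constructor takes no data, estimators compose naturally into pipelines and can be cloned with the same parameters for cross-validation.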
41. 4 Big data on small hardware
Biggish
smallish
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
G Varoquaux 22
42. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
G Varoquaux 23
43. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
G Varoquaux 23
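The running mean above, as framework-less code (a minimal stdlib sketch): one pass, one sample at a time, constant memory.

```python
# Running (online) mean: process samples one at a time,
# never holding the whole dataset in memory.
def running_mean(samples):
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n   # incremental update after each sample
    return mean

print(running_mean(range(1, 101)))
```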
44. 4 On-line algorithms
Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)
11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10,
n_init=2).fit(X)
0.62 s
G Varoquaux 23
45. 4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
G Varoquaux 24
46. 4 On-the-fly data reduction
Dropping data
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel
G Varoquaux 24
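The drop-data loop above can be sketched with the standard library alone (hypothetical helper name, illustration only):

```python
# Subsample-and-aggregate sketch: estimate a statistic on random
# fractions of the data, then aggregate across sub-samplings.
# Exploits redundancy across observations; each run is independent,
# so the loop parallelizes trivially.
import random
import statistics

def subsample_aggregate(data, estimator, fraction=0.1, n_runs=20, seed=0):
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    results = [estimator(rng.sample(data, k)) for _ in range(n_runs)]
    return statistics.mean(results)       # aggregate across sub-samplings

data = list(range(1000))                  # true mean: 499.5
print(subsample_aggregate(data, statistics.mean))
```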
47. 4 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
G Varoquaux 24
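The stateless hashing idea can be sketched with the standard library (a simplified toy; the real HashingVectorizer additionally uses signed hashes to limit collision bias):

```python
# Feature hashing: map each token to a fixed-size vector slot via a
# hash. No vocabulary to fit, so it is stateless and works on streams
# of varying size, in parallel.
import hashlib

def hash_vectorize(tokens, n_features=16):
    vec = [0] * n_features
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % n_features] += 1          # collisions are accepted
    return vec

v = hash_vectorize("the cat sat on the mat".split())
print(len(v), sum(v))  # 16 6
```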
48. 4 On-the-fly data reduction
Example: randomized SVD Random projection
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
G Varoquaux 24
49. 4 Biggish iron
Our new box: 15 k€
48 cores
384G RAM
70T storage
(SSD cache on RAID controller)
Gets our work done faster than our 800 CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
G Varoquaux 25
51. 5 Parallel processing big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism
Too fine → overhead
Too coarse → memory shortage
Scale by the relevant cache pool
G Varoquaux 27
52. 5 Parallel processing joblib
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
G Varoquaux 27
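For comparison, the same embarrassingly parallel loop with nothing but the standard library (joblib adds traceback reporting, memmapping, and on-the-fly dispatch on top of this):

```python
# Framework-less parallel map over an embarrassingly parallel loop,
# using only concurrent.futures from the standard library.
from concurrent.futures import ThreadPoolExecutor
from math import sqrt

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(sqrt, (i ** 2 for i in range(8))))

print(results)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```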
53. 5 Parallel processing joblib
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs – memory-friendly
Threads or processes backend
G Varoquaux 27
55. 5 Parallel processing Queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
→ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
→ pre_dispatch in joblib
→ Back and forth communication
Door open to race conditions
G Varoquaux 28
56. 5 Parallel processing: what happens where
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
G Varoquaux 29
57. 5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
G Varoquaux 30
58. 5 Caching The joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a) # computes f(a)
c = g(a) # retrieves result from store
G Varoquaux 30
59. 5 Caching The joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a) # computes f(a)
c = g(a) # retrieves result from store
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
G Varoquaux 30
60. 5 Caching The joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a) # computes f(a)
c = g(a) # retrieves result from store
Lego bricks for out-of-core algorithms coming soon
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
G Varoquaux 30
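A stripped-down, stdlib-only sketch of the memoize-to-disk pattern (illustration only; joblib.Memory additionally hashes array arguments efficiently and handles concurrent stores):

```python
# Memoize pattern to disk: key results by a hash of the arguments,
# store them with pickle, and load on a cache hit.
import hashlib
import os
import pickle
import tempfile

def disk_cache(func, cachedir=None):
    cachedir = cachedir or tempfile.mkdtemp()

    def wrapper(*args):
        key = hashlib.md5(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(cachedir, key + ".pkl")
        if os.path.exists(path):                  # cache hit: load from store
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args)                      # cache miss: compute...
        with open(path, "wb") as f:
            pickle.dump(result, f)                # ...and persist
        return result

    return wrapper

calls = []
def f(x):
    calls.append(x)
    return x * 2

g = disk_cache(f)
print(g(21), g(21), len(calls))  # 42 42 1 — second call hits the cache
```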
61. 5 Efficient input argument hashing – joblib.hash
Compute md5*
of input arguments
Trade-off between features and cost
Black boxy
Robust and completely generic
G Varoquaux 31
62. 5 Efficient input argument hashing – joblib.hash
Compute md5*
of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass data pointer to md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
* md5 is in the Python standard library
G Varoquaux 31
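The same idea in bare stdlib form (a simplification: as the steps above describe, joblib.hash feeds ndarray data buffers to md5 directly instead of pickling them):

```python
# Hash arbitrary input arguments by updating md5 with their pickle.
# Equal arguments give equal hashes; any picklable object works.
import hashlib
import pickle

def args_hash(*args, **kwargs):
    m = hashlib.md5()                                  # 1. md5 hash object
    m.update(pickle.dumps((args, sorted(kwargs.items()))))  # 2-4. pickle, then update
    return m.hexdigest()

print(args_hash(1, [2, 3], mode="fast") == args_hash(1, [2, 3], mode="fast"))  # True
```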
63. 5 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
ñ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
G Varoquaux 32
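The atomic-operation strategy can be sketched at the file level with the standard library (the slide relies on atomic directory renames; os.replace on a single file shows the same idea):

```python
# Atomic write: dump to a temporary file in the same directory,
# then rename into place. os.replace is atomic on POSIX, so
# concurrent readers see either the old store or the new one,
# never a half-written file.
import os
import tempfile

def atomic_dump(data: bytes, path: str):
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)        # atomic swap into place
    except BaseException:
        os.remove(tmp)               # clean up on failure
        raise

target = os.path.join(tempfile.mkdtemp(), "store.bin")
atomic_dump(b"payload", target)
print(open(target, "rb").read())  # b'payload'
```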
64. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
G Varoquaux 33
65. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
G Varoquaux 33
66. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
Single file dump coming soon
File opening is slow on cluster
Challenge: streaming the above for memory usage
G Varoquaux 33
67. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
Single file dump coming soon
File opening is slow on cluster
Challenge: streaming the above for memory usage
What matters on large systems
Numbers of bytes stored
brings network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file access
brings shared storage down
G Varoquaux 33
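The zlib-with-buffers approach can be sketched with the standard library (illustrative parameter choices): zlib.compressobj streams chunks through online and in memory, bypassing the gzip module.

```python
# Online, in-memory compression of a stream of buffers with zlib,
# without going through the gzip file interface.
import zlib

def stream_compress(chunks, level=1):   # low level keeps CPU cheaper than I/O
    comp = zlib.compressobj(level)
    out = [comp.compress(c) for c in chunks]
    out.append(comp.flush())            # emit any buffered tail
    return b"".join(out)

chunks = [b"abcd" * 1000 for _ in range(10)]
blob = stream_compress(chunks)
print(len(b"".join(chunks)), "->", len(blob), "bytes")
```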
68. 5 Benchmarking to np.save and pytables
[Benchmark plot; y-axis scale: 1 is np.save]
NeuroImaging data (MNI atlas)
G Varoquaux 34
69. 6 The bigger picture: building
an ecosystem
Helping your future self
G Varoquaux 35
70. 6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
~ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
G Varoquaux 36
71. 6 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ~ 300 emails/month    nipy ~ 45 emails/month
joblib ~ 45 emails/month    mayavi ~ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
G Varoquaux 37
72. 6 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ~ 300 emails/month    nipy ~ 45 emails/month
joblib ~ 45 emails/month    mayavi ~ 30 emails/month
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
G Varoquaux 37
73. 6 Many eyes make code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
G Varoquaux 38
74. 6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
G Varoquaux 39
75. 6 Core project contributors
[Plot: normalized number of commits per individual committer, since 2009-06]
Credit: Fernando Perez, Gist 5843625
G Varoquaux 40
76. 6 The tragedy of the commons
Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
→ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
G Varoquaux 41
77. @GaelVaroquaux
Solving problems that matter
The 80/20 rule
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ... I hope
79. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
80. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
With your users (documentation)
With your contributors
81. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
A perfect
design?