As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data-analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
Processing biggish data on commodity hardware: simple Python patterns (Gael Varoquaux)
Scipy 2013 talk on simple Python patterns to efficiently process large datasets.
The talk focuses on the patterns and the concepts rather than the implementations. The implementations can be found in the joblib and scikit-learn codebases.
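As an illustration of one such pattern, the on-disk memoization at the heart of joblib can be sketched with the standard library alone. This is a toy sketch keyed on pickled arguments, not joblib's actual implementation; the `disk_cache` name and directory layout are assumptions for illustration:

```python
import hashlib
import os
import pickle
import tempfile

def disk_cache(cache_dir):
    """Memoize a function to disk, keyed on a hash of its arguments."""
    os.makedirs(cache_dir, exist_ok=True)
    def decorator(func):
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):  # cache hit: skip recomputation
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:  # cache miss: compute and store
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@disk_cache(os.path.join(tempfile.gettempdir(), "demo_cache"))
def expensive_sum(n):
    return sum(i * i for i in range(n))

print(expensive_sum(10))  # computed, then written to disk
print(expensive_sum(10))  # served from the on-disk cache
```

The design choice this illustrates: caching at function-call granularity lets exploratory scripts be re-run cheaply without a workflow engine.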
Succeeding in academia despite doing good software (Gael Varoquaux)
Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Keynote I gave at the Scipyconf Argentina 2014 conference
The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the contrary, software is the new medium of the scientific method. I claim that succeeding in academia can be achieved not despite writing good software but through such an accomplishment. The key is to choose the right battles and to win them.
What is the emerging role of software in the scientific workflow? What are the software challenges that can have an impact? How can one balance software quality assurance with the quick-turnaround random walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?
I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.
Slides for my keynote at Scipy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as doing Python libraries for science. What are the ingredients that the scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turnaround and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, which enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
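A sketch of the design this refers to: scikit-learn's estimator convention puts hyperparameters in the constructor, stores learned state in attributes with a trailing underscore, and has `fit` return `self`. The toy estimator below is illustrative, not part of scikit-learn:

```python
class MeanRegressor:
    """Toy estimator following the scikit-learn convention:
    hyperparameters in __init__, state learned in fit (attributes with a
    trailing underscore), inference in predict, and fit returning self."""

    def __init__(self, offset=0.0):
        self.offset = offset  # hyperparameter, set before seeing data

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # learned state
        return self  # enables chaining: est.fit(X, y).predict(X)

    def predict(self, X):
        return [self.mean_ + self.offset for _ in X]


preds = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0]).predict([[5], [6]])
print(preds)  # [2.0, 2.0]
```

The same uniform contract is what lets estimators be swapped inside pipelines and grid searches, one key to both easy experimentation and production use.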
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging (Gael Varoquaux)
This talk describes our efforts to bring easily usable machine learning to brain mapping. It covers both the questions that machine learning can answer and two software packages developed to facilitate machine learning and its application to neuroimaging.
Computational practices for reproducible science (Gael Varoquaux)
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn the computational work into libraries, and to ensure the quality of the resulting libraries. And I conclude on how those libraries need to fit in the larger picture of the exercise of research to give better science.
Towards new solutions for scientific computing: the case of Julia (Maurizio Tomasi)
The year 2018 marks the consolidation of Julia (https://julialang.org/), a programming language designed for scientific computing, with the release of its first stable version (1.0) in August 2018. Among its main features, expressiveness and high execution speed are the most prominent: the performance of Julia code is similar to that of statically compiled languages, yet Julia provides a nice interactive shell and fully supports Jupyter; moreover, it can transparently call external code written in C, Fortran, and even Python and R without the need for wrappers. The usage of Julia in the astronomical community is growing, and a GitHub organization named JuliaAstro coordinates the development of packages. In this ADASS talk we present the features and shortcomings of this language, and discuss its application in astronomy and astrophysics.
Jay Yagnik at AI Frontiers: A History Lesson on AI (AI Frontiers)
We have reached a remarkable point in history with the evolution of AI, from applying this technology to incredible use cases in healthcare, to addressing the world's biggest humanitarian and environmental issues. Our ability to learn task-specific functions for vision, language, sequence and control tasks is getting better at a rapid pace. This talk will survey some of the current advances in AI, compare AI to other fields that have historically developed over time, and calibrate where we are in the relative advancement timeline. We will also speculate about the next inflection points and capabilities that AI can offer down the road, and look at how those might intersect with other emergent fields, e.g., quantum computing.
Scientific research is based on the central idea of a hypothesis, meant to be established or refuted. Over time and from multiple sources we collect evidence that may be loosely in favor of or against it. Hypothesis management, therefore, is closely related to the management of probabilistic data. It finds a broader field of application in the context of big data, which is in need of usable systems to address the emerging culture of data-driven decision making.
In this talk I will focus on a specific problem in that landscape, which is the automatic synthesis of a (U-relational) probabilistic database out of tentative sets of mathematical equations (a previously existing user-knowledge specification). The synthesized database is ensured to be normalized according to the identified uncertainty factors. This is in favor of both (1) predictive analytics, supporting the user in keeping track of correlations between input and output data; and (2) conditioning, allowing probability-distribution updates to be performed by Bayesian inference in the presence of observations.
Scientist meets web dev: how Python became the language of data (Gael Varoquaux)
Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn’t get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist.
In this talk I give a personal perspective on the progress of the scientific Python ecosystem, from numerical physics to data mining. What made Python suitable for science; Why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; And where this richness might lead us.
The talk will discuss low-level and high-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current technical details that make scikit-learn and joblib stand out.
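As an illustration of what "moving large chunks of numbers across code" rests on: Python's buffer protocol lets libraries such as NumPy share a block of memory without copying. A stdlib-only sketch (the actual NumPy machinery is richer):

```python
from array import array

# A contiguous block of doubles, like a NumPy array's underlying buffer.
data = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the buffer protocol: other code (NumPy, C extensions)
# can read and write these numbers with no copy.
view = memoryview(data)
view[1] = 20.0          # writes through to the original buffer
print(data[1])          # 20.0

# Slicing a memoryview is also zero-copy: `half` aliases the tail of `data`.
half = view[2:]
half[0] = 30.0
print(data.tolist())    # [1.0, 20.0, 30.0, 4.0]
```

This shared-buffer convention (formalized in PEP 3118) is what allows numerical libraries to exchange gigabyte-sized arrays at the cost of passing a pointer.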
Brain maps from machine learning? Spatial regularizations (Gael Varoquaux)
Pattern Recognition for NeuroImaging (PR4NI)
We will show empirically how the commonly used pattern-recognition techniques, such as SVMs, provide low-quality brain maps, even though they give very good prediction accuracy. We will give an overview of recently developed techniques to impose priors on patterns particularly well suited to neuroimaging: selecting a small number of spatially structured predictive brain regions. These tools reconcile machine learning with brain mapping by giving maps that are more useful for drawing neuroscientific conclusions. In addition, they are more robust to cross-individual spatial variability and thus generalize well across subjects.
Scikit-learn for easy machine learning: the vision, the tool, and the project (Gael Varoquaux)
Scikit-learn is a popular machine learning tool. What can it do for you? Why would you want to use it? What can you do with it? Where is it going? In this talk, I will discuss why and how scikit-learn became popular. I will argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile. I will shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?
Inter-site autism biomarkers from resting-state fMRI (Gael Varoquaux)
We present an automated pipeline to learn predictive biomarkers from resting-state fMRI. We apply it to classifying autism on unseen sites, demonstrating the feasibility of biomarkers on weakly standardized functional imaging data.
We study which steps of the pipeline are important for prediction and show that 1) the choice of atlas is the most important choice, and ideally the atlas should be made of functional regions learned from the data; and 2) the "tangent space" parametrization of the connectivity is the best performer.
We conclude with general recommendations for predictive biomarkers from resting-state fMRI.
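The "tangent space" parametrization mentioned above can be sketched in a few lines of NumPy: whiten each covariance matrix by a reference point and take a matrix logarithm, which maps the covariances into a vector space where ordinary statistics apply. This is a simplified illustration (Euclidean mean as the reference, no regularization), not the talk's exact pipeline; tools such as nilearn use a geometric mean and more care:

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

def tangent_embedding(covs):
    """Project covariance matrices to the tangent space at a reference point."""
    ref = np.mean(covs, axis=0)                       # reference covariance
    vals, vecs = np.linalg.eigh(ref)
    whitener = vecs @ np.diag(vals ** -0.5) @ vecs.T  # ref^{-1/2}
    # Whiten each covariance by the reference, then take the matrix log.
    return [spd_log(whitener @ c @ whitener) for c in covs]

rng = np.random.default_rng(0)
covs = []
for _ in range(3):  # three subjects' 4x4 connectivity matrices
    a = rng.normal(size=(4, 50))
    covs.append(a @ a.T / 50)
embedded = tangent_embedding(covs)
print(embedded[0].shape)  # (4, 4)
```

The resulting symmetric matrices can be vectorized and fed to any standard classifier, which is why this parametrization composes well with off-the-shelf predictors.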
Machine learning and cognitive neuroimaging: new tools can answer new questions (Gael Varoquaux)
Machine learning is geared towards prediction. However, aside from diagnosis or prognosis in the clinic, cognitive neuroimaging strives to uncover insights from the data, rather than to minimize prediction error. I review various inferences on brain function that have been drawn using pattern-recognition techniques, focusing on decoding. In particular, I discuss using generalization as a test for information, multivariate analysis to interpret overlapping activation patterns, and decoding for principled reverse inference. I give for each a statistical view and a cognitive-imaging view.
Connectomics: Parcellations and Network Analysis Methods (Gael Varoquaux)
Simple tutorial on methods for functional connectome analysis: learning regions, extracting functional signal, inferring the network structure, and comparing it across subjects.
Scikit-learn: statistical learning in Python (Gael Varoquaux)
An introductory presentation on scikit-learn, a statistical-learning (machine learning) toolkit in Python.
Philosophy and strategy of the project, as well as the API and very brief code examples.
Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
Brain reading, compressive sensing, fMRI and statistical learning in Python (Gael Varoquaux)
Talk given at Gipsa-lab on using machine learning to learn, from fMRI, brain patterns and regions related to behavior. This talk focuses on the signal and inverse-problem aspects of the equation, as well as on the software.
The enigmatic Albert Dellschau… author of the no less intriguing diary
In 1969, during an aeronautics exhibition held at the University of Saint Thomas (Houston), Navarro found an old album with clippings and colorful notes belonging to an unknown author. In the yellowed pages of those documents, which resembled a personal diary, airships drawn with exquisite precision could be seen, framed by press clippings of the era referring to the nascent science of aeronautics.
Better neuroimaging data processing: driven by evidence, open communities, an... (Gael Varoquaux)
My current thoughts about methods validity and design in brain imaging.
Data processing is a significant part of a neuroimaging study, and the choice of corresponding methods and tools is crucial. I will give an opinionated view on a path to building better data processing for neuroimaging. I will take examples from endeavors that I contributed to: defining standards for functional-connectivity analysis, the nilearn neuroimaging tool, and the scikit-learn machine-learning toolbox, an industry standard with a million regular users. I will cover not only the technical process (statistics, signal processing, software engineering) but also the epistemology of methods development. Methods govern our results; they are more than a technical detail.
The Art Of Performance Tuning - with presenter notes! (Jonathan Ross)
A somewhat more verbose version of https://www.slideshare.net/JonathanRoss74/the-art-of-performance-tuning.
Presented at JavaOne 2017 [CON4027], this presentation takes a practical, hands-on look at Java performance tuning. It discusses methodology (spoiler: it’s the scientific method) and how to apply it to Java SE systems (on any budget). Exploring concrete examples with tools such as the Oracle Java Mission Control feature of Oracle Java SE Advanced, VisualVM, YourKit, and JMH, the presentation focuses on ways of measuring performance, how to interpret data, ways of eliminating bottlenecks, and even how to avoid future performance regressions.
DARMDN: Deep autoregressive mixture density nets for dynamical system mode... (Balázs Kégl)
Unlike computers, physical engineering systems (such as data center cooling or wireless network control) do not get faster with time. This is arguably one of the main reasons why recent beautiful advances in deep reinforcement learning (RL) stay mostly in the realm of simulated worlds and do not immediately translate to practical success in the real world. In order to make the best use of the small data sets these systems generate, we develop data-driven neural simulators to model the system and apply model-based control to optimize them. In this talk I will present the first step of this research agenda, a new versatile system modelling tool called deep autoregressive mixture density net (DARMDN – pronounced darm-dee-en). We argue that the performance of model-based reinforcement learning is partly limited by the approximation capacity of the currently used conditional density models and show how DARMDN alleviates these limitations. The model, combined with a random shooting controller, establishes a new state of the art on the popular Acrobot benchmark. Our most interesting and counter-intuitive finding is that the "sincos" Acrobot system, which requires no multimodal posterior predictives, can be solved with a deterministic model, but only if it is trained as a probabilistic model. A deterministic model that is trained to minimize MSE leads to prediction-error accumulation.
What is AI, Machine Learning, Neural Networks, Deep Learning and Data Science (Som Shahapurkar)
Intro to AI, Machine Learning, Neural Networks, Deep Learning, and Data Science for everyone. Demystifying jargon and busting myths. It can be viewed with no prior knowledge.
"Big Data made easy with a Spark" is the presentation I gave for ATO (AllThingsOpen) 2018.
In this hands-on session, you will learn how to do a full Big Data scenario from ingestion to publication. You will see how we can use Java and Apache Spark to ingest data, perform some transformations, and save the data. You will then perform a second lab where you will run your very first Machine Learning algorithm!
A brief overview of Real-Time Analytics at Netflix and the challenges we've faced in designing and deploying production ready products based on real-time data.
Evaluating machine learning models and their diagnostic value (Gael Varoquaux)
Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is crucial and difficult. In this talk, I first discuss choosing a metric informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and how to give confidence intervals on the performance estimates.
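To illustrate why class prevalence matters when choosing a metric, a small stdlib-only example: with 10% prevalence, a classifier that never predicts the positive class still scores 90% accuracy, while balanced accuracy reveals that it has learned nothing:

```python
# 10% prevalence of the positive class, and a trivial majority-class predictor.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100

# Plain accuracy: fraction of correct predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Balanced accuracy: average of the per-class recalls, immune to prevalence.
pos = [p for t, p in zip(y_true, y_pred) if t == 1]
neg = [p for t, p in zip(y_true, y_pred) if t == 0]
recall_pos = sum(p == 1 for p in pos) / len(pos)
recall_neg = sum(p == 0 for p in neg) / len(neg)
balanced_accuracy = (recall_pos + recall_neg) / 2

print(accuracy)           # 0.9  -- looks good
print(balanced_accuracy)  # 0.5  -- chance level: the model detects nothing
```

The gap between the two numbers is the whole point: a metric must be chosen for the application, not for flattery.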
Measuring mental health with machine learning and brain imaging (Gael Varoquaux)
The study of mental health relies vastly on behavioral testing and questionnaires. I discuss how machine learning on large brain-imaging cohorts can open new avenues for markers of mental health. My claims are that the challenge lies in the amount of diagnosed cases rather than in the heterogeneity of the conditions, and that we should turn to proxy labels. I then discuss another fundamental challenge to this agenda: the external and construct validity of brain-imaging-based markers.
A tutorial on machine learning to build prediction models with missing values.
The slides cover both theoretical results (statistical learning) and practical advice, with a focus on implementation in Python with scikit-learn.
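As a sketch of the kind of practical advice involved: mean imputation combined with a missingness indicator lets a supervised model exploit the fact that a value was missing. The helper below is illustrative, not the tutorial's code; scikit-learn's `SimpleImputer(add_indicator=True)` packages the same idea:

```python
def impute_with_indicator(rows):
    """Mean-impute each column and append one 0/1 missingness indicator
    per column, so a downstream model can use "was missing" as a signal.
    Missing values are represented as None."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))  # column mean, ignoring NAs
    out = []
    for r in rows:
        filled = [means[j] if r[j] is None else r[j] for j in range(n_cols)]
        indicators = [1.0 if r[j] is None else 0.0 for j in range(n_cols)]
        out.append(filled + indicators)  # features, then indicator columns
    return out

X = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_with_indicator(X))
# [[1.0, 6.0, 0.0, 1.0], [3.0, 4.0, 0.0, 0.0], [2.0, 8.0, 1.0, 0.0]]
```

For prediction, a simple constant imputation plus indicator is often competitive, since the supervised model can learn to compensate; this is one of the classic results the tutorial revisits.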
Dirty data science: machine learning on non-curated data (Gael Varoquaux)
These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number-one hassle of data scientists is cleaning the data to analyze it. Here, I survey the kinds of "dirtiness" that force time-consuming cleaning. We will then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning.
Representation learning in limited-data settings (Gael Varoquaux)
A 4-hour long didactic course on simple notions of representations and how to use them in limited-data settings:
- A supervised learning point of view, giving intuitions and math on what representations are and why they matter
- Building simple unsupervised learning models to extract representation: from matrix decomposition for signals to embeddings of entities
- Evaluating models in limited-data settings, often a bottleneck
This slide-deck was given as a course at the 2021 DeepLearn summer school.
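The matrix-decomposition route to representations mentioned above can be illustrated with plain NumPy: PCA as a truncated SVD of the centered data. This is a bare-bones sketch of the idea, not the course's code:

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Center the data, factor it with an SVD, and keep the leading
    components: the low-dimensional codes are the learned representation."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]   # directions of maximal variance
    codes = Xc @ components.T        # the representation of each sample
    return codes, components

# Synthetic data with 2 true latent factors observed in 10 dimensions.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10))

codes, components = pca_fit_transform(X, n_components=2)
print(codes.shape)  # (200, 2): each sample summarized by 2 numbers
```

Because the synthetic data lies exactly in a 2-dimensional subspace, two components reconstruct it perfectly, a toy version of why unsupervised decompositions help when labeled data is scarce.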
Functional-connectome biomarkers to meet clinical needs? (Gael Varoquaux)
Extracting Functional-Connectome Biomarkers with Machine Learning: a talk in the symposium on how current predictive connectivity models meet clinicians' needs.
This talk is a bit provocative: it first sets out a vision, before offering a few technical suggestions.
Atlases of cognition with large-scale human brain mapping (Gael Varoquaux)
Cognitive neuroscience uses neuroimaging to identify brain systems engaged in specific cognitive tasks. However, unequivocally linking brain systems to cognitive functions is difficult: each task probes only a small number of facets of cognition, while brain systems are often engaged in many tasks. We develop a new approach to generate a functional atlas of cognition, demonstrating brain systems selectively associated with specific cognitive functions. This approach relies upon an ontology that defines specific cognitive functions and the relations between them, along with an analysis scheme tailored to this ontology. Using a database of thirty neuroimaging studies, we show that this approach provides a highly specific atlas of mental functions, and that it can decode the mental processes engaged in new tasks.
Similarity encoding for learning on dirty categorical variables - Gael Varoquaux
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
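The 3-gram similarity recommended above can be sketched in a few lines of plain Python. The helper names (`ngram_similarity`, `similarity_encode`) are hypothetical, shown only to make the idea concrete; the paper's actual implementation lives in the dirty-cat/skrub libraries:

```python
def ngrams(s, n=3):
    """Character n-grams of a string, padded at the boundaries."""
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(values, prototypes):
    """Encode each value as its similarities to prototype categories.

    A generalization of one-hot encoding: with exact-match similarity
    this reduces to a one-hot vector."""
    return [[ngram_similarity(v, p) for p in prototypes] for v in values]

prototypes = ["senior manager", "junior manager", "accountant"]
X = similarity_encode(["senior mngr", "Senior Manager", "acct"], prototypes)
```

A misspelled "senior mngr" still gets a feature vector close to the "senior manager" prototype, which is exactly the redundancy the learning algorithm can exploit.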
Machine learning for functional connectomes - Gael Varoquaux
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
Towards psychoinformatics with machine learning and brain imaging - Gael Varoquaux
Informatics in the psychological sciences brings fascinating challenges, as mental processes and pathologies have fuzzy definitions and are hard to quantify. Brain imaging brings rich data on the neural substrate of these concepts, yet linking the two is non-trivial.
The goal of this presentation is to put forward basic ideas of "psychoinformatics", using advanced processing on brain images to quantify better the elements of psychology.
It discusses how machine learning can bridge brain images to behavior: to better describe the mental processes involved in brain activity, or to extract biomarkers of pathologies, individual traits, or cognition.
Simple representations for learning: factorizations and similarities - Gael Varoquaux
Real-life data seldom comes in the ideal form for statistical learning.
This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products or high-resolution images. I will present an algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] Stochastic subsampling for factorizing huge matrices, A Mensch, J Mairal, B Thirion, G Varoquaux, IEEE Transactions on Signal Processing 66 (1), 113-128
[2] Similarity encoding for learning with dirty categorical variables. P Cerda, G Varoquaux, B Kégl Machine Learning (2018): 1-18
A tutorial on Machine Learning, with illustrations for MR imaging - Gael Varoquaux
Machine learning builds predictive models from data. It is massively used on medical images these days, for a variety of applications ranging from segmentation to diagnosis.
This is an introductory tutorial to machine learning, giving intuitions on the statistical point of view. It introduces the methodology, the concepts behind the central models, the validation framework, and some caveats to look for.
It also discusses applications to drawing conclusions from brain imaging, and uses these applications to highlight various technical aspects of running machine-learning models on high-dimensional data such as medical images.
Estimating Functional Connectomes: Sparsity's Strength and Limitations - Gael Varoquaux
Talk given at the OHBM 2017 education course.
I present the challenges of and techniques for estimating meaningful brain functional connectomes from fMRI: why sparsity in the inverse covariance leads to models that can be interpreted as interactions between regions.
I then discuss the limitations of sparse estimators and introduce shrinkage as an alternative. Finally, I discuss how to compare multiple functional connectomes.
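As a rough illustration of the shrinkage alternative (a Ledoit-Wolf-style sketch, not the talk's exact estimator), the shrunk covariance blends the empirical covariance with a scaled identity, curing the ill-conditioning that plagues the sample covariance when features outnumber samples:

```python
import numpy as np

def shrunk_covariance(X, alpha=0.1):
    """(1 - alpha) * empirical covariance + alpha * (average variance) * I."""
    S = np.cov(X, rowvar=False)
    mu = np.trace(S) / S.shape[0]            # average variance
    return (1 - alpha) * S + alpha * mu * np.eye(S.shape[0])

rng = np.random.RandomState(0)
X = rng.randn(30, 50)                        # fewer samples than features
S = np.cov(X, rowvar=False)                  # singular: rank at most 29
S_shrunk = shrunk_covariance(X)              # well-conditioned, invertible
```

Unlike the raw sample covariance, the shrunk estimate can be inverted, which is what connectome estimation via the inverse covariance requires.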
Talk given at PRNI 2016 for the paper https://arxiv.org/pdf/1606.06439v1.pdf
Abstract — Spatially-sparse predictors are good models for brain decoding: they give accurate predictions and their weight maps are interpretable as they focus on a small number of regions. However, the state of the art, based on total variation or graph-net, is computationally costly. Here we introduce sparsity in the local neighborhood of each voxel with social-sparsity, a structured shrinkage operator. We find that, on brain imaging classification problems, social-sparsity performs almost as well as total-variation models and better than graph-net, for a fraction of the computational cost. It also very clearly outlines predictive regions. We give details of the model and the algorithm.
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli... - Gael Varoquaux
High-level talk about machine learning: the statistical and computational challenges, as well as how they can be answered by the scikit-learn Python toolkit. In French.
Building a cutting-edge data processing environment on a budget
1. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
2. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
3. Growing up as a penniless academic
I did a PhD in
quantum physics
4. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
5. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
6. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Shaped my vision
of computing as a
means to an end
Computers were only one
of the many moving parts
Matlab
Instrument control
7. Growing up as a penniless academic
2011
Tenured researcher
in computer science
8. Growing up as a penniless academic
2011
Today
Tenured researcher
in computer science
Growing team with
data science
rock stars
9. 1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
G Varoquaux
6
12. 1 Encoding models of stimuli
Predicting neural response
→ a window into brain representations of stimuli
"feature engineering" a description of the world
14. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“brain reading”
15. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“if it’s not open and verifiable by others, it’s not
science, or engineering...”
Stodden, 2010
16. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
17. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
18. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
Software development challenge
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
19. 1 Data accumulation
When data processing is routine...
“big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
20. 1 Data accumulation
When data processing is routine...
“big data”
for rich models of
brain function
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions about the results of future observations.”
Stephen Hawking, A Brief History of Time.
Accumulation of scientific knowledge
and learning formal representations
21. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
22. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don't understand the code I have written a year ago
A lab is no different from a startup
Difficulties & risks: recruitment, bus factor, technical debt, limited resources (people & hardware)
23. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don't understand the code I have written a year ago
A lab is no different from a startup
Difficulties & risks: recruitment, bus factor, technical debt, limited resources (people & hardware)
Our mission is to revolutionize brain data processing on a tight budget
25. 2 The data processing workflow
agile
Interaction...
→ script...
→ module...
→ interaction again...
Consolidation, progressively
Low tech and short turn-around times
26. 2 From statistics to statistical learning
Paradigm shift as the dimensionality of data grows:
# features, not only # samples
From parameter inference to prediction
Statistical learning is spreading everywhere
27. 3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
28. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
29. 3 Design philosophy
1. Don't solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What's not excellent won't be used.
Not "one software to rule them all": break down projects by expertise
31. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between ”just works” and versatility
(think Apple vs Linux)
32. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between ”just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover all the 80% use cases in one package
33. 3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
34. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know perfectly Numpy and Scipy
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on blas/lapack
line-profiler/memory-profiler
scipy-lectures.github.io
Cython, not C/C++
35. 3 Performance in high-level programming
Hierarchical clustering, PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need a sparse growable structure for neighborhoods
  skip-list in Cython! O(log n) insert, remove, access
  bind C++ map[int, float] with Cython
- Fast traversal, possibly in Cython, for step 3.
38. 3 Architecture of a data-manipulation toolkit
Separate data from operations, but keep an imperative-like language
Object API exposes a data-processing language:
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects pipeline, merging, etc...
39. 3 Architecture of a data-manipulation toolkit
Separate data from operations, but keep an imperative-like language
Object API exposes a data-processing language:
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects pipeline, merging, etc...
configuration/run pattern: traits, pyre
curry in functional programming: functools.partial
Ideas from MVC pattern
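The fit/predict API sketched on these slides can be illustrated with a toy estimator. `MeanRegressor` is a made-up example, not a scikit-learn class, but it follows the same conventions: parameters at instantiation, data only in `fit`, and `fit` returning `self`:

```python
import numpy as np

class MeanRegressor:
    """Toy estimator: predicts the training mean everywhere."""

    def __init__(self, shrink=0.0):          # parameters at instantiation
        self.shrink = shrink

    def fit(self, X, y):                     # data arrives only here
        self.mean_ = (1 - self.shrink) * np.mean(y)
        return self                          # enables chaining and pipelines

    def predict(self, X):
        return np.full(len(X), self.mean_)

    def score(self, X, y):                   # R^2, as in scikit-learn
        y = np.asarray(y)
        u = ((y - self.predict(X)) ** 2).sum()
        v = ((y - y.mean()) ** 2).sum()
        return 1 - u / v

est = MeanRegressor().fit(np.zeros((4, 2)), [1.0, 2.0, 3.0, 4.0])
pred = est.predict(np.zeros((2, 2)))         # array([2.5, 2.5])
```

Because every step speaks this small language, estimators compose: a pipeline can call `fit` and `transform` on each step without knowing anything else about them.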
41. 4 Biggish data on smallish hardware
"Big data": Petabytes... Distributed storage, computing cluster
Mere mortals: Gigabytes... Python programming, off-the-shelf computers
42. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
43. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
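The running mean alluded to above can be written as a one-pass update, processing one sample at a time in constant memory:

```python
def running_mean(stream):
    """One-pass mean: O(1) memory, one sample at a time."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n          # incremental update
    return mean

# a million numbers, never materialized as a list
result = running_mean(float(i) for i in range(1_000_000))
```

The same incremental-update trick generalizes to variances and to the gradient steps used by on-line learning algorithms.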
44. 4 On-line algorithms
Converges to expectations
Mini-batch = bunch observations for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2): 11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X): 0.62 s
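A minimal sketch of the mini-batch idea behind `sklearn.cluster.MiniBatchKMeans`, in the spirit of Sculley's per-center learning-rate update. The function and its naive deterministic initialization are illustrative only, not the library's algorithm:

```python
import numpy as np

def minibatch_kmeans(X, init_centers, n_iter=100, batch_size=50, seed=0):
    """Update cluster centers from small random batches, never a full pass."""
    rng = np.random.RandomState(seed)
    centers = np.array(init_centers, dtype=float)
    counts = np.zeros(len(centers))
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # assign each batch point to its nearest center
        d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for x, c in zip(batch, labels):
            counts[c] += 1
            centers[c] += (x - centers[c]) / counts[c]  # per-center rate
    return centers

rng = np.random.RandomState(0)
# two well-separated blobs
X = np.vstack([rng.randn(200, 2) + 5, rng.randn(200, 2) - 5])
centers = minibatch_kmeans(X, init_centers=[[1.0, 1.0], [-1.0, -1.0]])
```

Each center is a running mean of the samples assigned to it, which is exactly the on-line mean update above, applied per cluster.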
45. 4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
46. 4 On-the-fly data reduction
Dropping data
1. loop: take a random fraction of the data
2. run algorithm on that fraction
3. aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel
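The three-step loop above can be sketched as a subsample-and-aggregate helper (`subsample_aggregate` is a hypothetical name, shown here estimating a mean on 10% fractions):

```python
import numpy as np

def subsample_aggregate(X, estimator, n_runs=20, fraction=0.1, seed=0):
    """Run `estimator` on random fractions of X, then average the results."""
    rng = np.random.RandomState(seed)
    n_sub = int(fraction * len(X))
    results = []
    for _ in range(n_runs):                  # embarrassingly parallel loop
        idx = rng.choice(len(X), n_sub, replace=False)
        results.append(estimator(X[idx]))    # run on the fraction
    return np.mean(results, axis=0)          # aggregate across sub-samplings

rng = np.random.RandomState(42)
X = rng.normal(loc=3.0, size=100_000)
approx = subsample_aggregate(X, np.mean)     # close to the full-data mean
```

Because the runs are independent, the loop parallelizes trivially, and each worker only ever touches a fraction of the data.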
47. 4 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection: random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration: on images, a super-pixel strategy
Hashing, when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer: stateless, can be used in parallel
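The hashing trick behind `HashingVectorizer` can be sketched without scikit-learn: a stable hash maps each token to one of a fixed number of columns, so there is no vocabulary state to store or share between workers (`hash_vectorize` is an illustrative helper, not the library API):

```python
import hashlib

def hash_vectorize(tokens, n_features=16):
    """Fixed-size count vector via the hashing trick: no vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # stable digest: Python's built-in hash() is salted per process
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % n_features] += 1
    return vec

v1 = hash_vectorize("big data on small hardware".split())
v2 = hash_vectorize("big data on small hardware".split())  # identical: stateless
```

Statelessness is what makes it parallel-friendly: every worker produces consistent columns without any coordination.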
48. 4 On-the-fly data reduction
Example: randomized SVD via random projection
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
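The idea behind `randomized_svd` (random projection, then an exact SVD in the small subspace, in the spirit of Halko et al.) fits in a few lines of NumPy. `randomized_svd_sketch` is an illustrative reimplementation, not the library function:

```python
import numpy as np

def randomized_svd_sketch(X, k, n_oversample=10, seed=0):
    rng = np.random.RandomState(seed)
    # 1. a random projection sketches the range of X
    Q, _ = np.linalg.qr(X @ rng.normal(size=(X.shape[1], k + n_oversample)))
    # 2. exact SVD of the much smaller projected matrix
    U_small, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

rng = np.random.RandomState(0)
# effectively rank-10 matrix plus a little noise
X = (rng.normal(size=(5000, 10)) @ rng.normal(size=(10, 200))
     + 0.01 * rng.normal(size=(5000, 200)))
U, s, Vt = randomized_svd_sketch(X, k=10)
```

The expensive SVD runs on a (k + oversample)-by-p matrix instead of the full n-by-p one, which is where the order-of-magnitude speed-up on the slide comes from.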
49. 4 Biggish iron
Our new box: 48 cores, 384G RAM, 70T storage (SSD cache on RAID controller), 15 k€
Gets our work done faster than our 800-CPU cluster
It's the access patterns!
"Nobody ever got fired for using Hadoop on a cluster"
A. Rowstron et al., HotCDP '12
51. 5 Parallel processing: big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism:
Too fine → overhead
Too coarse → memory shortage
Scale by the relevant cache pool
52. 5 Parallel processing: joblib
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
53. 5 Parallel processing
joblib
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs – memory-friendly
Threads or processes backend
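For comparison, the same embarrassingly parallel loop can be written framework-less with only the standard library, though without joblib's memmapping, traceback reporting, or on-the-fly dispatch:

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt

# same toy loop as the joblib example above
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda i: sqrt(i ** 2), range(8)))
# results == [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

The thread backend shares memory for free; for CPU-bound NumPy code a process pool (and hence joblib's memmapping of arrays) becomes relevant.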
55. 5 Parallel processing: queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
→ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up "slowly"
→ pre_dispatch in joblib
→ back-and-forth communication
Door open to race conditions
56. 5 Parallel processing:
what happens where
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has
a process of its own
Benefit: resource managment in nested for loops
57. 5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
58. 5 Caching
The joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a) and stores the result
c = g(a)    # retrieves the result from the store
59. 5 Caching
The joblib approach
Challenges in the context of big data:
a & b are big
a & b arbitrary Python objects
Design goals:
No dependencies
Drop-in, framework-less code
60. 5 Caching
The joblib approach
Lego bricks for out-of-core algorithms (coming soon)
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
61. 5 Efficient input argument hashing – joblib.hash
Compute md5* of input arguments
Trade-off between features and cost
Black-box
Robust and completely generic
62. 5 Efficient input argument hashing – joblib.hash
Compute md5* of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass data pointer to md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
* md5 is in the Python standard library
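The steps above can be sketched with the standard library alone (my own minimal version, without joblib's ndarray special-casing): a pickler is a state machine that walks the object graph, so feeding its output stream straight into an md5 object hashes arbitrary arguments without materializing the full pickle.

```python
# Sketch: subclass-free variant of the hashing pickler idea --
# a file-like object that feeds every written byte to md5.
import hashlib
import pickle


class _HashWriter:
    """File-like sink: updates an md5 hash instead of storing bytes."""
    def __init__(self):
        self.md5 = hashlib.md5()

    def write(self, data):
        self.md5.update(data)


def arg_hash(*args):
    writer = _HashWriter()
    # the pickler walks the whole object graph; for ndarrays, joblib
    # instead passes the raw data pointer to the md5 "update" method
    pickle.Pickler(writer, protocol=2).dump(args)
    return writer.md5.hexdigest()
```

Equal arguments yield equal digests, which is all the memoize pattern needs to look up a result in the store.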
63. 5 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
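The atomic-rename strategy can be sketched as follows (a simplified toy of mine, not joblib's actual store layout): write into a private temporary directory, then commit with one rename, so concurrent readers see either a complete item or nothing.

```python
# Sketch of an atomic on-disk store: atomic operations + try/except.
import os
import pickle
import shutil
import tempfile


def store(cachedir, key, value):
    tmp = tempfile.mkdtemp(dir=cachedir)      # private scratch directory
    with open(os.path.join(tmp, 'output.pkl'), 'wb') as f:
        pickle.dump(value, f)
    try:
        os.rename(tmp, os.path.join(cachedir, key))  # atomic commit
    except OSError:
        shutil.rmtree(tmp)   # another process won the race: discard ours


def load(cachedir, key):
    with open(os.path.join(cachedir, key, 'output.pkl'), 'rb') as f:
        return pickle.load(f)
```

Because the commit is a single directory rename, no locks are needed, which is what makes the scheme usable on shared cluster filesystems.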
64. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
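Online, in-memory compression with the standard library can look like this (an illustrative sketch, not joblib's code): `zlib.compressobj` consumes successive buffers as they arrive, bypassing the file-oriented gzip module.

```python
# Sketch: streaming zlib compression over a sequence of buffers.
import zlib


def compress_buffers(buffers, level=1):
    # low compression level: trade ratio for speed, so the CPU keeps
    # up with the disk -- important when several workers write at once
    compressor = zlib.compressobj(level)
    chunks = [compressor.compress(buf) for buf in buffers]
    chunks.append(compressor.flush())   # emit any buffered tail bytes
    return b''.join(chunks)
```

The same streaming interface works symmetrically for decompression (`zlib.decompressobj`), so data never needs to exist uncompressed in one contiguous block.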
65. 5 Making I/O fast
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
66. 5 Making I/O fast
Single file dump (coming soon)
File opening is slow on cluster
Challenge: streaming the above for memory usage
67. 5 Making I/O fast
What matters on large systems
Number of bytes stored: brings network/SATA bus down
Memory usage: brings compute nodes down
Number of atomic file accesses: brings shared storage down
68. 5 Benchmarking to np.save and pytables
[Figure: I/O benchmark on NeuroImaging data (MNI atlas); y-axis scale: 1 is np.save]
69. 6 The bigger picture: building
an ecosystem
Helping your future self
70. 6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
≈ 12 core contributors
1 full-time INRIA programmer from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
71. 6 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ≈ 300 emails/month
nipy ≈ 45 emails/month
joblib ≈ 45 emails/month
mayavi ≈ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
72. 6 The economics of open source
Your “benefits” come from a fraction of the code
Data loading?
Maybe?
Standard algorithms?
Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
73. 6 Many eyes makes code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
74. 6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
75. 6 Core project contributors
[Figure: normalized number of commits since 2009-06, per individual committer. Credit: Fernando Perez, Gist 5843625]
76. 6 The tragedy of the commons
Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
77. Solving problems that matter
The 80/20 rule:
80% of the use cases can be solved
with 20% of the lines of code
(I hope)
scikit-learn, joblib, nilearn, ...
@GaelVaroquaux
78. Cutting-edge ... environment ... on a budget
1 Set the goals right
Don’t solve hard problems
What’s your original problem?
79. Cutting-edge ... environment ... on a budget
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
80. Cutting-edge ... environment ... on a budget
3 Don’t forget the human factors
With your users (documentation)
With your contributors
81. Cutting-edge ... environment ... on a budget
A perfect design?