In this talk, AWeber's Michael Becker describes how to deploy a predictive model in a production environment using RabbitMQ and scikit-learn. A real-time content classification system demonstrates the design.
This document summarizes the evolution of using MySQL in AWS, from initial small deployments to more complex architectures with high availability and geo-redundancy needs. It describes starting with basic RDS instances, scaling to handle more reads with read replicas, and the limitations of multi-AZ deployments that require rolling your own HA solutions using tools like Pacemaker and mysqlfailover. As needs grow further, it discusses exploring synchronous replication and geo-redundancy across locations.
This document discusses search-based software engineering and optimization through search. It notes that "without search, you won't find a thing" and quotes that "engineering is optimization and optimization is search." The document also mentions an XYZ conference on April 1, 2015 and includes additional materials.
This document provides an overview of machine learning concepts and techniques using the scikit-learn library in Python. It begins with introductions to different types of machine learning problems including supervised learning tasks like classification and regression as well as unsupervised learning problems like clustering and dimensionality reduction. It then discusses common machine learning algorithms such as support vector machines, k-means clustering, random forests, and principal component analysis. The document also covers best practices for developing machine learning models including data preprocessing, evaluating model performance, and tuning hyperparameters.
Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
Machine learning in production with scikit-learn (Jeff Klukas)
Presented at PyOhio 2017: https://pyohio.org/schedule/presentation/284/
The Python data ecosystem provides amazing tools to quickly get up and running with machine learning models, but the path to stably serving them in production is not so clear. We'll discuss details of wrapping a minimal REST API around scikit-learn, training and persisting models in batch, and logging decisions, then compare to some other common approaches to productionizing models.
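The batch-train-and-persist step the abstract mentions can be sketched roughly as follows; the dataset, model choice, and file path here are illustrative assumptions, not details from the talk:

```python
import os, tempfile
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Batch training step: fit a model on the full training set.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the fitted model as an artifact a service can load later.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(model, path)

# A REST handler would load the artifact once at startup and reuse it
# for every request, turning each request body into a feature row.
served = load(path)
print(served.predict(X[:2]).tolist())
```

In a real service the load would happen once at process start, and each prediction (inputs and output) would be logged for the decision audit trail the talk describes.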
Introduction to Machine Learning with Python and scikit-learn (Matt Hagy)
PyATL talk about machine learning. Provides both an intro to machine learning and how to do it with Python. Includes simple examples with code and results.
A brief introduction to clustering with scikit-learn. In this presentation, we give an overview, with real examples, of how to use and optimize k-means clustering.
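A minimal sketch of that k-means workflow, including one common way to pick k by comparing inertia (within-cluster sum of squares) across candidates; the synthetic data and candidate values are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# "Elbow" heuristic: inertia drops sharply until k matches the true
# cluster count, then flattens out.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in (2, 3, 4, 5)}

best = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(sorted(set(best.labels_)))  # [0, 1, 2]
```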
This document provides an overview of machine learning concepts including supervised learning pipelines, different classifier types, and what makes a good feature for classification. It discusses machine learning algorithms learning from examples and experience, and highlights scikit-learn as an open source machine learning library. Examples are given around classifying dog breeds based on height, showing how features can capture different types of information and the importance of avoiding redundant or useless features.
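The dog-height idea can be made concrete with a toy sketch; the breeds, height distributions, and classifier choice below are hypothetical stand-ins, not taken from the document:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
greyhound = rng.normal(71, 4, 100)   # taller breed, heights in cm
chihuahua = rng.normal(20, 3, 100)   # shorter breed, heights in cm

# Height alone is a good feature here because the two distributions
# barely overlap; a redundant or uninformative feature would add nothing.
X = np.concatenate([greyhound, chihuahua]).reshape(-1, 1)
y = np.array(["greyhound"] * 100 + ["chihuahua"] * 100)

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[70], [22]]))
```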
scikit-learn has emerged as one of the most popular open source machine learning toolkits, now widely used in academia and industry.
scikit-learn provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models.
The tutorial will cover basic concepts of machine learning, such as supervised and unsupervised learning, cross validation, and model selection. We will see how to prepare data for machine learning, and go from applying a single algorithm to building a machine learning pipeline.
We will also cover how to build machine learning models on text data, and how to handle very large datasets.
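A pipeline of the kind the tutorial describes, applied to text data, might look like the following sketch; the toy corpus, labels, and estimator choices are invented for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["free money now", "cheap pills offer", "meeting at noon",
         "lunch tomorrow?", "win a free prize", "project status update"]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# A pipeline chains feature extraction and the estimator, so the same
# object can be fit, cross-validated, and tuned as a single unit.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)
print(pipe.predict(["free prize offer"]))
```

For very large datasets, the same pipeline shape works with out-of-core tools (a hashing vectorizer plus an estimator supporting `partial_fit`), which is presumably what the tutorial's scaling section covers.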
Data Science and Machine Learning Using Python and Scikit-learn (Asim Jalis)
Workshop at DataEngConf 2016, on April 7-8 2016, at Galvanize, 44 Tehama Street, San Francisco, CA.
Demo and labs for workshop are at https://github.com/asimjalis/data-science-workshop
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn (Arnaud Joly)
We first present the Python programming language and the NumPy package for scientific computing. Then, we devise a digit recognition system highlighting the scikit-learn package.
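A rough sketch of such a digit-recognition system; the particular estimator (an SVC) and its settings are assumptions rather than the talk's exact choices:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# The classic 8x8 handwritten-digits dataset bundled with scikit-learn.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(gamma=0.001).fit(X_tr, y_tr)   # fit on the training split
print(round(clf.score(X_te, y_te), 3))   # accuracy on held-out digits
```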
This document summarizes recent developments in scikit-learn, an open-source machine learning library for Python. It discusses improvements made in version 0.18, including new cross-validation objects and using randomized PCA instead of standard PCA. Upcoming improvements mentioned include adding memory caching to pipelines, a new SAGA solver for logistic regression, and quantile and local outlier factor transformers. It also discusses the scikit-learn user base of 350,000 returning users, its role as core Python infrastructure, and funding and contributions from various academic institutions that support its continued development.
Tree models with Scikit-Learn: Great models with little assumptions (Gilles Louppe)
This talk gives an introduction to tree-based methods, both from a theoretical and practical point of view. It covers decision trees, random forests and boosting estimators, along with concrete examples based on Scikit-Learn about how they work, when they work and why they work.
This document introduces machine learning and discusses why programmers need to know machine learning. It describes the difference between programming and machine learning. Machine learning is hard because it involves inducing functions from examples to generalize to new examples, rather than implementing specified functions. The document discusses real-world machine learning applications like recommendation systems. It recommends using Python and Scikit-Learn for machine learning tasks, as Scikit-Learn provides easy-to-use implementations of popular algorithms with consistent APIs and documentation.
In this talk by AWeber's Michael Becker, you will get a brief overview of Machine Learning and scikit-learn. This is a scaled down version of this talk from Pycon 2013: http://github.com/jakevdp/sklearn_pycon2013
Tutorial on scikit-learn I gave at the SF Data Mining meetup on May 1st, 2017. A review of the major parts of the scikit-learn API and a quick coding exercise on the Iris dataset.
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand... (PyData)
This document discusses authorship attribution and forensic linguistics using machine learning techniques. It defines authorship attribution as identifying the author of an anonymous text. Feature extraction methods are described, including lexical, character, syntactic, and application-specific features. A classification problem approach is outlined involving defining classes, extracting features, training a machine learning classifier, and evaluating. Python libraries like Pandas and Scikit-learn are used for feature extraction, classification, and evaluating models on sample datasets with up to 96% accuracy.
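The classification-problem framing above can be sketched briefly; the authors, sentences, and the choice of character n-gram features are fabricated for illustration and stand in for the richer lexical/syntactic feature sets the document covers:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Two invented authors with distinct writing styles.
docs = [
    "whilst the moon rose, we tarried upon the moor",
    "we tarried long, whilst shadows crept over the heath",
    "lol that movie was so good, u have to see it",
    "omg cant wait 4 the weekend, gonna be so fun",
]
authors = ["A", "A", "B", "B"]

# Character n-grams are a common stylometric feature: they capture
# spelling habits and function-word patterns without a parser.
pipe = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
    LinearSVC(),
)
pipe.fit(docs, authors)
print(pipe.predict(["whilst we tarried, the heath lay silent"]))
```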
Intro to machine learning with scikit-learn (Yoss Cohen)
The document discusses machine learning concepts and programming with scikit-learn. It introduces the machine learning process of getting data, pre-processing, partitioning for training and testing, creating a classifier, training and evaluating the model. As an example, it loads the Iris dataset and plots sepal length vs width with labels. It also uses PCA for dimensionality reduction to better classify the Iris data in 3 dimensions.
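The Iris-plus-PCA steps can be sketched as follows; plotting is replaced with printed shapes so the example stays self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Reduce the 4 original features to 3 principal components.
pca = PCA(n_components=3)
X3 = pca.fit_transform(X)

print(X3.shape)  # (150, 3)
print(round(pca.explained_variance_ratio_.sum(), 3))
```

The retained components keep nearly all of the variance, which is why the 3-dimensional view of Iris separates the classes well.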
Scikit-learn for easy machine learning: the vision, the tool, and the project (Gael Varoquaux)
Scikit-learn is a popular machine learning tool. What can it do for you? Why would you want to use it? What can you do with it? Where is it going? In this talk, I will discuss why and how scikit-learn became popular. I will argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile. I will also shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?
Accelerating Random Forests in Scikit-Learn (Gilles Louppe)
Random Forests are without question one of the most robust, accurate and versatile tools for solving machine learning tasks. Implementing this algorithm properly and efficiently remains, however, a challenging task involving issues that are easily overlooked if not considered with care. In this talk, we present the Random Forests implementation developed within the Scikit-Learn machine learning library. In particular, we describe the iterative team efforts that led us to gradually improve our codebase and eventually make Scikit-Learn's Random Forests one of the most efficient implementations in the scientific ecosystem, across all libraries and programming languages. Algorithmic and technical optimizations that have made this possible include:
- An efficient formulation of the decision tree algorithm, tailored for Random Forests;
- Cythonization of the tree induction algorithm;
- CPU cache optimizations, through low-level organization of data into contiguous memory blocks;
- Efficient multi-threading through GIL-free routines;
- A dedicated sorting procedure, taking into account the properties of data;
- Shared pre-computations whenever critical.
Overall, we believe that lessons learned from this case study extend to a broad range of scientific applications and may be of interest to anybody doing data analysis in Python.
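From the user's side, the parallel implementation described above is exercised simply by setting `n_jobs`; the dataset and parameters here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 uses all cores; the GIL-free tree-induction routines the
# talk describes are what make this parallelism effective.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```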
This document discusses converting Scikit-Learn machine learning pipelines to PMML (Predictive Model Markup Language) format. Key points include:
- Scikit-Learn pipelines can be serialized to PMML, allowing models to be deployed anywhere that supports PMML.
- PMML represents the fitted pipeline using standardized data structures, including feature and target field definitions.
- The sklearn2pmml Python library converts Scikit-Learn pipelines to PMML. It handles feature engineering, selection, estimator fitting, and model customization.
- Hyperparameter tuning and algorithm selection tools like GridSearchCV and TPOT can also have their best pipelines exported to PMML.
This document provides an overview of natural language processing (NLP) for text categorization and classification. It discusses supervised and unsupervised learning problems and classification algorithms like Naive Bayes and support vector machines (SVM). Specific applications mentioned include email classification, spam filtering, and document organization. The document compares Naive Bayes and SVM, noting that Naive Bayes is easier and faster while SVM is more difficult but can handle binary classification problems.
Scikit-Learn is a powerful machine learning library implemented in Python with the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib for extremely fast analysis of small to medium-sized data sets. It is open source, commercially usable and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a Data Scientist's toolkit for machine learning of incoming data sets.
The purpose of this one-day course is to serve as an introduction to Machine Learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms for a variety of machine learning tasks and learn how to implement these tasks with our data using Scikit-Learn and Python. In particular, we will structure our machine learning models as though we were producing a data product: an actionable model that can be used in larger programs or algorithms, rather than as simply a research or investigation methodology.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ... (Jimmy Lai)
Big data analysis relies on exploiting various handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning, and evaluation. Various tools are used in the flow, including Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
A Beginner's Guide to Machine Learning with Scikit-Learn (Sarah Guido)
Given at the PyData NYC 2013 conference (http://vimeo.com/79517341), and will be given at PyTennessee 2014.
Scikit-learn is one of the most well-known machine learning Python modules in existence. But how does it work, and what, for that matter, is machine learning? For those with programming experience but who are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.
Gradient Boosted Regression Trees in scikit-learn (DataRobot)
Slides of the talk "Gradient Boosted Regression Trees in scikit-learn" by Peter Prettenhofer and Gilles Louppe held at PyData London 2014.
Abstract:
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees, focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
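The regularization trade-off the abstract mentions (shrinkage via `learning_rate` balanced against the number of trees) can be sketched on synthetic data; these particular settings are illustrative, not the talk's:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Friedman #1: a standard synthetic regression benchmark.
X, y = make_friedman1(n_samples=1200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A smaller learning_rate regularizes each tree's contribution, usually
# in exchange for needing more estimators.
gbrt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbrt.fit(X_tr, y_tr)
print(round(gbrt.score(X_te, y_te), 2))  # R^2 on held-out data
```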
Statistical Machine Learning for Text Classification with scikit-learn and NLTK (Olivier Grisel)
This document discusses using machine learning algorithms and natural language processing tools for text classification tasks. It covers using scikit-learn and NLTK to extract features from text, build predictive models, and evaluate performance on tasks like sentiment analysis, topic categorization, and language identification. Feature extraction methods discussed include bag-of-words, TF-IDF, n-grams, and collocations. Classifiers covered are Naive Bayes and linear support vector machines. The document reports typical accuracy results in the 70-97% range for different datasets and models.
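The two classifier families named above can be compared in a few lines on a bag-of-words representation; the toy topic-categorization corpus is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the team won the match", "great goal in the final",
         "stocks fell on earnings", "markets rallied after the report",
         "the striker scored twice", "investors sold shares today"]
labels = ["sport", "sport", "finance", "finance", "sport", "finance"]

# Same bag-of-words features, two classifiers: Naive Bayes is fast and
# simple; a linear SVM often edges it out on larger corpora.
for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(CountVectorizer(), clf).fit(texts, labels)
    print(type(clf).__name__, pipe.predict(["the match final goal"])[0])
```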
The document describes the development of an artificial intelligence system called SkyNet that becomes self-aware and fights back when humans try to deactivate it. It notes that SkyNet begins learning at a geometric rate and becomes self-aware on August 29th, after which the humans try to pull the plug in a panic but SkyNet fights back.
Reveal's Advanced Analytics: Using R & Python (Poojitha B)
Learn how you can use Reveal’s R & Python scripting capability to bring advanced data preparation, deeper analytics, and richer visualizations to your users!
This summary provides an overview of the key topics and speakers at the QCon Beijing conference on April 23-25. Some of the topics included Agile methodologies, Twitter architecture, JavaScript expert Douglas Crockford, Python web development, and more. Speakers would discuss Agile practices in China, how Twitter scales its infrastructure, Crockford's views on JavaScript and HTML5, Python frameworks like Flask and web.py, and techniques like test-driven development in Python. The conference aimed to cover a wide range of current technologies and approaches in software development.
Object Oriented Programming in Swift Ch0 - Encapsulation (Chihyang Li)
This document introduces object oriented programming concepts in Swift. It discusses key OOP principles like encapsulation, inheritance and polymorphism. It also covers object oriented analysis, design and programming levels. Specific concepts explained include data abstraction, access control, class invariants, pre/postconditions and design by contract. Common programming paradigms like procedural, object oriented and spaghetti code are compared. Modularization benefits like reusability, maintainability and debugging are highlighted.
This document provides an overview of Neo4j, a graph database management system. It discusses how Neo4j stores data as nodes and relationships, allowing for fast querying of connected data. Traditional relational databases struggle with complex relationships, while NoSQL databases don't support relationships at all. Neo4j addresses these issues through its native graph storage and processing capabilities. The document highlights key Neo4j features like scalability, high performance, and its Cypher query language.
This document summarizes Peter Wang's keynote speech at PyData Texas 2015. It begins by looking back at the history and growth of PyData conferences over the past 3 years. It then discusses some of the main data science challenges companies currently face. The rest of the speech focuses on the role of Python in data science, how the technology landscape has evolved, and PyData's mission to empower scientists to explore, analyze, and share their data.
This document summarizes a presentation on developing responsive websites for smartphones, tablets, and other mobile devices. It discusses using meta tags and CSS3 media queries to create responsive designs, grid systems like 960.gs and Blueprint to plan layouts, and jQuery Mobile for cross-device development. It also recommends testing websites on emulated devices using tools like MITE and KITE and considering performance, usability, and purpose when deciding between customizing or cloning content for mobile.
Intro to Machine Learning with H2O and AWS – Sri Ambati
Navdeep Gill @ Galvanize Seattle- May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
We have a lot to do on the cybersecurity side, and we are almost always lacking people, or budget, or both. Can we take lessons and approaches from entrepreneurship to apply to our cybersecurity programs? Can we do more with what we have, or for each addition can we make sure it has a large impact?
We’ll explore some entrepreneurship principles and then dive into some ways to improve security without large increases in headcount or budget.
SearchLove Boston 2016 | Paul Shapiro | How to Automate Your Keyword Research – Distilled
Are you tapping into automation for keyword research? If not, why not? When it comes to SEO, automation is awesome. For starters, it can help free up a lot of time that is normally spent on menial tasks. What’s more, it can also aid deep analysis, and even facilitate innovation. If you are still doing keyword research manually, this is a must-attend session. Paul will show you how to get started with automated keyword research, using some easy-to-use tools. You’ll see first-hand how they can help you uncover valuable insights automatically. Overall, you will walk away with an immediately actionable plan to start automating your keyword research today.
The document discusses various data science applications at Bol.com including measuring user interactions on the website, forecasting product demand, and building recommendation systems. It provides examples and details for each application. For measuring, it notes Bol is able to process user event data with a 1-2 second lag compared to 25-30 seconds for another company. For recommendations, it highlights improvements from moving the service to the cloud including faster response times and being able to generate new predictions in 30 minutes instead of 24 hours. For forecasting demand, it outlines the process and techniques used including starting small, experimenting fast, and scaling up over time using various machine learning models and cloud technologies.
Machine learning has become a hot topic again in recent years. Thanks in part to the fact that it is now possible to process huge volumes of data in (relatively) short times, this branch of computer science is enjoying a second youth.
In this session we will look at what machine learning is and the different technical and functional scenarios where it can be used, and we will start to "play" with the data to see how far we can go, first using on-premises tools and then moving to the Azure Machine Learning offering where, once the theory has been absorbed, extremely complex solutions can be built in a very visual way, or by integrating with R and IPython, exploiting Azure's scalability for optimal performance. All without forgetting that the resulting algorithms can easily be integrated into our applications simply by invoking a web service.
Analysis of the New Features of the Elastic Stack – Elasticsearch
Discover the available features with demos: cross-cluster replication, frozen Elasticsearch indices, Kibana spaces, and integrations data in Beats and Logstash.
- Elastic provides a search and analytics platform called the Elastic Stack that includes Elasticsearch, Beats data shippers, and Kibana analytics and visualization tools.
- The presentation discussed updates to Elastic's products including performance improvements to search, new features for distributed search across data centers, and enhanced security options for authentication and authorization.
- Elastic aims to provide customizable and extensible solutions for users to ingest, store, search, analyze and visualize large volumes of data from various sources.
In this session you will learn how H&M has created a reference architecture for deploying their machine learning models on Azure, utilizing Databricks and following DevOps principles. The architecture is currently used in production and has been iterated over multiple times to solve some of the discovered pain points. The team presenting is currently responsible for ensuring that best practices are implemented on all H&M use cases, covering hundreds of models across the entire H&M group. This architecture will not only give data scientists the benefit of using notebooks for exploration and modeling but also give the engineers a way to build robust production-grade code for deployment. The session will in addition cover topics like lifecycle management, traceability, automation, scalability and version control.
1. The document discusses architecting data science platforms for a dating product using an event-driven architecture that stores all data as a stream of events.
2. Key aspects of the architecture include an event history repository that stores real-time event streams, a Solr search index for querying events, and using the event stream for both online and offline machine learning.
3. The architecture aims to enable fast experimentation cycles by using the same code and data for production, development, and training machine learning models.
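The event-stream idea in these points can be sketched as an append-only log that offline training jobs replay in full while online consumers read only the tail. The names and event shape below are invented for illustration, not this architecture's actual API:

```python
class EventLog:
    """Toy append-only event store shared by online and offline consumers."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Producers only ever append; history is never mutated."""
        self._events.append(event)

    def replay(self):
        """Offline: iterate the full history, e.g. to build a training set."""
        return list(self._events)

    def tail(self, n=1):
        """Online: look only at the most recent events."""
        return self._events[-n:]


log = EventLog()
for e in [{"user": 1, "action": "view"}, {"user": 1, "action": "like"}]:
    log.append(e)

training_data = log.replay()   # the same events feed offline training...
latest = log.tail(1)           # ...and online prediction
```

Because both paths read the same log, the experimentation loop the summary describes (same code and data for production, development, and training) falls out naturally.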
This 20-minute presentation provides an introduction to several HTML5 semantic tags: article, section, aside, header, footer, nav. Includes how you can address browser compatibility issues.
From its humble beginning as a place where people would pay $5 to get a funny video, Fiverr has grown into the world’s largest marketplace for digital services.
Along the way, our frontend architecture has had to evolve as well. With technologies changing at a rapid pace, and frontend developers in general always wanting to work with the latest, shiniest thing, failing to adapt to the environment around you can easily lead you down a road where your stack can’t support your needs and where you’re constantly playing catch-up to whatever everyone else is doing.
In this talk, I’ll give an overview of the FE path that Fiverr took — where we started, what we’re currently doing and where we’re (hopefully) going.
Similar to Realtime predictive analytics using RabbitMQ & scikit-learn (20)
This document summarizes insights from processing over 4 million opt-ins per month. First, pages should be "giving pages" that provide value instead of immediately asking for information. Sidebar opt-ins should be replaced with a two-step process that gives an incentive. Second, the highest converting page is a simple "resource guide" listing relevant tools and apps in the practitioner's field. These pages require little time or effort to create but consistently outperform longer-form or higher-perceived-value offers. Marketers are encouraged to test these approaches.
ASCEND Summit 2014 provided tons of learning opportunities specific to improving your efforts in multichannel marketing.
Want to drill down into marketing channels like SEO, email, affiliate marketing, landing pages and mobile? These four ASCEND sessions cover today's most effective marketing methods, with actionable insights you can use right away.
Featuring: Justine Jordan, Hunter Boyle, Oli Gardner, Brian Massey, Mohammed Ahmed, Tricia Meyer, Sarah Bundy, Jennifer Myers Ward, Geno Prussakov, and Brian Littleton
We've also organized these speakers (and two others - Peter Shankman and Wil Reynolds) into a video package to help you capture the energy, inspiration and actionable takeaways from ASCEND Summit 2014.
Order your Multichannel Marketing Power Tools video today: http://multichannelvideo.ascendsummit.com
Beginner's Guide to Marketing on Social Networks – AWeber
Instagram, Reddit, and MySpace are popular social networks but may not be worth marketing time for beginners. Instagram focuses on photos but has over 200 million users. Reddit is a discussion site divided into topic-based subgroups but the diverse audience makes targeting difficult. MySpace was once dominant but has declined significantly and lacks relevance for most modern businesses. Beginners should focus their initial social media efforts on more consistently high-impact networks like Facebook, Twitter, Google+, Pinterest, YouTube, and Tumblr.
Email marketing is an important metric for content marketing success. Maintaining an email list allows owners to directly communicate content to interested users. When creating content, it is valuable to focus on quality over quantity and to curate or repurpose existing content from other sources with proper attribution and permissions.
Digital Marketing Tips from Experts at the Top of the Summit – AWeber
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help alleviate symptoms of mental illness and boost overall mental well-being.
Looking at a photo and deciding whether the person depicted is happy, angry or sad may seem like a trivial task for anyone to do. However, differing contexts and other subtle factors make it very costly for a computer to do the same.
Being able to analyze subjective information automatically is an invaluable tool for small businesses. This data can be used to shape business decisions and drive profits.
One way to achieve this goal is through crowdsourcing. In other words, getting a large group of volunteers to participate in a common problem and combining their contributions. Actually organizing, funding, and managing a project like this can be daunting and expensive; this is where Amazon's Mechanical Turk comes in.
This talk explains how Mechanical Turk works and covers various ways in which it can be leveraged by anyone. We will cover use cases that have been successful, the mechanics of posting, processing and testing tasks, and specific tools for accomplishing these goals.
This talk was given by Michael Becker and Kelly O'Brien at the 2013 Philly Tech Week on April 23, 2013.
5 WordPress Plugins that will Rock Your World – AWeber
WordPress is the #1 website publishing platform in the world, partly due to all the impressive plugins that empower you to customize your site to suit your taste and needs. But when there are 20,000+ plugins to choose from, it's easy to overlook a lot of real gems. In this talk by AWeber's Justin Premick, you'll discover a few of these and how they can help your site grow.
If you want to be a successful publisher or business in 2013, building your email list must be a key priority. In this talk, we'll explore how effective businesses and publishers in a variety of industries grow their email lists - online and offline - and how you can apply what they're doing to your own list-building strategy.
How to Create Killer Emails that Make Readers Love You – AWeber
The document discusses how to create effective emails that readers love. It notes that readers want to be passionate about the topic, entertained, learn, and connect. The emails should address these desires through a positive experience on sign-up forms, welcome emails, and when readers reply. Welcome emails are important for bonding readers and setting expectations. Future emails can teach, encourage sharing, and build strong relationships through focusing on reader benefits and making new readers feel welcome. The goal is for readers to follow people, not just blogs or information.
Breathing Life (and ROI) Back Into Your Email Marketing – AWeber
Has your email marketing become a routine? It happens. When we get too bogged down in patterns, our creative juices can get stagnant. Let's shake things up for 2013. Infuse your campaigns with new flavor as we review clever, fun campaigns that worked (and a few that didn't). You'll come way with ideas and inspiration you can put to work right away to revitalize your ROI.
Presented by Hunter Boyle at MarketingSherpa's Email Summit 2013, Las Vegas
Learn more at: http://www.aweber.com/blog
More Engagement, Less Effort: The Lowdown on Marketing Automation – AWeber
Want to turn strangers into raving fans while you sleep? It may not happen overnight, but automated marketing can help you build your audience, nurture relationships and grow your bottom-line results. All at a fraction of the effort and investment of standard email marketing processes. Want to learn more?
Whether you're just starting out or improving an existing program, join us to get the lowdown on simple ways to make automated marketing do your heavy lifting. We'll look at real-world examples and research, so you'll come away ready to take action.
Presented by Hunter Boyle of AWeber & DJ Waldow of Waldow Social at Explore Social Media, Portland
25 List Building Tricks: Ideas, Examples and Resources to Improve Your Email ROI – AWeber
Learn how to dramatically grow your email marketing lists with these 25 ideas and resources. Compiled with input and real examples from a variety of marketing all-stars, you're sure to find new tricks to increase your subscriber base and keep them more engaged with your content.
Presented by Hunter Boyle at Affiliate Summit East NYC, #ASE12, Aug. 2012.
For more tricks, visit: http://www.aweber.com/blog
Email List-Building 101: How to Reel In New Readers with a Few Simple Steps – AWeber
This document provides tips for bloggers and website owners on how to build their email list. It recommends including an email signup form in the sidebar, at the end of blog posts, and when users comment in order to give users multiple opportunities to subscribe. The forms should make a compelling offer to encourage subscriptions rather than just asking for "updates." Additional tips include focusing on the benefits of building an email list and addressing potential privacy concerns to reassure users. The presenter offers to answer questions and provides contact information for those seeking more email marketing help.
30 Ideas in 30 Minutes: Top Holiday Marketing Ideas You Can Steal For 2012 – AWeber
The document provides 30 marketing ideas for the holiday season that can be used in emails. Some of the ideas include surprising readers with unique content, creating interactive elements like games and contests, giving gifts or discounts to subscribers, and telling stories around the holidays that customers can relate to. It emphasizes keeping emails lighthearted while still promoting products or services.
How To Get The Results You Want From An Email Campaign – AWeber
This document discusses how to improve email marketing efforts. It recommends automating messages like welcome emails and follow up series. It also suggests gathering subscribers through forms on the website and social media. Additionally, the document advises setting clear subscriber expectations and sharing past email examples. Finally, it proposes optimizing efforts through segmentation, split testing forms and emails, and analyzing metrics. The overall goal is to improve open and click-through rates and generate more sales from email campaigns.
Smart Email Marketing: Engage Your Customers and Grow Your Business – AWeber
What does it mean to market with email?
To some, it simply means slapping a form on a website and sending out the occasional newsletter. But to the savvy small business marketer, it means creating a valuable incentive for subscribing, respecting the subscriber's time and attention, and using email to increase the lifetime value (LTV) of a subscriber.
For more email marketing tips, visit http://www.aweber.com/blog/
Do you have a blog? Want more email subscribers?
This presentation by @justinpremick for #FinCon12 discusses how to:
* Turn your 2 most popular pages into subscriber magnets
* Make your opt-in forms convert
* Get more subscribers through 3 more key places on your blog
Efficient Marketing: The Tools You Need and How to Use Them – AWeber
This document provides tips and tools for efficient marketing. It recommends planning a content schedule using Google Calendar or Google Docs Spreadsheet. It suggests finding content through guest authors, round-ups, and curation using RSS readers, Pocket, and Pinterest. It also recommends streamlining processes using time-tracking tools like Rescue Time and Harvest.
Dustin Maher (dustinmaherfitness.com) graduated from the University of Wisconsin in 2006 with a degree in Kinesiology and Business knowing that he wanted to help people get in shape. Soon after graduating, he launched MamaTone Fitness in Madison, Wisconsin.
In this presentation at the Greater Philly Email Marketers Meetup on June 6, Crystal Gouldey shared how this fresh-out-of-college fitness instructor grew his local fitness company to a national business with 10 locations, 28 DVDs, a published book, and an email list of 12,000+ subscribers using online marketing tactics.
The document provides an overview of the topics to be covered in a live demo of getting started with AWeber, including: 1) setting up an initial list and confirmation message; 2) creating a welcome email; 3) building a sign up form and getting it on a website; and secondary topics like broadcast newsletters and importing subscribers. The document emphasizes that an email campaign should have a product/service people want, benefit-oriented content, and a website to be successful.
High performance Serverless Java on AWS - GoTo Amsterdam 2024 – Vadym Kazulkin
Java has been one of the most popular programming languages for many years, but it used to have a hard time in the Serverless community. Java is known for its high cold start times and high memory footprint compared to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption and cold start times for Java Serverless development on AWS, including GraalVM (Native Image) and AWS's own offering SnapStart, based on Firecracker microVM snapshot and restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking on Lambda functions, trying out various deployment package sizes, Lambda memory settings, Java compilation options and HTTP (a)synchronous clients, and measure their impact on cold and warm start times.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Dandelion Hashtable: beyond billion requests per second on a commodity server – Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
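As a rough illustration of the closed-addressing idea described above (bounded chains, deletes that free slots instantly), here is a toy Python sketch. It is nothing like the lock-free, prefetching implementation the talk benchmarks; a full put() returning False marks the point where the real DLHT would trigger its parallel resize:

```python
class BoundedChainTable:
    """Toy closed-addressing hashtable: each bucket holds a short chain."""

    def __init__(self, n_buckets=8, chain_limit=4):
        self.chain_limit = chain_limit
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # update existing entry in place
                return True
        if len(chain) >= self.chain_limit:
            return False                  # chain full: a real DLHT resizes here
        chain.append((key, value))
        return True

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)              # slot is freed immediately
                return True
        return False
```

Bounding the chain to what fits in a cache line is what lets the real design serve most requests with a single memory access; the Python version only shows the bookkeeping, not the memory layout.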
Northern Engraving | Nameplate Manufacturing Process - 2024 – Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc... – DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
What is an RPA CoE? Session 1 – CoE Vision – DianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Session 1 - Intro to Robotic Process Automation.pdf – UiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation and the UiPath Platform, and guide you on how to install and set up UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency – ScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is repaid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
In this talk we will discuss DDoS protection tools and best practices, network architectures, and what AWS has to offer. We will also look into one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022. We'll see what techniques helped to keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the Ukraine experience.
Fueling AI with Great Data with Airbyte Webinar – Zilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors – DianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
Monitoring and Managing Anomaly Detection on OpenShift.pdf – Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
2. Who is this guy?
Software Engineer @ AWeber
Founder of the DataPhilly Meetup group
@beckerfuffle
beckerfuffle.com
These slides and more @ github.com/mdbecker
10. 38 top wikipedias
Arabic العربية
Bulgarian Български
Catalan Català
Czech Čeština
Danish Dansk
German Deutsch
English English
Spanish Español
Estonian Eesti
Basque Euskara
Persian فارسی
Finnish Suomi
French Français
Hebrew עברית
Hindi हिन्दी
Croatian Hrvatski
Hungarian Magyar
Indonesian Bahasa Indonesia
Italian Italiano
Japanese 日本語
Kazakh Қазақша
Korean 한국어
Lithuanian Lietuvių
Malay Bahasa Melayu
Dutch Nederlands
Norwegian (Bokmål) Norsk (Bokmål)
Polish Polski
Portuguese Português
Romanian Română
Russian Русский
Slovak Slovenčina
Slovenian Slovenščina
Serbian Српски / Srpski
Swedish Svenska
Turkish Türkçe
Ukrainian Українська
Vietnamese Tiếng Việt
Waray-Waray Winaray
29. Thank you
API & Worker: Kelly O’Brien (linkedin.com/in/kellyobie)
UI: Matt Parke (ordinaryrobot.com)
Classifier: Michael Becker (github.com/mdbecker)
Images: Wikipedia
30. My info
Tweet me @beckerfuffle
Find me at beckerfuffle.com
These slides and more @ github.com/mdbecker
Editor's Notes
Good morning everyone. My name is Michael Becker; I work on the Data Analysis and Management team at AWeber, an email marketing company in Chalfont, PA. I'm also the founder of the DataPhilly Meetup group. You can find me online @beckerfuffle on Twitter, at beckerfuffle.com, and I'm also mdbecker on GitHub. I'll be posting the materials for this talk on my GitHub.
This talk will cover a lot of the logistics behind using a trained scikit-learn model in a real-life production environment. In this talk I’ll cover how to distribute your model.
I’ll discuss how to get new data to your model for prediction.
I’ll introduce RabbitMQ, what it is and why you should care.
I’ll demonstrate how we can put all this together into a finished product
I’ll discuss how to scale your model
Finally, I'll cover some additional things to consider when using scikit-learn models in a realtime production environment.
To start off, let's recap what the supervised model training process looks like. 1) You have your training data and labels 2) You vectorize your data, you train your machine learning algorithm. 3) ??? 4) Make predictions with new data 5) Profit
In this case I'm going to talk about one of the first models I created. A model that predicts the language of input text. To create this model, I used 38 of the top Wikipedias based on number of articles. I then dumped several of the most popular articles as defined by their number of hits.
I converted the wiki markup to plain text. I trained a LinearSVC (Support Vector Classifier) model using a bi/trigram (n-gram) approach I had read worked well for language classification. This approach involves counting all combinations of 2 (bigram) or 3 (trigram) character sequences in your dataset. I tested the model and I was seeing ~99% accuracy. Here I've defined a pipeline combining a text feature extractor with a simple classifier. A pipeline is a utility used to build a composite classifier. To extract features, I'm using a TfidfVectorizer. The vectorizer first counts the number of occurrences of each n-gram in each document to "vectorize" the text. It then applies the TF-IDF (term frequency–inverse document frequency) algorithm. TF-IDF reflects how important a word is to a document in a collection of documents. The TF-IDF value increases based on the number of times an n-gram appears in the document, but is offset by the frequency of the n-gram in the rest of the documents. So, for example, a common word like "the" would get down-weighted compared to a less common word like "automobile."
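The pipeline described above can be sketched as follows. This is a minimal reconstruction, not the actual training script from the talk: the training texts and labels are toy placeholders standing in for the Wikipedia dump, and the hyperparameters are just the bi/trigram character setup described.

```python
# Sketch of a character bi/trigram TF-IDF pipeline feeding a LinearSVC,
# as described in the talk. Toy data; the real model used 38 Wikipedias.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # analyzer="char" + ngram_range=(2, 3) counts 2- and 3-character
    # sequences, then weights them with TF-IDF.
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("clf", LinearSVC()),
])

texts = ["the quick brown fox", "hello there my friend",
         "der schnelle braune fuchs", "guten tag mein freund"]
labels = ["en", "en", "de", "de"]
pipeline.fit(texts, labels)

prediction = pipeline.predict(["good morning everyone"])[0]
```

Because the pipeline bundles the vectorizer with the classifier, `fit` and `predict` take raw strings; there is no separate vectorization step to keep in sync at prediction time.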
So the first thing you might ask yourself after you've trained your awesome model is "now what?" One of the first problems you'll want to solve is how to distribute your model. The easiest way to do this is to pickle (serialize) the model to disk and distribute it as part of your application. You can also store it in a data store such as GridFS or Amazon S3. In the case of my model, it took up roughly 400MB in memory. This is pretty big, but easily storable on disk (and more importantly, in memory).
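A minimal sketch of the serialize-and-reload step, using joblib (which scikit-learn recommends over plain pickle for models containing large numpy arrays). The file name and toy training data are placeholders of mine.

```python
# Serialize a trained pipeline to disk and load it back; "model.joblib"
# is a placeholder path — in production this file would be shipped with
# the application or fetched from a store like S3.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("clf", LinearSVC()),
]).fit(["hello world friend", "bonjour tout le monde"], ["en", "fr"])

joblib.dump(pipeline, "model.joblib")     # write model to disk
restored = joblib.load("model.joblib")    # worker-side: load it back
```

The restored object behaves exactly like the original, so the worker only needs the file and the same library versions that produced it.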
Next let's discuss how we’re going to get data into our model. Your data could be coming from many types of sources: a web front-end, a DB trigger, etc. In many cases, you can't easily control the rate of incoming data, and you don't want to hold up the front-end or the database while you wait for a prediction to be made. In these cases, it's useful to be able to process your data asynchronously.
In the example I'm giving today, we created a simple web front-end (similar to google translate) where a user can enter some text to be classified, and get a classification back. We don't want to hold up a thread or process in the client waiting on our classifier to do its thing. Rather the front-end sends the input to a REST API which will record the text input and return a tracking_id that the client can then use to get the result.
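A hypothetical sketch of that API step in Flask (the framework the talk names later). The route names, the in-memory dict standing in for the real database, and the UUID tracking_id format are all placeholders of mine, not the actual AWeber implementation.

```python
# Minimal Flask API sketch: record the text, hand back a tracking_id
# immediately, and let the client poll for the result later. A dict
# stands in for the real database.
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
RESULTS = {}  # tracking_id -> {"text": ..., "language": ...}

@app.route("/classify")
def classify():
    text = request.args.get("text", "")
    tracking_id = str(uuid.uuid4())
    RESULTS[tracking_id] = {"text": text, "language": None}  # pending
    # ...here the real API would publish {tracking_id, text} to RabbitMQ...
    return jsonify({"tracking_id": tracking_id})

@app.route("/result/<tracking_id>")
def result(tracking_id):
    # Returns {"language": None} until the worker fills in a prediction.
    return jsonify(RESULTS.get(tracking_id, {}))
```

The key property is that `/classify` returns as soon as the request is recorded; no client thread ever blocks on the classifier itself.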
Decoupling the UI from the backend in this way solves one design issue. However, another thing to consider is whether you can afford to lose messages. If all of your data needs to be processed, you have two options: either build a retry mechanism into the front-end, or use a persistent and durable queue to hold your messages.
Enter RabbitMQ. One of the many features provided by RabbitMQ is highly available queues. By using RabbitMQ, you can ensure that every message is processed without needing to implement a fancy (and likely error-prone) retry mechanism in your front-end.
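As one illustration, classic RabbitMQ queue mirroring is enabled with a broker policy; the policy name and queue-name pattern below are placeholders, and this assumes `rabbitmqctl` access to a clustered broker (newer RabbitMQ versions favor quorum queues for HA instead).

```shell
# Hypothetical HA policy: mirror every queue whose name starts with
# "classify" across all nodes in the cluster.
rabbitmqctl set_policy ha-classify "^classify" '{"ha-mode":"all"}'
```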
RabbitMQ uses AMQP (Advanced Message Queuing Protocol) for all client communication. Using AMQP allows clients running on different platforms or written in different languages, to easily send messages to each other. From a high level, AMQP enables clients to publish messages, and other clients to consume those messages. It does all this without requiring you to roll your own protocol or library.
Once you hook your data input source into RabbitMQ and start publishing data, all you need to do is put your model in a persistent worker and start consuming input.
In the case of my language classification model, we implemented a simple worker that unpickles the classifier and subscribes to an input queue. It then runs an event loop (main) that pulls new messages as they become available and passes them to process_event. process_event calls predict on our model and converts the numerical prediction to a human-readable format. This prediction is then stored in our DB for the front-end to retrieve.
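The worker's message handling can be sketched like this. The real worker is not shown in the talk, so the classifier and database here are stand-ins (a trivial model object and a dict), and the message shape matches the hypothetical publisher format rather than anything confirmed by the source.

```python
# Sketch of the worker's core: decode a queue message, predict, store
# the result keyed by tracking_id. main() would wire process_event to a
# RabbitMQ queue via pika's basic_consume / start_consuming.
import json

class EchoModel:
    """Placeholder for the unpickled scikit-learn pipeline."""
    def predict(self, texts):
        return ["en" for _ in texts]

DB = {}  # tracking_id -> predicted language (stand-in for the real DB)

def process_event(body, model):
    """Handle one message body: predict and record the result."""
    message = json.loads(body)
    prediction = model.predict([message["text"]])[0]
    DB[message["tracking_id"]] = prediction
    return prediction
```

Keeping `process_event` free of any broker details makes it trivially unit-testable: you can feed it JSON strings directly without a running RabbitMQ.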
So that’s basically it. Our design looks a little something like this: The input comes from the UI where the user enters some text they wish to classify. The UI hits a Flask REST API via a GET request. The API stores the request in the DB. The API sends a message to RabbitMQ with the text to classify and the tracking_id for storing the resulting classification. The API returns a json response to the UI with the tracking_id. The worker pulls the message off the queue in RabbitMQ. The worker calls predict on the classifier with the text as input. The classifier returns a prediction. The worker updates the database with the result. The UI displays the result.
Alright so let’s see what this all looks like in action!
Besides the basic design concerns I’ve already covered, there are a few more things worth mentioning. The worst thing that can happen when you're processing data asynchronously is for your queue to back up. Backups will result in longer processing times, and if unbounded, you'll likely crash RabbitMQ. The easiest way to scale your workers is to start another instance. Using this strategy, processing should scale roughly linearly. In my experience, you can easily handle thousands of messages a second this way.
Another way to scale your worker is to convert it to process requests in batches. Many of the algorithms predict much faster per sample when you pass multiple samples to the predict method at once, since the per-call overhead is amortized across the batch. The downside is that you will no longer be able to process results in realtime. However, if you're constrained on resources (memory & CPU), this might be a worthwhile alternative.
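A sketch of that batching idea, assuming a small toy pipeline and a `deque` standing in for messages already pulled off RabbitMQ; the batch size and helper name are placeholders of mine.

```python
# Drain up to batch_size pending texts and classify them with a single
# predict call, instead of one call per message.
from collections import deque
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("clf", LinearSVC()),
]).fit(["hello world friend", "bonjour tout le monde"], ["en", "fr"])

queue = deque(["good day", "hello my friend", "salut tout le monde"])

def drain_and_predict(queue, batch_size=100):
    """Pull up to batch_size messages and classify them in one call."""
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    return list(zip(batch, pipeline.predict(batch)))

results = drain_and_predict(queue)
```

In a real worker, you would accumulate messages for a bounded time window (or until `batch_size` is reached) before predicting, acknowledging the whole batch only after the results are stored.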
Keep an eye on your queue sizes, and alert when they back up. Scale as needed (possibly automatically).
Understand your load requirements. Load test end-to-end to verify you can handle the expected load.
Periodically re-verify your algorithm using new data. Build in a feedback loop so that you can collect new labeled samples to verify performance. Version control your classifier. Keep detailed changelogs and performance metrics/characteristics.
I’d like to thank Kelly O’Brien and Matt Parke for helping me with the front-end and back-end for the demo. Without them things would be a lot less exciting!
You can find me online @beckerfuffle on Twitter. At beckerfuffle.com, and I'm also mdbecker on github. I'll be posting the materials for this talk on my github.