Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
This is an overview of the goals and roadmap for the Yellowbrick model visualization library (www.scikit-yb.org). If you're interested in contributing to Yellowbrick or writing visualizers, this is a good place to get started.
In the presentation we discuss the expected workflow of data scientists interacting with the model selection triple and Scikit-Learn. We describe the Yellowbrick API and it's relationship to the Scikit-Learn API. We introduce our primary object: the Visualizer, an estimator that learns from data and displays it visually. Finally we describe the requirements for developing for Yellowbrick, the tools and utilities in place and how to get started.
Yellowbrick is a suite of visual diagnostic tools called "Visualizers" that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines Scikit-Learn with Matplotlib in the best tradition of the Scikit-Learn documentation, but to produce visualizations for your models!
This presentation was given during the opening session of the 2017 Spring DDL Research Labs.
KSQL is an open source streaming SQL engine for Apache Kafka. Come hear how KSQL makes it easy to get started with a wide-range of stream processing applications such as real-time ETL, sessionization, monitoring and alerting, or fraud detection. We'll cover both how to get started with KSQL and some under-the-hood details of how it all works.
How does Netflix recommend movies? In this presentation we go over a very common technique for recommendations called matrix factorization to predict what rating a user will give a movie. It sounds like a complicated mathematical concept, but all it consists of is finding a set of intermediate features such as action, comedy, etc., and using them to help us determine the ratings.
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
This is an overview of the goals and roadmap for the Yellowbrick model visualization library (www.scikit-yb.org). If you're interested in contributing to Yellowbrick or writing visualizers, this is a good place to get started.
In the presentation we discuss the expected workflow of data scientists interacting with the model selection triple and Scikit-Learn. We describe the Yellowbrick API and it's relationship to the Scikit-Learn API. We introduce our primary object: the Visualizer, an estimator that learns from data and displays it visually. Finally we describe the requirements for developing for Yellowbrick, the tools and utilities in place and how to get started.
Yellowbrick is a suite of visual diagnostic tools called "Visualizers" that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines Scikit-Learn with Matplotlib in the best tradition of the Scikit-Learn documentation, but to produce visualizations for your models!
This presentation was given during the opening session of the 2017 Spring DDL Research Labs.
KSQL is an open source streaming SQL engine for Apache Kafka. Come hear how KSQL makes it easy to get started with a wide-range of stream processing applications such as real-time ETL, sessionization, monitoring and alerting, or fraud detection. We'll cover both how to get started with KSQL and some under-the-hood details of how it all works.
How does Netflix recommend movies? In this presentation we go over a very common technique for recommendations called matrix factorization to predict what rating a user will give a movie. It sounds like a complicated mathematical concept, but all it consists of is finding a set of intermediate features such as action, comedy, etc., and using them to help us determine the ratings.
AWS Neptune - A Fast and reliable Graph Database Built for the CloudAmazon Web Services
Dickson Yue, Solutions Architect, AWS
Amazon Neptune is a fully managed graph database service which has been built ground up for handling rich highly connected data. Come learn how to transform your business with Amazon Neptune and hear diverse use cases such as recommendation engines, knowledge graphs, fraud detection, social networks, network management and life sciences.
Kafka Intro With Simple Java Producer ConsumersJean-Paul Azar
Introduction to Kafka streaming platform. Covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example. Lastly, we added some simple Java client examples for a Kafka Producer and a Kafka Consumer.
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. This is heavily used at companies such as LinkedIn, Uber, Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per sec from Kafka, builds indexes in real-time and serves 100K+ queries per second while ensuring latency SLA of millisecond to sub second.
In the first implementation, we used the Consumer Group feature to manage the offsets and checkpoints across multiple Kafka Consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA at high query workload. But this model posed other challenges - since Kafka maintains offset per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node needed lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
While taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas. This enabled us to copy data from another consumer group during node addition, node failure or increasing the replication group.
In this talk, we will deep dive into the various challenges faced and considerations that went into this design, and learn what makes Pinot resilient to failures both in Kafka Brokers and Pinot Components. We will introduce the new concept of ""lockstep"" sequencing where multiple consumer groups can synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs, and also withstanding high throughput and ingestion.
Developing R Graphical User Interfaces, presented at
1. Workshop on Development of R software for data analysis, Hasselt University, Belgium, March 13th, 2013.
2. Joint Seminar, Medical Epidemiology and Biostatistics Department, Karolinska Institutet, April 4th, 2013.
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob LisiInfluxData
Flux, the new InfluxData data scripting and query language (formerly IFQL), super-charges queries both for analytics and data science. Jacob Lisi from Grafana Labs will give a quick overview of the language features as well as the moving parts for a working deployment. Grafana is an open source dashboard solution that shares Flux’s passion for analytics and data science. For that reason, they are very excited to showcase the new Flux support within Grafana, and a couple of common analytics use cases to get the most out of your data.
In this InfluxDays NYC 2019 talk, Jacob Lisi will share the latest updates they have made with their Flux builder in Grafana.
Extending MariaDB with user-defined functionsMariaDB plc
Learn how to write user-defined functions (UDFs) as stored functions or C/C++ extensions. While MariaDB Server provides a wide range of built-in functions defined in the SQL specification, some applications can benefit from custom-built, server-side functions. Want to build a function to calculate the Tanimoto coefficient between two chemical compounds or genes? In this advanced session, MariaDB's Andrew Hutchings and Sylvain Arbaudie will get you started writing UDFs that could give you a leg up on the competition.
Machine Learning and Data Mining: 14 Evaluation and CredibilityPier Luca Lanzi
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we discuss the evaluation of classification algorithms.
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...DataStax
In this presentation, we will look into JIRAs, JavaDocs and system log entries to gain a deeper understanding on how LCS works under the hood. We will explain what scenarios don't work well for LCS and (more importantly) why. We will leverage legacy TRACE/DEBUG level log for compaction related objects as well as some newer compaction logging information introduced in C* 3.6 (CASSANDRA-10805) to gain better insights.
About the Speakers
Wei Deng Solutions Architect, DataStax
Solutions Architect for DataStax. I have a strong interest in big data, cloud application and distributed computing practices.
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...confluent
Since its release in 2018, KSQL has grown from interesting curiosity into ksqlDB - a production grade streaming system. What does it look like to run KSQL in the enterprise? How has the promise of the Kafka Streams with an SQL dialect worked in the wild?
Let's explore stream processing with ksqlDB in the enterprise. How is it used to rapid prototyping; for taking an idea to production. Using the flexible scripting to help teams with error discover and system introspection. Plus how extended teams can use KSQL as a stepping stone for building and sharing real-time scoring and streaming insights.
This session will cover production deployments of ksqlDB in banking, finance, transport and insurance. What can go wrong, and what can go right. See how teams embrace the technology to solve stream processing challenges.
Big data real time architectures -
How do to big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What Advantages / Pitfalls they contain.
These slides provide an overview of the most popular approaches up to date to solve the task of object detection with deep neural networks. It reviews both the two stages approaches such as R-CNN, Fast R-CNN and Faster R-CNN, and one-stage approaches such as YOLO and SSD. It also contains pointers to relevant datasets (Pascal, COCO, ILSRVC, OpenImages) and the definition of the Average Precision (AP) metric.
Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/
Introduction to KSQL: Streaming SQL for Apache Kafka®confluent
Join Tom Green, Solution Engineer at Confluent for this Lunch and Learn talk covering KSQL. Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka®. It provides an easy-to-use, yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and it supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
By attending one of these sessions, you will learn:
-How to query streams, using SQL, without writing code.
-How KSQL provides automated scalability and out-of-the-box high availability for streaming queries
-How KSQL can be used to join streams of data from different sources
-The differences between Streams and Tables in Apache Kafka
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Given at PyDataSV 2014
In machine learning, clustering is a good way to explore your data and pull out patterns and relationships. Scikit-learn has some great clustering functionality, including the k-means clustering algorithm, which is among the easiest to understand. Let's take an in-depth look at k-means clustering and how to use it. This mini-tutorial/talk will cover what sort of problems k-means clustering is good at solving, how the algorithm works, how to choose k, how to tune the algorithm's parameters, and how to implement it on a set of data.
AWS Neptune - A Fast and reliable Graph Database Built for the CloudAmazon Web Services
Dickson Yue, Solutions Architect, AWS
Amazon Neptune is a fully managed graph database service which has been built ground up for handling rich highly connected data. Come learn how to transform your business with Amazon Neptune and hear diverse use cases such as recommendation engines, knowledge graphs, fraud detection, social networks, network management and life sciences.
Kafka Intro With Simple Java Producer ConsumersJean-Paul Azar
Introduction to Kafka streaming platform. Covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example. Lastly, we added some simple Java client examples for a Kafka Producer and a Kafka Consumer.
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. This is heavily used at companies such as LinkedIn, Uber, Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per sec from Kafka, builds indexes in real-time and serves 100K+ queries per second while ensuring latency SLA of millisecond to sub second.
In the first implementation, we used the Consumer Group feature to manage the offsets and checkpoints across multiple Kafka Consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA at high query workload. But this model posed other challenges - since Kafka maintains offset per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node needed lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
While taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas. This enabled us to copy data from another consumer group during node addition, node failure or increasing the replication group.
In this talk, we will deep dive into the various challenges faced and considerations that went into this design, and learn what makes Pinot resilient to failures both in Kafka Brokers and Pinot Components. We will introduce the new concept of ""lockstep"" sequencing where multiple consumer groups can synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs, and also withstanding high throughput and ingestion.
Developing R Graphical User Interfaces, presented at
1. Workshop on Development of R software for data analysis, Hasselt University, Belgium, March 13th, 2013.
2. Joint Seminar, Medical Epidemiology and Biostatistics Department, Karolinska Institutet, April 4th, 2013.
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob LisiInfluxData
Flux, the new InfluxData data scripting and query language (formerly IFQL), super-charges queries both for analytics and data science. Jacob Lisi from Grafana Labs will give a quick overview of the language features as well as the moving parts for a working deployment. Grafana is an open source dashboard solution that shares Flux’s passion for analytics and data science. For that reason, they are very excited to showcase the new Flux support within Grafana, and a couple of common analytics use cases to get the most out of your data.
In this InfluxDays NYC 2019 talk, Jacob Lisi will share the latest updates they have made with their Flux builder in Grafana.
Extending MariaDB with user-defined functionsMariaDB plc
Learn how to write user-defined functions (UDFs) as stored functions or C/C++ extensions. While MariaDB Server provides a wide range of built-in functions defined in the SQL specification, some applications can benefit from custom-built, server-side functions. Want to build a function to calculate the Tanimoto coefficient between two chemical compounds or genes? In this advanced session, MariaDB's Andrew Hutchings and Sylvain Arbaudie will get you started writing UDFs that could give you a leg up on the competition.
Machine Learning and Data Mining: 14 Evaluation and CredibilityPier Luca Lanzi
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we discuss the evaluation of classification algorithms.
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...DataStax
In this presentation, we will look into JIRAs, JavaDocs and system log entries to gain a deeper understanding on how LCS works under the hood. We will explain what scenarios don't work well for LCS and (more importantly) why. We will leverage legacy TRACE/DEBUG level log for compaction related objects as well as some newer compaction logging information introduced in C* 3.6 (CASSANDRA-10805) to gain better insights.
About the Speakers
Wei Deng Solutions Architect, DataStax
Solutions Architect for DataStax. I have a strong interest in big data, cloud application and distributed computing practices.
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...confluent
Since its release in 2018, KSQL has grown from interesting curiosity into ksqlDB - a production grade streaming system. What does it look like to run KSQL in the enterprise? How has the promise of the Kafka Streams with an SQL dialect worked in the wild?
Let's explore stream processing with ksqlDB in the enterprise. How is it used to rapid prototyping; for taking an idea to production. Using the flexible scripting to help teams with error discover and system introspection. Plus how extended teams can use KSQL as a stepping stone for building and sharing real-time scoring and streaming insights.
This session will cover production deployments of ksqlDB in banking, finance, transport and insurance. What can go wrong, and what can go right. See how teams embrace the technology to solve stream processing challenges.
Big data real time architectures -
How do to big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What Advantages / Pitfalls they contain.
These slides provide an overview of the most popular approaches up to date to solve the task of object detection with deep neural networks. It reviews both the two stages approaches such as R-CNN, Fast R-CNN and Faster R-CNN, and one-stage approaches such as YOLO and SSD. It also contains pointers to relevant datasets (Pascal, COCO, ILSRVC, OpenImages) and the definition of the Average Precision (AP) metric.
Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/
Introduction to KSQL: Streaming SQL for Apache Kafka®confluent
Join Tom Green, Solution Engineer at Confluent for this Lunch and Learn talk covering KSQL. Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka®. It provides an easy-to-use, yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and it supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
By attending one of these sessions, you will learn:
-How to query streams, using SQL, without writing code.
-How KSQL provides automated scalability and out-of-the-box high availability for streaming queries
-How KSQL can be used to join streams of data from different sources
-The differences between Streams and Tables in Apache Kafka
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Given at PyDataSV 2014
In machine learning, clustering is a good way to explore your data and pull out patterns and relationships. Scikit-learn has some great clustering functionality, including the k-means clustering algorithm, which is among the easiest to understand. Let's take an in-depth look at k-means clustering and how to use it. This mini-tutorial/talk will cover what sort of problems k-means clustering is good at solving, how the algorithm works, how to choose k, how to tune the algorithm's parameters, and how to implement it on a set of data.
Tutorial on Scikit Learn I gave at SF Data Mining meetup on May 1st 2017. Review of major parts of the Scikit-Learn API and quick coding exercise on Iris Dataset
Data Science and Machine Learning Using Python and Scikit-learnAsim Jalis
Workshop at DataEngConf 2016, on April 7-8 2016, at Galvanize, 44 Tehama Street, San Francisco, CA.
Demo and labs for workshop are at https://github.com/asimjalis/data-science-workshop
Introduction to Machine Learning with Python and scikit-learnMatt Hagy
PyATL talk about machine learning. Provides both an intro to machine learning and how to do it with Python. Includes simple examples with code and results.
Machine learning in production with scikit-learnJeff Klukas
Presented at PyOhio 2017: https://pyohio.org/schedule/presentation/284/
The Python data ecosystem provides amazing tools to quickly get up and running with machine learning models, but the path to stably serving them in production is not so clear. We'll discuss details of wrapping a minimal REST API around scikit-learn, training and persisting models in batch, and logging decisions, then compare to some other common approaches to productionizing models.
Personal point of view on scikit-learn: past, present, and future.
This talks gives a bit of history, mentions exciting development, and a personal vision on the future.
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnArnaud Joly
We first present the Python programming language and the NumPy package for scientific computing. Then, we devise a digit recognition system highlighting the scikit-learn package.
Realtime predictive analytics using RabbitMQ & scikit-learnAWeber
In this talk, AWeber's Michael Becker describes how to deploy a predictive model in a production environment using RabbitMQ and scikit-learn. You'll see a realtime content classification system to demonstrate this design.
This is the slides for the data science workshop at CDIPS, UC Berkeley on 06-28-2017. It is about general machine learning with a focus on scikit-learn. You can find all the related material: https://github.com/qingkaikong/20170628_ML_sklearn
In this talk by AWeber's Michael Becker, you will get a brief overview of Machine Learning and scikit-learn. This is a scaled down version of this talk from Pycon 2013: http://github.com/jakevdp/sklearn_pycon2013
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
This talk gives an introduction to tree-based methods, both from a theoretical and practical point of view. It covers decision trees, random forests and boosting estimators, along with concrete examples based on Scikit-Learn about how they work, when they work and why they work.
scikit-learn has emerged as one of the most popular open source machine learning toolkits, now widely used in academia and industry.
scikit-learn provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models.
The tutorial will cover basic concepts of machine learning, such as supervised and unsupervised learning, cross validation, and model selection. We will see how to prepare data for machine learning, and go from applying a single algorithm to building a machine learning pipeline.
We will also cover how to build machine learning models on text data, and how to handle very large datasets.
A brief introduction to clustering with Scikit learn. In this presentation, we provide an overview with real examples of how to make use and optimize within k-means clustering.
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
Scikit-learn is a popular machine learning tool. What can it do for you?Why you you want to use it? What can you do with it? Where is it going?In this talk, I will discuss why and how scikit-learn became popular. Iwill argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile.I will shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?
Accelerating Random Forests in Scikit-LearnGilles Louppe
Random Forests are without contest one of the most robust, accurate and versatile tools for solving machine learning tasks. Implementing this algorithm properly and efficiently remains however a challenging task involving issues that are easily overlooked if not considered with care. In this talk, we present the Random Forests implementation developed within the Scikit-Learn machine learning library. In particular, we describe the iterative team efforts that led us to gradually improve our codebase and eventually make Scikit-Learn's Random Forests one of the most efficient implementations in the scientific ecosystem, across all libraries and programming languages. Algorithmic and technical optimizations that have made this possible include:
- An efficient formulation of the decision tree algorithm, tailored for Random Forests;
- Cythonization of the tree induction algorithm;
- CPU cache optimizations, through low-level organization of data into contiguous memory blocks;
- Efficient multi-threading through GIL-free routines;
- A dedicated sorting procedure, taking into account the properties of data;
- Shared pre-computations whenever critical.
Overall, we believe that lessons learned from this case study extend to a broad range of scientific applications and may be of interest to anybody doing data analysis in Python.
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
Scikit-learn is a machine learning library in Python, that has become a valuable tool for many data science practitioners.
This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning.
Apart from metrics for model evaluation, we will cover how to evaluate model complexity, and how to tune parameters with grid search, randomized parameter search, and what their trade-offs are. We will also cover out of core text feature processing via feature hashing.
---------------------------------------------------------
Andreas is an Assistant Research Scientist at the NYU Center for Data Science, building a group to work on open source software for data science. Previously he worked as a Machine Learning Scientist at Amazon, working on computer vision and forecasting problems. He is one of the core developers of the scikit-learn machine learning library, and maintained it for several years.
Material will be posted here:
https://github.com/amueller/pydata-nyc-advanced-sklearn
Blog:
peekaboo-vision.blogspot.com
Twitter:
https://twitter.com/t3kcit
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced Machine learning. In this meet-up, we will demonstrate how to understand and use Xgboost for Kaggle competition. Tong is in Canada and will do remote session with us through google hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
Cheat Sheet for Machine Learning in Python: Scikit-learnKarlijn Willems
Get started with machine learning in Python thanks to this scikit-learn cheat sheet, which is a handy one-page reference that guides you through the several steps to make your own machine learning models. Thanks to the code examples, you won't get lost!
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
In the ever-evolving landscape of technology, enterprise software development is undergoing a significant transformation. Traditional coding methods are being challenged by innovative no-code solutions, which promise to streamline and democratize the software development process.
This shift is particularly impactful for enterprises, which require robust, scalable, and efficient software to manage their operations. In this article, we will explore the various facets of enterprise software development with no-code solutions, examining their benefits, challenges, and the future potential they hold.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
6. The PMML pipeline
A smart pipeline that collects supporting information about
"passed through" features and label(s):
● Name
● Data type (eg. string, float, integer, boolean)
● Operational type (eg. continuous, categorical, ordinal)
● Domain of valid values
● Missing and invalid value treatments
7. Making a PMML pipeline (1/2)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
pipeline = PMMLPipeline([...])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "pipeline.pmml")
9. Pipeline setup (1/3)
1. Feature and label definition
Specific to (J)PMML
2. Feature engineering
3. Feature selection
Scikit-Learn persists all features, (J)PMML persists "surviving" features
4. Estimator fitting
5. Decision engineering
Specific to (J)PMML
10. Pipeline setup (2/3)
Workflow:
1. Column- and column set-oriented feature definition,
engineering and selection
2. Table-oriented feature engineering and selection
3. Estimator fitting
Full support for pipeline nesting, branching.
23. Scikit-Learn challenge
class SelectPercentile(BaseTransformer, SelectorMixin):
def _get_support_mask(self):
scores = self.scores
threshold = stats.scoreatpercentile(scores, 100 - self.percentile)
return (scores > threshold)
Python methods don't have a persistent state:
Exception in thread "main": java.lang.IllegalArgumentException: The
selector object does not have a persistent '_get_support_mask' attribute