This document outlines the agenda for a tutorial on data wrangling for Kaggle data science competitions. The tutorial covers the anatomy of a Kaggle competition, algorithms for amateur data scientists, model evaluation and interpretation, and hands-on sessions for three sample competitions: Titanic, Data Science London, and PAKDD 2014. The goals are to familiarize participants with competition mechanics, explore algorithms and the data science process, and have participants submit entries to all three competitions by applying algorithms such as CART, random forests, and SVMs to Kaggle datasets.
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt, by Eugene Yan Ziyou
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Kaggle Higgs Boson Machine Learning Challenge, by Bernard Ong
What It Took to Score the Top 2% on the Higgs Boson Machine Learning Challenge. A journey into advanced machine learning model ensembles and stacking methods.
Feature Engineering - Getting most out of data for predictive models, by Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity and accuracy of models. We will cover the analysis of feature distributions and their correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation, and binning), categorical attributes (such as one-hot encoding and feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Examples in Python, scikit-learn, and Spark SQL will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
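The transformations named above are straightforward to sketch. Here is a minimal pure-Python illustration of scaling, one-hot encoding, log transformation and binning (function names are mine, not from the talk):

```python
import math

def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """One-hot encode a list of categorical values (levels sorted for stability)."""
    levels = sorted(set(categories))
    return [[1 if c == lvl else 0 for lvl in levels] for c in categories]

def log_transform(values):
    """log(1 + x), useful for compressing heavy-tailed features."""
    return [math.log1p(v) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 40, 63]
cities = ["NY", "SP", "NY"]
print(min_max_scale(ages))         # 18 maps to 0.0, 63 maps to 1.0
print(one_hot(cities))             # [[1, 0], [0, 1], [1, 0]]
print(equal_width_bins(ages, 2))   # [0, 0, 0, 1]
```

In practice, scikit-learn's preprocessing module offers battle-tested versions of all of these; the point here is only to show how little machinery each transformation involves.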
In this talk, Dmitry shares the approach to feature engineering that he used successfully in various Kaggle competitions. He covers common techniques used to convert features into the numeric representations that ML algorithms require.
Slides to support Austin Machine Learning Meetup, 1/19/2015.
Overview of techniques from recent Kaggle code for online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hash trick.
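FTRL-proximal itself is compact enough to sketch. The following is a minimal pure-Python version of the approach described (per-coordinate adaptive learning rates, L1/L2 regularization, and the hash trick for string features). Class and parameter names are mine, and the update rule follows the commonly published FTRL-proximal formulation rather than any specific Kaggle script:

```python
import math
import zlib

class FTRLProximal:
    """Online logistic regression with FTRL-proximal updates and the hash trick."""

    def __init__(self, dim=2**20, alpha=0.5, beta=1.0, l1=0.1, l2=1.0):
        self.dim, self.alpha, self.beta = dim, alpha, beta
        self.l1, self.l2 = l1, l2
        self.z = [0.0] * dim  # per-coordinate accumulated adjusted gradients
        self.n = [0.0] * dim  # per-coordinate accumulated squared gradients

    def _indices(self, features):
        # hash trick: map arbitrary string features into a fixed index space
        return [zlib.crc32(f.encode()) % self.dim for f in features]

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps rarely-useful coordinates at exactly zero
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, features):
        wtx = sum(self._weight(i) for i in self._indices(features))
        return 1.0 / (1.0 + math.exp(-max(min(wtx, 35.0), -35.0)))

    def update(self, features, y):
        p = self.predict(features)
        g = p - y  # log-loss gradient for binary (0/1) features
        for i in self._indices(features):
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

# tiny demo: one feature perfectly predicts the label
model = FTRLProximal(dim=2**16)
for _ in range(300):
    model.update(["color=red"], 1)
    model.update(["color=blue"], 0)
```

Because the weights live in a fixed-size hashed space, memory stays bounded no matter how many distinct raw feature strings the stream contains, which is what makes this approach attractive for large click-through datasets.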
Gradient Boosted Regression Trees in scikit-learn, by DataRobot
Slides of the talk "Gradient Boosted Regression Trees in scikit-learn" by Peter Prettenhofer and Gilles Louppe held at PyData London 2014.
Abstract:
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, and the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
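To make the stagewise intuition concrete, here is a toy gradient-boosting loop for least-squares regression using depth-1 stumps, in pure Python. The talk itself uses scikit-learn's estimators; this sketch only mirrors the idea that each stage fits the previous stages' residuals:

```python
def fit_stump(x, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(x))[:-1]:          # the largest value cannot split
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1:]                         # (threshold, left value, right value)

def gbrt_fit(x, y, n_stages=100, lr=0.1):
    """Stagewise boosting: each stump is fit to the current residuals."""
    f0 = sum(y) / len(y)                    # initial model: the mean
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lval, rval = fit_stump(x, residuals)
        stumps.append((t, lval, rval))
        pred = [p + lr * (lval if xi <= t else rval) for p, xi in zip(pred, x)]
    return f0, lr, stumps

def gbrt_predict(model, xi):
    f0, lr, stumps = model
    return f0 + sum(lr * (l if xi <= t else r) for t, l, r in stumps)

# step-function data: predictions converge to 1 on the left, 5 on the right
model = gbrt_fit([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
print(round(gbrt_predict(model, 2), 2), round(gbrt_predict(model, 5), 2))
```

The learning rate `lr` is the shrinkage regularizer discussed in the talk: smaller values need more stages but generalize better.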
Our fall 12-week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meetup and learn how easily you can use R for advanced machine learning. In this meetup, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will join the session remotely through Google Hangouts.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisites (if any): R, calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
Misha Bilenko, Principal Researcher, Microsoft, at MLconf SEA - 5/01/15, by MLconf
Many Shades of Scale: Big Learning Beyond Big Data: In the machine learning research community, much of the attention devoted to ‘big data’ in recent years has been manifested as development of new algorithms and systems for distributed training on many examples. This focus has led to significant advances in the field, from basic but operational implementations on popular platforms to highly sophisticated prototypes in the literature. In the meantime, other aspects of scaling up learning have received relatively little attention, although they are often more pressing in practice. The talk will survey these less-studied facets of big learning: scaling to an extremely large number of features, to many components in predictive pipelines, and to multiple data scientists collaborating on shared experiments.
Slides of the talk at http://www.meetup.com/R-Users-Sydney/events/223867196/
There is a web version here: http://wush978.github.io/FeatureHashing/index.html
Comparison Study of Decision Tree Ensembles for Regression, by Seonho Park
Nowadays, decision tree ensemble methods are widely used for solving classification and regression problems due to their rigor and robustness. Compared with classification, their performance on regression problems has not yet been addressed in detail. In this presentation, we review the state-of-the-art decision tree ensemble methods in scikit-learn and XGBoost for regression. Empirical results are also presented to compare their performance and computational efficiency.
Feature Engineering - Getting most out of data for predictive models - TDC 2017, by Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity and accuracy of models. We will cover the analysis of feature distributions and their correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation, and binning), categorical attributes (such as one-hot encoding and feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Examples in Python, scikit-learn, and Spark SQL will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
What is boosting
The boosting algorithm
Building models using GBM
Main algorithm parameters
Fine-tuning models
Hyperparameters in GBM
Validating GBM models
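The last items of the outline above come down to searching hyperparameter values and scoring each candidate on held-out data. Below is a minimal sketch of that workflow, with a one-parameter ridge-style model standing in for a GBM (the stand-in model and all names are mine; the cross-validation loop is the point):

```python
def k_folds(n, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    idx = list(range(n))
    size = n // k
    for i in range(k):
        val = idx[i * size:(i + 1) * size] if i < k - 1 else idx[i * size:]
        train = [j for j in idx if j not in val]
        yield train, val

def fit_ridge_1d(xs, ys, lam):
    """One-parameter model y = w*x with L2 penalty lam (a stand-in for GBM
    knobs such as learning rate or tree depth)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(x, y, lam, k=3):
    """Mean squared validation error of the given hyperparameter, averaged
    over the folds."""
    total, count = 0.0, 0
    for train, val in k_folds(len(x), k):
        w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - w * x[i]) ** 2 for i in val)
        count += len(val)
    return total / count

x = [1, 2, 3, 4, 5, 6]
y = [2 * xi for xi in x]                   # noiseless y = 2x
best_lam = min([0.0, 1.0, 10.0], key=lambda lam: cv_error(x, y, lam))
print(best_lam)   # noiseless data needs no regularization
```

With a real GBM you would sweep learning rate, tree depth and the number of stages the same way, always ranking candidates by held-out error rather than training error.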
Overview of tree algorithms from decision tree to xgboost, by Takami Sato
To deepen my understanding, I surveyed popular tree algorithms in machine learning and their evolution. This is the first time I have written a presentation in English, so I would be happy to receive feedback.
How to win data science competitions with Deep Learning, by Sri Ambati
Note: Please download the slides first, otherwise some links won't work!
How to win Kaggle-style data science competitions and influence decisions with R, Deep Learning, and H2O's fast algorithms.
We take a few public and Kaggle datasets and build models to win competitions on accuracy and scoring speed.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai, at MLconf SEA - 5/20/16, by MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as they require training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
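The core of stacking, independent of any particular implementation such as H2O Ensemble, is that the metalearner is trained on out-of-fold base-learner predictions. A small pure-Python sketch (learner and function names are mine; the metalearner here is just a grid search over a convex blend of two base learners):

```python
def oof_predictions(x, y, learner):
    """Leave-one-out predictions: the metalearner must never see a base
    learner's prediction on a point that learner was trained on."""
    preds = []
    for i in range(len(x)):
        x_tr = x[:i] + x[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        preds.append(learner(x_tr, y_tr, x[i]))
    return preds

def mean_learner(xs, ys, x_new):
    """Weak baseline: always predict the training mean."""
    return sum(ys) / len(ys)

def nn_learner(xs, ys, x_new):
    """Nearest-neighbor regressor on a 1-D feature."""
    j = min(range(len(xs)), key=lambda k: abs(xs[k] - x_new))
    return ys[j]

def mse(preds, y):
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def stack(x, y, learner_a, learner_b):
    """Metalearning step: pick the convex blend of two base learners that
    minimizes error on the out-of-fold predictions."""
    pa = oof_predictions(x, y, learner_a)
    pb = oof_predictions(x, y, learner_b)
    return min(
        (mse([w * a + (1 - w) * b for a, b in zip(pa, pb)], y), w)
        for w in [i / 10 for i in range(11)])

x = list(range(1, 11))
y = [2 * xi for xi in x]
err, weight_on_nn = stack(x, y, nn_learner, mean_learner)
```

Real stacking implementations replace the grid-searched blend with a trained metalearner (often a GLM) over many base learners, but the cross-validated plumbing is the same, and it is exactly this plumbing that accounts for the computational cost mentioned above.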
R, Data Wrangling & Kaggle Data Science Competitions, by Krishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my PyCon tutorial, plus a few updates.
It is a work in progress; I will update it with daily snapshots until done.
Discrete-event simulation: best practices and implementation details in Python, by Carlos Natalino da Silva
Discrete-event simulation is one of the most useful techniques for quickly and effectively evaluating the performance of systems. It enables benchmarking proposed strategies against existing ones in a time- and computing-efficient manner. However, there are several aspects that should be considered when designing and implementing your simulation environment. In this tutorial, a number of best practices for designing and implementing event-driven simulations will be discussed. A use case of routing in optical networks will be used as an example. The implementation of the main simulator components using Java and Python will be described.
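The heart of any discrete-event simulator, whatever the language, is a clock plus a priority queue of pending events. A minimal Python sketch of that engine (names are illustrative, not from the tutorial):

```python
import heapq

class Simulator:
    """Minimal discrete-event engine: a priority queue of (time, seq, event)."""

    def __init__(self):
        self.queue = []
        self.now = 0.0
        self._seq = 0   # tie-breaker so equal-time events stay FIFO

    def schedule(self, delay, action):
        """Schedule a zero-argument callable to run `delay` time units from now."""
        heapq.heappush(self.queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self, until):
        """Pop events in time order, advancing the clock, until `until`."""
        while self.queue and self.queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self.queue)
            action()

# example: periodic arrivals counted until t = 10
sim = Simulator()
arrivals = []

def arrive():
    arrivals.append(sim.now)
    sim.schedule(2.0, arrive)   # next arrival in 2 time units

sim.schedule(0.0, arrive)
sim.run(until=10.0)
print(arrivals)   # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Note that the clock jumps directly from event to event rather than ticking in fixed steps; this jump-to-next-event structure is what makes discrete-event simulation computationally efficient.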
Genetic Algorithms (GAs) are a metaheuristic search technique belonging to the class of Evolutionary Algorithms (EAs). They have been proven effective in addressing problems in many fields, but they also suffer from scalability issues that may keep them from finding valid application to real-world problems. Thus, the aim of providing highly scalable GA-based solutions, together with the reduced costs of parallel architectures, motivates research on Parallel Genetic Algorithms (PGAs). Cloud computing may be a valid option for parallelisation, since there is no need to own the physical hardware, which can be purchased from cloud providers for the desired time, quantity and quality. There are different employable cloud technologies and approaches for this purpose, but they all introduce communication overhead. Thus, one might wonder if, and possibly when, specific approaches, environments and models show better performance than sequential versions in terms of execution time and resource usage.
This thesis investigates if and when GAs can scale in the cloud using specific approaches. Firstly, Hadoop MapReduce is exploited by designing and developing an open-source framework, elephant56, that reduces the effort of developing GAs and speeds them up using three parallel models. The performance of the framework is then evaluated through an empirical study. Secondly, software containers and message queues are employed to develop, deploy and execute PGAs in the cloud, and the devised system is evaluated with an empirical study on a commercial cloud provider. Finally, cloud technologies are also explored for the parallelisation of other EAs, by designing and developing cCube, a collaborative microservices architecture for machine learning problems.
This thesis for a PhD in Management & Information Technology was presented on 20th April 2017 at the University of Salerno, Fisciano, Italy.
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks..., by Rodney Joyce
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
With interest in AI-based applications growing, and companies like IBM, Google, Microsoft and NVIDIA investing heavily in computing and software applications, interest in Deep Learning has grown in the past few years. With advances in software and hardware technologies, Neural Networks are making a resurgence.
In this workshop, we will discuss the basics of Neural Networks and discuss how Deep Learning Neural networks are different from conventional Neural Network architectures. We will review a bit of mathematics that goes into building neural networks and understand the role of GPUs in Deep Learning. We will also get an introduction to Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks and understand the state-of-the-art in hardware and software architectures. Functional Demos will be presented in Keras, a popular Python package with a backend in Theano and Tensorflow.
Paolo Lucente, author of the software pmacct.
Traffic matrices can greatly benefit key IP Service Provider activities like capacity planning, traffic engineering, understanding traffic patterns, and making meaningful peering decisions. This talk presents a way to build traffic matrices using telemetry data and BGP, leveraging along the way some case studies with a technical cut. pmacct (http://www.pmacct.net) is a commonly used, free, open-source IPv4/IPv6 accounting package which integrates a NetFlow/sFlow collector and a multi-RIB BGP collector in a single piece of software and is authored by the presenter.
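Conceptually, a traffic matrix is just flow records aggregated by (ingress, egress) pair. A toy Python sketch with made-up AS names, not pmacct's actual data model:

```python
from collections import defaultdict

# toy flow records as (ingress_peer, egress_peer, bytes) tuples; in practice
# these come from NetFlow/sFlow samples correlated with BGP routing data
flows = [
    ("AS65001", "AS65002", 1200),
    ("AS65001", "AS65003", 800),
    ("AS65002", "AS65003", 500),
    ("AS65001", "AS65002", 300),
]

matrix = defaultdict(int)
for src, dst, nbytes in flows:
    matrix[(src, dst)] += nbytes   # aggregate bytes per (ingress, egress) pair

print(dict(matrix))
# {('AS65001', 'AS65002'): 1500, ('AS65001', 'AS65003'): 800, ('AS65002', 'AS65003'): 500}
```

The hard parts in production, which the talk addresses, are attributing each sampled flow to the right ingress/egress points using BGP state and doing so at line-rate scale; the aggregation itself stays this simple.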
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, posted by Dataconomy Media
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, Lead Developer at Valo.io
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2017: http://bit.ly/1WMJAqS
About the Author:
Tobias is technical lead developer for Valo.io in London. He has a background in the financial sector as a front-office developer but changed track in 2013 to be part of a team building a new real-time analytics platform from the ground up. His goal is to outlive the JVM and his tea addiction. This is his first appearance on the conference scene as a speaker.
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017, by StampedeCon
This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google's scalable, distributed, GPU-powered compute graph engine that machine learning practitioners use for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.
Interest in Deep Learning has been growing in the past few years. With advances in software and hardware technologies, Neural Networks are making a resurgence. With interest in AI based applications growing, and companies like IBM, Google, Microsoft, NVidia investing heavily in computing and software applications, it is time to understand Deep Learning better!
In this workshop, we will discuss the basics of Neural Networks and discuss how Deep Learning Neural networks are different from conventional Neural Network architectures. We will review a bit of mathematics that goes into building neural networks and understand the role of GPUs in Deep Learning. We will also get an introduction to Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks and understand the state-of-the-art in hardware and software architectures. Functional Demos will be presented in Keras, a popular Python package with a backend in Theano and Tensorflow.
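As a reminder of what a framework like Keras computes under the hood, here is a two-layer network written out in pure Python, with weights hand-picked to implement XOR (the workshop's demos use Keras; this sketch only shows the layered forward pass, with no training):

```python
import math

def dense(weights, biases, inputs, activation):
    """One fully connected layer: out_j = act(sum_i W[j][i] * x[i] + b[j])."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# hand-picked weights implementing XOR: the hidden layer computes
# h1 = relu(x1 + x2) and h2 = relu(x1 + x2 - 1); the output unit
# fires only when exactly one input is 1
hidden_w, hidden_b = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
out_w, out_b = [[1.0, -2.0]], [-0.5]

def xor_net(x):
    h = dense(hidden_w, hidden_b, x, relu)
    return dense(out_w, out_b, h, sigmoid)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net([a, b]), 3))
```

In Keras the same network is two `Dense` layers in a `Sequential` model, and backpropagation learns weights like these from data instead of having them supplied by hand.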
Amazon SageMaker is a new cloud service that makes it easy to build ML solutions by quickly connecting data, algorithms, and frameworks for machine learning. In this session, we will run hands-on exercises with some of the most commonly used algorithms, using training data stored in Amazon S3. We will use well-known open-source frameworks such as TensorFlow and Keras, as well as Apache MXNet and Gluon.
Data Science with Spark - Training at Spark Summit (East), by Krishna Sankar
Slide set of the training we gave at Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
The video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
Notes about Amazon VPC, a canonical architecture, and finally how to implement MongoDB replica sets. My blog http://goo.gl/0guF2 has the color pictures, and the file is at http://doubleclix.files.wordpress.com/2012/10/vpc-distilled-04.pdf. For some reason, SlideShare trims the colors.
My talk on NoSQL at OGF29, updated with the OSCON '10 presentation. Updates do not work reliably on SlideShare, so I also keep the latest version on my blog.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deployment Firewall and DBOM, by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
3. Agenda [1 of 3]
o Intro, Goals, Logistics, Setup [10] [1:20-1:30)
o Anatomy of a Kaggle Competition [60] [1:30-2:30)
• Competition Mechanics
• Register, download data, create sub directories
• Trial Run : Submit Titanic
o Algorithms for the Amateur Data Scientist [20] [2:30-2:50)
• Algorithms, Tools & frameworks in perspective
• “Folk Wisdom”
o Break (30 Min) [2:50-3:10)
4. Agenda [2 of 3]
o Model Evaluation & Interpretation [30] [3:10-3:40)
• Confusion Matrix, ROC Graph
o Session 1 : The Art of a Competition – DS London + Scikit-learn [30 min] [3:40-4:10)
• Dataset Organization
• Analytics Walkthrough
• Algorithms - CART, RF, SVM
• Feature Extraction
• Hands-on Analytics programming of the challenge
• Submit entry
o Session 2 : The Art of a Competition – ASUS,PAKDD 2014 [20 min] [4:10-4:30)
• Dataset Organization
• Analytics Walkthrough, Transformations
5. Agenda [3 of 3]
o Questions, Discussions & Slack [10 min] [4:30-4:40]
o Schedule
• 12:20 – 1:20 : Lunch
• 1:20 – 4:40 : Tutorial (1:20-2:50;2:50-3:10:Break;3:10-4:40)
Overload Warning … There is enough material for a week’s training … which is good & bad !
Read thru at your pace, refer, ponder & internalize
6. Goals & Assumptions
o Goals:
• Get familiar with the mechanics of Data Science Competitions
• Explore the intersection of Algorithms, Data, Intelligence, Inference & Results
• Discuss Data Science Horse Sense ;o)
o At the end of the tutorial you should have :
• Submitted entries for 3 competitions
• Applied Algorithms on Kaggle Data
• CART, RF
• Linear Regression
• SVM
• Explored Data, have a good understanding of the Analytics Pipeline viz.
collect-store-transform-model-reason-deploy-visualize-recommend-infer-explore
• Knowledge of Model Evaluation
• Cross Validation, ROC Curves
7. Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthrough
— 3rd
◦ Listen To Lectures
◦ More competitions …
9. Setup
o Install anaconda
o update (if needed)
• conda update conda
• conda update ipython
• conda update python
o conda install pydot
o ipython notebook --pylab=inline
o Go to ipython tab in the browser
o Import packages & print version #
10. Tutorial Materials
o Github : https://github.com/xsankar/freezing-bear
o Clone or download zip
o Open terminal
o cd ~/freezing-bear/notebooks
o ipython notebook --pylab=inline
o Click on ipython dashboard
o Just look thru the ipython notebooks
11. Tutorial Data
o Setup an account in Kaggle (www.kaggle.com)
o We will be using the data from 3 Kaggle competitions
① Titanic: Machine Learning from Disaster
• Download data from http://www.kaggle.com/c/titanic-gettingStarted
• Directory ~/freezing-bear/notebooks/titanic
② Data Science London + Scikit-learn
• Download data from http://www.kaggle.com/c/data-science-london-scikit-learn
• Directory ~/freezing-bear/notebooks/dsl
③ PAKDD 2014 - ASUS Malfunctional Components Prediction
• Download data from http://www.kaggle.com/c/pakdd-cup-2014
• Directory ~/freezing-bear/notebooks/asus
13. Kaggle Data Science Competitions
o Hosts Data Science Competitions
o Competition Attributes:
• Dataset
• Train
• Test (Submission)
• Final Evaluation Data Set (We don’t see)
• Rules
• Time boxed
• Leaderboard
• Evaluation function
• Discussion Forum
• Private or Public
14. Titanic
Titanic Passenger Metadata
• Small
• 3 Predictors: Class, Sex, Age
• Survived?
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
Data Science London + Scikit-learn Competition
PAKDD 2014 – Predict Component Failures … ASUS
15. Train.csv
Taken from the Titanic Passenger Manifest

Variable   Description
Survived   0 = No, 1 = Yes
Pclass     Passenger Class (1st, 2nd, 3rd)
Sibsp      Number of Siblings/Spouses Aboard
Parch      Number of Parents/Children Aboard
Embarked   Port of Embarkation
           o C = Cherbourg
           o Q = Queenstown
           o S = Southampton
17. Approach
o This is a classification problem - 0 or 1
o Troll the forums !
o Opportunity for us to try different algorithms & compare them
• Simple Model
• CART[Classification & Regression Tree]
• Greedy, top-down binary, recursive partitioning that divides feature space into sets
of disjoint rectangular regions
• RF
• Different parameters
• SVM
• Multiple kernels
• Table the results
o Use cross validation to predict our model performance & correlate with what Kaggle
says
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
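The comparison plan above can be sketched as a small loop: fit each model family on the same data, cross-validate, and table the scores. This is a minimal sketch; the data here is synthetic (standing in for the Titanic features) and the model parameters are illustrative, not tuned.

```python
# Sketch of the comparison table: cross-validate CART, Random Forest and SVM
# on the same (here synthetic) data and print the scores side by side.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # stand-in for Pclass/Sex/Age features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for Survived

models = {
    "CART": DecisionTreeClassifier(max_depth=3, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (rbf)": SVC(kernel="rbf"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross validation
    print(f"{name:10s} CV accuracy: {scores.mean():.3f}")
```

The cross-validated means are what you would correlate against the Kaggle leaderboard score.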
18. Simple Model – Our First Submission
o #1 : Simple Model (M=survived)
o #2 : Simple Model (F=survived)
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
Refer to iPython notebook <1-Intro-To-Kaggle> at https://github.com/xsankar/freezing-bear
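A minimal sketch of simple model #2 (F=survived): predict 1 for every female passenger and 0 otherwise. The column name 'Sex' follows the Kaggle train.csv manifest; the sample rows are illustrative stand-ins for the real file.

```python
# Gender baseline for Titanic: everyone female survives, everyone male does not.
def gender_model(rows):
    """rows: iterable of dicts with a 'Sex' field -> list of 0/1 predictions."""
    return [1 if row["Sex"] == "female" else 0 for row in rows]

sample = [{"Sex": "female"}, {"Sex": "male"}, {"Sex": "female"}]
print(gender_model(sample))  # [1, 0, 1]
```

Submitting exactly this baseline gives you a working end-to-end pipeline before any real modeling.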
19. #3 : Simple CART Model
o CART (Classification & Regression Tree)
http://scikit-learn.org/stable/modules/tree.html
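A minimal CART sketch with scikit-learn's DecisionTreeClassifier, per the link above. The features mimic (Pclass, Sex encoded 0/1, Age); the rows are synthetic, not the real train.csv.

```python
# CART = a single decision tree; max_depth keeps the greedy partitioning shallow.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 1, 29], [3, 0, 22], [2, 1, 40], [3, 0, 18], [1, 0, 54], [2, 1, 8]]
y = [1, 0, 1, 0, 0, 1]  # Survived

cart = DecisionTreeClassifier(max_depth=3, random_state=0)
cart.fit(X, y)
print(cart.predict([[2, 1, 30]])[0])  # -> 1 (this toy data splits cleanly on Sex)
```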
20. #4 : Random Forest Model
o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
• Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
o https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
o https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
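A Random Forest sketch mirroring the getting-started links above, on the same toy (Pclass, Sex, Age) rows as the CART example; n_estimators is illustrative, and tuning it is one of the "different parameters" to table.

```python
# Random Forest: an ensemble of bootstrapped, de-correlated decision trees.
from sklearn.ensemble import RandomForestClassifier

X = [[1, 1, 29], [3, 0, 22], [2, 1, 40], [3, 0, 18], [1, 0, 54], [2, 1, 8]]
y = [1, 0, 1, 0, 0, 1]  # Survived

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.predict([[2, 1, 30]])[0])
```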
21. #5 : SVM
o Multiple Kernels
o kernel = ‘rbf’ #Radial Basis Function
o Kernel = ‘sigmoid’
o agconti's blog - Ultimate Titanic !
o http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
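An SVM sketch trying the two kernels named above (rbf and sigmoid) on the same toy rows; in the competition you would compare their cross-validated scores rather than a single prediction.

```python
# Fit an SVM per kernel and predict for one hypothetical passenger.
from sklearn.svm import SVC

X = [[1, 1, 29], [3, 0, 22], [2, 1, 40], [3, 0, 18], [1, 0, 54], [2, 1, 8]]
y = [1, 0, 1, 0, 0, 1]  # Survived

for kernel in ("rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[2, 1, 30]])[0])
```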
22. Feature Engineering - Homework
o Add attribute : Age
• In train 714/891 have age; in test 332/418 have age
• Missing values can be just Mean Age of all passengers
• We could be more precise and calculate Mean Age based on Title
(Ms,Mrs,Master et al)
• Box plot age
o Add attribute : Mother, Family size et al
o Feature engineering ideas
• http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
o Also Learning scikit-learn: Machine Learning in Python, By: Raúl Garreta; Guillermo
Moncecchi, Publisher: Packt Publishing has more ideas on feature selection et al
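The title-based Age imputation above can be sketched as: pull the title out of Name, then fill missing ages with the mean age for that title. Column names follow train.csv; the frame itself is a made-up stand-in.

```python
# Extract titles (Mr, Mrs, Master, ...) from Name, then impute missing Age
# with the per-title mean instead of the cruder overall mean.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Roe, Mr. Tom", "Doe, Mrs. Jane", "Poe, Master. Al"],
    "Age": [40.0, None, 35.0, 5.0],
})
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.")
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("mean"))
print(df["Age"].tolist())  # [40.0, 40.0, 35.0, 5.0] -- the missing Mr gets 40.0
```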
23. What does it mean ? Let us ponder ….
o We have a training data set representing a domain
• We reason over the dataset & develop a model to predict outcomes
o How good is our prediction when it comes to real life scenarios ?
o The assumption is that the dataset is taken at random
• Or Is it ? Is there a Sampling Bias ?
• i.i.d ? Independent ? Identically Distributed ?
• What about homoscedasticity ? Do they have the same finite variance ?
o Can we assure that another dataset (from the same domain) will give us the same
result ?
o Will our model & its parameters remain the same if we get another data set ?
o How can we evaluate our model ?
o How can we select the right parameters for a selected model ?
24. Algorithms for the
Amateur Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any
man who can hitch the length and breadth of the Galaxy, rough it … win through, and still
know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …
2:30
25. Ref: Anthony’s Kaggle Presentation
Data Scientists apply different techniques
• Support Vector Machine
• adaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forest
• Logistic Regression
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filter
• Evolutionary Fuzzy Modelling
• Neural Networks
Quora
• http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
26. Algorithm spectrum
Machine Learning
o Regression
o Logit
o CART
o Ensemble : Random Forest
o Clustering
o KNN
Cute Math
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
Artificial Intelligence
o NNet
o Boltzmann Machine
o Feature Learning
27. Classifying Classifiers
Statistical
o Regression
• Logistic Regression (1Max Entropy Classifier)
o Naïve Bayes
o Bayesian Networks
Structural
o Rule-based
• Production Rules
• Decision Trees
o Distance-based
• Functional: Linear, Spectral, Wavelet, SVM
• Nearest Neighbor: kNN, Learning Vector Quantization
o Neural Networks
• Multi-layer Perceptron
Ensemble
o Random Forests
o Boosting
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
30. Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer
Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond
it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
31. Data Science “folk knowledge” (2 of A)
o Over fitting has many faces
• Bias – Model not strong enough. So the learner has the tendency to learn the
same wrong things
• Variance – Learning too much from one dataset; model will fall apart (ie much
less accurate) on a different dataset
• Sampling Bias
o Intuition Fails in high Dimensions –Bellman
• Blessing of non-conformity & lower effective dimension; many applications
have examples not uniformly spread but concentrated near a lower dimensional
manifold eg. Space of digits is much smaller then the space of images
o Theoretical Guarantees are not What they seem
• One of the major developments of recent decades has been the realization that
we can have guarantees on the results of induction, particularly if we are
willing to settle for probabilistic guarantees.
o Feature engineering is the Key
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
32. Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm
• Or conversely select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not Just One
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• Just because a function can be represented does not mean
it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
33. Data Science “folk knowledge” (4 of A)
o The simplest hypothesis that fits the data is also the most
plausible
• Occam’s Razor
• Don’t go for a 4 layer Neural Network unless
you have that complex data
• But that doesn’t also mean that one should
choose the simplest hypothesis
• Match the impedance of the domain, data & the
algorithms
o Think of over fitting as memorizing as opposed to learning.
o Data leakage has many forms
o Sometimes the Absence of Something is Everything
o [Corollary] Absence of Evidence is not the Evidence of
Absence
New to Machine Learning? Avoid these three mistakes, James Faghmous
https://medium.com/about-data/73258b3848a4
§ Simple Model: high error line that cannot be compensated with more data; gets to a lower error rate with less data points
§ Complex Model: lower error line, but needs more data points to reach decent error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
34. Check your assumptions
o The decisions a model makes are directly related to its assumptions about the
statistical distribution of the underlying data
o For example, for regression one should check that:
① Variables are normally distributed
• Test for normality via visual inspection, skew & kurtosis, outlier inspections via
plots, z-scores et al
② There is a linear relationship between the dependent & independent
variables
• Inspect residual plots, try quadratic relationships, try log plots et al
③ Variables are measured without error
④ Assumption of Homoscedasticity
§ Homoscedasticity assumes constant or near constant error variance
§ Check the standard residual plots and look for heteroscedasticity
§ For example in the figure, left box has the errors scattered randomly around zero; while the
right two diagrams have the errors unevenly distributed
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test,
http://pareonline.net/getvn.asp?v=8&n=2
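Assumption ① above can be screened numerically as well as visually: sample skew and excess kurtosis, computed here directly from standardized moments, are both near 0 for roughly normal data. A minimal sketch, using a generated normal sample rather than real residuals:

```python
# Quick normality screen: skew = 3rd standardized moment,
# excess kurtosis = 4th standardized moment minus 3.
import numpy as np

def skew_and_excess_kurtosis(x):
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean()), float((z ** 4).mean() - 3.0)

rng = np.random.default_rng(0)
s, k = skew_and_excess_kurtosis(rng.normal(size=10_000))
print(round(s, 2), round(k, 2))  # both close to 0 for a normal sample
```

Large positive/negative values flag skewed or heavy-tailed variables worth transforming before regression.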
35. Data Science “folk knowledge” (5 of A)
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
A 2x2 of The World (Knowns/Unknowns) vs. You (Known/Unknown):
o Known Knowns – what we do; “there are things we know that we know”
o Known Unknowns – potential facts & outcomes we are aware of, but not with certainty; stochastic processes, probabilities; “that is to say, there are things that we now know we don't know”
o Unknown Knowns – others know, you don’t
o Unknown Unknowns – facts, outcomes or scenarios we have not encountered, nor considered; “black swans”, outliers, long tails of probability distributions; lack of experience, imagination; “there are things we do not know we don't know”
36. Data Science “folk knowledge” (6 of A) - Pipeline
Data Management: Collect – Store – Transform
o Collect: Access to multiple sources of data; Volume; Velocity; Streaming Data
o Store: Canonical form; Data catalog; Data Fabric across the organization; Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Transform: Metadata; Monitor counters & Metrics; Structured vs. Multi-structured
Data Science: Reason – Model – Deploy
o Reason: Flexible & Selectable Data Subsets & Attribute sets
o Model: Refine model with Extended Data subsets & Engineered Attribute sets; Validation run across a larger data set
o Deploy: Scalable Model Deployment; Big Data automation & purpose-built appliances (soft/hard); Manage SLAs & response times
Explore – Visualize – Recommend – Predict
o Explore: Dynamic Data Sets; 2-way key-value tagging of datasets; Extended attribute sets; Advanced Analytics
o Visualize: Performance; Scalability; Refresh Latency; In-memory Analytics; Advanced Visualization; Interactive Dashboards; Map Overlay; Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
37. Data Science “folk knowledge” (7 of A)
Volume, Velocity, Variety – “Data of unusual size” that can't be brute forced
Context, Connectedness, Intelligence, Interface, Inference
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
38. Data Science “folk knowledge” (8 of A)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
39. Data Science “folk knowledge” (9 of A)
① Common Sense (some features make more sense than others)
② Carefully read these forums to get a peek at other peoples’ mindset
③ Visualizations
④ Train a classifier (e.g. logistic regression) and look at the feature weights
⑤ Train a decision tree and visualize it
⑥ Cluster the data and look at what clusters you get out
⑦ Just look at the raw data
⑧ Train a simple classifier, see what mistakes it makes
⑨ Write a classifier using handwritten rules
⑩ Pick a fancy method that you want to apply (Deep Learning/Nnet)
-- Maarten Bosma
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
40. Data Science “folk knowledge” (A of A)
Lessons from Kaggle Winners
① Don’t over-fit
② All predictors are not needed
• All data rows are not needed, either
③ Tuning the algorithms will give different results
④ Reduce the dataset (Average, select transition data,…)
⑤ Test set & training set can differ
⑥ Iteratively explore & get your head around data
⑦ Don’t be afraid to submit simple solutions
⑧ Keep a tab & history of your submissions
41. The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
Data Scientist (noun): Person who is better at
statistics than any software engineer & better
at software engineering than any statistician
– Josh Wills (Cloudera)
Data Scientist (noun): Person who is worse at
statistics than any statistician & worse at
software engineering than any software
engineer – Will Cukierski (Kaggle)
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
42. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
43. For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class - Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
45. Bias/Variance (1 of 2)
o Model Complexity
• A complex model fits the training data better
• But then it overfits & doesn't perform as well on new data
o Bias vs. Variance
o Classical diagram from ESLII by Hastie, Tibshirani & Friedman
o Bias – Model learns the wrong things; not complex enough; small gap between training & prediction error; more data by itself won't help
o Variance – Different datasets give different error rates; overfitted model; larger error gap; more data could help
[Learning-curve figure: prediction error vs. training error]
Ref: Andrew Ng/Stanford, Yaser S./CalTech
46. Bias/Variance (2 of 2)
o High Bias
• Due to underfitting; need more features or a more complex model to improve
• Add more features
• Use a more sophisticated model (quadratic terms, complex equations, …)
• Decrease regularization
o High Variance
• Due to overfitting; need more data to improve
• Use fewer features
• Use more training samples
• Increase regularization
[Learning-curve figure: prediction error vs. training error for each case]
Ref: Strata 2013 Tutorial by Olivier Grisel
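The two regimes above can be seen by comparing training and validation error as model capacity grows. Below is a minimal sketch, not from the deck's notebooks; the dataset, the decision-tree model, and the depth values are illustrative assumptions.

```python
# Illustrative sketch: diagnose bias vs. variance by comparing training and
# validation error for models of increasing capacity.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 4, None):  # underfit -> reasonable -> overfit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - model.score(X_tr, y_tr)
    val_err = 1 - model.score(X_val, y_val)
    # High bias: both errors high, small gap -> more data alone won't help.
    # High variance: low training error, large gap -> more data could help.
    print(depth, round(train_err, 3), round(val_err, 3))
```

The unrestricted tree drives training error to zero while the validation error stays higher, which is exactly the "larger error gap" symptom of high variance.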
47. Data Partition & Cross-Validation
o Partition data!
• Training (60%)
• Validation (20%)
• "Vault" Test (20%) data sets
o k-fold Cross-Validation
• Split data into k equal parts
• Fit model to k-1 parts & calculate prediction error on the kth part
• Non-overlapping datasets
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
[Diagram: 5-fold CV — each of folds #1-#5 serves once as the validation set while the other four are used for training]
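The 5-fold scheme in the diagram can be sketched with scikit-learn's KFold; the toy data here is an illustrative assumption.

```python
# Illustrative sketch of the 5-fold scheme: k non-overlapping parts, each
# serving once as the validation set.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy dataset with 10 rows
kf = KFold(n_splits=5, shuffle=True, random_state=0)
held_out = []
for train_idx, val_idx in kf.split(X):
    # fit on the k-1 training parts, score on the k-th part (omitted here)
    held_out.extend(val_idx.tolist())

# every row is held out exactly once across the 5 folds
print(sorted(held_out))
```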
48. Bootstrap & Bagging
o Bootstrap
• Draw datasets (with replacement) and fit a model for each dataset
• Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement
o Bagging (Bootstrap aggregation)
• Average prediction over a collection of bootstrapped samples, thus reducing variance
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
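The with-replacement draw can be sketched in a few lines of NumPy (an illustrative sketch, not competition code):

```python
# Illustrative sketch: bootstrap draws n rows *with* replacement, so each
# sample repeats some rows and omits others.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)
samples = [rng.choice(data, size=len(data), replace=True) for _ in range(200)]

# fraction of distinct rows that appear in a sample: ~63.2% on average
frac_unique = np.mean([np.unique(s).size / data.size for s in samples])
print(round(float(frac_unique), 3))
```

Bagging then just averages the predictions of models fit to each such sample.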
49. Boosting
◦ "Output of weak classifiers into a powerful committee"
◦ Final prediction = weighted majority vote
◦ Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
– Bagging – independent trees
– Boosting – successively weighted
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
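A minimal AdaBoost sketch with scikit-learn; the dataset and settings are illustrative assumptions. The default weak learner is a depth-1 decision stump, and the ensemble combines the stumps by weighted majority vote as described above.

```python
# Illustrative sketch of AdaBoost: weak learners (depth-1 stumps by default)
# are trained in sequence, with misclassified points up-weighted each round.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
committee = AdaBoostClassifier(n_estimators=100, random_state=0)
committee.fit(X, y)
acc = committee.score(X, y)  # final prediction is a weighted majority vote
```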
50. Random Forests+
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d* random variables for splitting
◦ Simpler to train & tune
◦ "Do remarkably well, with very little tuning required" – ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
51. Ensemble Methods
◦ Two steps
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms,
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
52. Random Forests
o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two parameters – the no. of predictors (typically √k) & the no. of trees (500 for large datasets, 150 for smaller ones)
o Error prediction
• For each iteration, predict for dataset that is not in the sample (OOB data)
• Aggregate OOB predictions
• Calculate Prediction Error for the aggregate, which is basically the OOB
estimate of error rate
• Can use this to search for optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
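The OOB error estimate described above is exposed directly by scikit-learn. A minimal sketch, with the dataset and parameter choices as illustrative assumptions (`max_features="sqrt"` mirrors the √k rule of thumb, `n_estimators=150` the "smaller dataset" suggestion):

```python
# Illustrative sketch: each tree is scored on the rows left out of its
# bootstrap sample, giving a built-in validation estimate for free.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
rf = RandomForestClassifier(n_estimators=150,    # "150 for smaller" datasets
                            max_features="sqrt", # the sqrt(k) rule of thumb
                            oob_score=True, random_state=0)
rf.fit(X, y)
oob_error = 1 - rf.oob_score_  # OOB estimate of the error rate
```

Re-fitting with different `max_features` values and comparing `oob_error` is one way to search for the optimal number of predictors, as the slide suggests.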
55. Cross Validation
o Reference:
• https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
• Chris Clark's blog: http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
• Predictive Modelling in Python with scikit-learn, Olivier Grisel, Strata 2013
• titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
Refer to iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear
56. Model Evaluation - Accuracy
o Accuracy = (tp + tn) / (tp + fp + fn + tn)
o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate!
o Hence the F-measure is a better reflection of the model strength

              Predicted=1    Predicted=0
Actual=1      True+ (tp)     False- (fn)
Actual=0      False+ (fp)    True- (tn)
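The degenerate-classifier point can be made concrete in a couple of lines (the counts below are made up for illustration):

```python
# Accuracy from the confusion-matrix cells, and why it can mislead.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# 990 negatives, 10 positives; a degenerate "always predict 0" classifier:
# tp=0, fp=0, fn=10, tn=990
print(accuracy(0, 0, 10, 990))  # 0.99 -- very "accurate", yet useless
```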
57. Model Evaluation – Precision & Recall
o Precision = How many of the items we identified are relevant
o Recall = How many of the relevant items did we identify
o Inverse relationship – the tradeoff depends on the situation
• Legal – coverage is more important than correctness
• Search – accuracy is more important
• Fraud
• Support cost (high fp) vs. wrath of the credit card co. (high fn)
o Precision = tp / (tp + fp) • a.k.a. accuracy, relevancy
o Recall = tp / (tp + fn) • a.k.a. true +ve rate, coverage, sensitivity, hit rate
o fp rate = fp / (fp + tn) • a.k.a. Type 1 error rate, false +ve rate, false alarm rate
• Specificity = 1 – fp rate
• Type 1 error = fp; Type 2 error = fn
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
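The three rates above, as small functions of the confusion-matrix cells (the example counts are made up):

```python
# Precision, recall and fp rate from the confusion-matrix cells.
def precision(tp, fp):   # how many identified items are relevant
    return tp / (tp + fp)

def recall(tp, fn):      # how many relevant items were identified (true +ve rate)
    return tp / (tp + fn)

def fp_rate(fp, tn):     # false alarm rate; specificity = 1 - fp_rate
    return fp / (fp + tn)

# made-up counts: tp=80, fp=20, fn=40, tn=860
print(precision(80, 20))  # 0.8
print(recall(80, 40))
print(fp_rate(20, 860))
```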
59. Model Evaluation : F-Measure
o Precision = tp / (tp+fp) ; Recall = tp / (tp+fn)
o F-Measure: balanced, combined, weighted harmonic mean; measures effectiveness
• 1/F = α (1/P) + (1 – α) (1/R), i.e. F = (β² + 1) P R / (β² P + R)
• Common form (balanced F1): β = 1 (α = ½); F1 = 2PR / (P + R)
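The same formula in code (the P and R values are illustrative):

```python
# The general F-measure: F = (beta^2 + 1) * P * R / (beta^2 * P + R);
# beta = 1 gives the balanced F1 = 2PR / (P + R).
def f_measure(p, r, beta=1.0):
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

p, r = 0.8, 0.5
f1 = f_measure(p, r)            # = 2*0.8*0.5 / (0.8 + 0.5)
f2 = f_measure(p, r, beta=2.0)  # weights recall more; recall is lower here, so F2 < F1
```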
60. Hands-on Walkthru - Model Evaluation
o Train/Test split: 891 rows → 712 (80%) train, 179 test
Refer to iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear
61. ROC Analysis
o “How good is my model?”
o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing,
organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate
o After understanding the ROC graph, we will draw a few for our models in iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear
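A minimal sketch of computing the tp-rate/fp-rate pairs with scikit-learn; the model and data here are illustrative assumptions, not the deck's notebook.

```python
# Illustrative sketch: ROC points (fp rate, tp rate) and the AUC summary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)  # 0.5 = chance diagonal, 1.0 = ideal
# plt.plot(fpr, tpr) would draw the curve
```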
62. ROC Graph - Discussion
o E = Conservative, everything NO
o H = Liberal, everything YES
o Am not making any political statement!
o F = Ideal
o G = Worst
o The diagonal is the chance line
o The north-west corner is good
o The south-east is bad
• For example E
• Believe it or not - I have actually seen a graph with the curve in this region!
[ROC-plane figure with example points E, F, G, H]
63. ROC Graph – Clinical Example
IFCC: Measures of diagnostic accuracy: basic definitions
64. ROC Graph Walk thru
o iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear
65. The Art of a Competition – Session I :
Data Science London + Scikit-learn
3:40
66. A few interesting links - troll the forums
o http://www.kaggle.com/c/data-science-london-scikit-learn/visualization/1113
• Will’s Solution
o Quick First prediction: http://www.kaggle.com/c/data-science-london-scikit-learn/visualization/1075
o CV http://www.kaggle.com/c/data-science-london-scikit-learn/visualization/1183
o In-depth Solution: http://www.kaggle.com/c/data-science-london-scikit-learn/forums/t/6528/solution-discussion
o Video: http://datasciencelondon.org/machine-learning-python-scikit-learn-ipython-dsldn-data-science-london-kaggle/
o More in http://www.kaggle.com/c/data-science-london-scikit-learn/visualization
67. #1 : SVM-Will
Refer to iPython notebook <3-Session-I> at https://github.com/xsankar/freezing-bear
69. The Art of a Competition -
Session II : ASUS, PAKDD2014
The 18th Pacific-Asia Conference on Knowledge Discovery & Data Mining
4:10
70. Data Organization & Approach
o Deceptively Simple Data
• Sales
• Repairs
• Predict Future Repairs
o Explore data
• Module : M1-M9
• Component : P1-P31
o Repair before sales
o -ve sales
o Reason for exploring this competition
is to get a feel for a complex dataset
in terms of processing
[Diagram: Sales + Repairs → Future Repairs?]
71. Approach, Ideas & Results
o Discussions – troll the forums
• http://www.kaggle.com/c/pakdd-cup-2014/forums/t/7573/what-did-you-do-to-get-to-the-top-of-the-board
• Ran Locar, James King, https://github.com/aparij/kaggle_asus/blob/master/lin_comb_survival.py, Brandon Kam
• http://www.kaggle.com/c/pakdd-cup-2014/forums/t/6980/sample-submission-benchmark-in-leaderboard
• Beat_benchmark by Chitrasen (beat_benchmark.py)
o Lots of interesting models & ideas
o Simple linear regression
o Quadratic Time ?
o Batch Mortality Rate
o Breaks with higher repair tendency ? Survival Analysis ?
Refer to iPython notebook <4-Session-II> at https://github.com/xsankar/freezing-bear
72. #1 – All 0s, All 1s
o All 0s
o All 1s
Refer to iPython notebook <4-Session-II> at https://github.com/xsankar/freezing-bear
73. #2 :
o Let us get serious & do some transformation
o Split
o Convert to int64
Refer to iPython notebook <4-Session-II> at https://github.com/xsankar/freezing-bear
74. #3 : Hints To Try
o Decay
o Kaplan Meier Analysis
o Cox Model
o Parametric Survival Models
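For the Kaplan-Meier hint above, here is a hypothetical sketch of the estimator itself (an illustration, not the competition code): S(t) is the product over event times t_i ≤ t of (1 − d_i/n_i), where d_i units fail at t_i out of n_i still at risk; the `kaplan_meier` helper and the toy repair data are assumptions.

```python
# Hypothetical sketch of the Kaplan-Meier survival estimator:
# S(t) = product over event times t_i <= t of (1 - d_i / n_i).
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one failure."""
    at_risk = len(durations)
    curve, s = [], 1.0
    for t in sorted(set(durations)):
        failed = sum(1 for d, o in zip(durations, observed) if d == t and o)
        censored = sum(1 for d, o in zip(durations, observed) if d == t and not o)
        if failed:
            s *= 1 - failed / at_risk
            curve.append((t, s))
        at_risk -= failed + censored  # both leave the risk set after time t
    return curve

# four failures at months 1, 2, 2, 3; one unit censored (still working) at month 4
print(kaplan_meier([1, 2, 2, 3, 4], [True, True, True, True, False]))
```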
76. The Beginning As the End
— We started with a set of goals
— Homework
◦ For you
– Go through the slides
– Do the walkthrough
– Work on more competitions
– Submit entries to Kaggle
77. References:
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf