How to Win Machine Learning Competitions? (HackerEarth)
This presentation was given by Marios Michailidis (a.k.a. Kazanova), currently ranked #3 on Kaggle, to help the community learn machine learning better. It comprises useful ML tips and techniques for performing better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
Winning data science competitions, presented by Owen Zhang (Vivian S. Zhang)
Meetup event hosted by NYC Open Data Meetup and NYC Data Science Academy. Speaker: Owen Zhang. Event info: http://www.meetup.com/NYC-Open-Data/events/219370251/
Feature Engineering for ML - Dmitry Larko, H2O.ai (Sri Ambati)
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/wcFdmQSX6hM
Description:
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
Speaker's Bio:
Dmitry has more than 10 years of experience in IT, starting with data warehousing and BI and moving on to big data and data science. He has extensive experience developing predictive analytics software for different domains and tasks. He is also a Kaggle Grandmaster who loves to apply his machine learning and data science skills in Kaggle competitions.
Feature Engineering - Getting most out of data for predictive models (Gabriel Moreira)
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity and accuracy of the models. We will cover the analysis of feature distributions and their correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation and binning), categorical attributes (such as one-hot encoding and feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Python, Scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
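Three of the transformations listed above can be sketched in a few lines of plain Python (an illustration of the ideas, not the talk's actual Scikit-learn code; function names are invented):

```python
def min_max_scale(values):
    """Rescale a numeric feature to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0          # guard against a constant feature
    return [(v - lo) / span for v in values]

def bin_feature(values, edges):
    """Map each value to the index of the first bin edge it falls below."""
    def bin_of(v):
        for i, edge in enumerate(edges):
            if v < edge:
                return i
        return len(edges)
    return [bin_of(v) for v in values]

def one_hot(categories):
    """Encode a categorical feature as 0/1 indicator columns."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [22, 35, 58]
print(min_max_scale(ages))                 # first maps to 0.0, last to 1.0
print(bin_feature(ages, edges=[30, 50]))   # one bin index per value
print(one_hot(["red", "blue", "red"]))     # columns ordered by sorted level
```

Libraries like Scikit-learn provide the production versions of these (`MinMaxScaler`, `KBinsDiscretizer`, `OneHotEncoder`); the sketch just makes the mechanics visible.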
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt (Eugene Yan Ziyou)
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Feature Engineering - Getting most out of data for predictive models - TDC 2017 (Gabriel Moreira)
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity and accuracy of the models. We will cover the analysis of feature distributions and their correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation and binning), categorical attributes (such as one-hot encoding and feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Python, Scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
The Text Classification slides contain research results on possible natural language processing algorithms. Specifically, they contain a brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn from and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
This was presented at the London Artificial Intelligence & Deep Learning Meetup.
https://www.meetup.com/London-Artificial-Intelligence-Deep-Learning/events/245251725/
Enjoy the recording: https://youtu.be/CY3t11vuuOM.
- - -
Kasia discussed complexities of interpreting black-box algorithms and how these may affect some industries. She presented the most popular methods of interpreting Machine Learning classifiers, for example, feature importance or partial dependence plots and Bayesian networks. Finally, she introduced Local Interpretable Model-Agnostic Explanations (LIME) framework for explaining predictions of black-box learners – including text- and image-based models - using breast cancer data as a specific case scenario.
Kasia Kulma is a Data Scientist at Aviva with a soft spot for R. She obtained a PhD in evolutionary biology (Uppsala University, Sweden) in 2013 and has been working on all things data ever since. For example, she has built recommender systems, customer segmentations and predictive models, and she is now leading an NLP project at the UK's leading insurer. In her spare time she tries to relax by hiking & camping, but if that doesn't work ;) she co-organizes R-Ladies meetups and writes a data science blog, R-tastic (https://kkulma.github.io/).
https://www.linkedin.com/in/kasia-kulma-phd-7695b923/
This talk introduces the theory behind Bayesian Deep Learning, a hot topic recently, along with its recent applications. It briefly explains the theory of Bayesian inference and then covers the theory and applications of Yarin Gal's Monte Carlo Dropout.
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meetup and learn how easily you can use R for advanced machine learning. In this meetup, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will do a remote session with us through Google Hangouts.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com today.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
I looked things up to understand XGBoost and wrote down my remaining questions separately, but I still don't understand the detailed math.
What is XGBoost, so popular on Kaggle these days?
Is it boosting, one of the ensemble techniques?
What is the difference between bagging and boosting, the two types of ensembles?
Why is an ensemble a low-bias, high-variance model?
What is the relationship between bias and variance?
What boosting techniques are there?
What is the CART algorithm used by XGBoost?
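The boosting questions above can be made concrete with a toy gradient-boosting loop in pure Python (an illustrative sketch of the additive-training idea, not how the XGBoost library is actually implemented): each round fits a one-split decision stump to the residuals of the ensemble so far, which is what distinguishes boosting (sequential, error-correcting) from bagging (parallel, averaging).

```python
def fit_stump(x, residuals):
    """Find the split on x minimizing squared error of two leaf means."""
    best = None
    for split in x:
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    _, split, lm, rm = best
    return lambda v: lm if v <= split else rm

def boost(x, y, rounds=20, lr=0.3):
    """Additive model: each stump fits the current residuals (shrunk by lr)."""
    stumps, pred = [], [0.0] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda v: sum(lr * s(v) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
model = boost(x, y)
print(model(2), model(5))   # predictions approach the training targets
```

XGBoost adds regularization, second-order gradients, and clever tree-building on top of this loop, but the sequential residual-fitting core is the same.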
Unified Approach to Interpret Machine Learning Model: SHAP + LIME (Databricks)
For companies that solve real-world problems and generate revenue from data science products, being able to understand why a model makes a certain prediction can be as crucial as achieving high prediction accuracy. However, as data scientists pursue higher accuracy by implementing complex algorithms such as ensembles or deep learning models, the algorithm itself becomes a black box, creating a trade-off between the accuracy and the interpretability of a model's output.
To address this problem, a unified framework, SHAP (SHapley Additive exPlanations), was developed to help users interpret the predictions of complex models. In this session, we will talk about how to apply SHAP to various modeling approaches (GLM, XGBoost, CNN) to explain how each feature contributes and extract intuitive insights from a particular prediction. This talk is intended to introduce the concept of a general-purpose model explainer, as well as help practitioners understand SHAP and its applications.
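The additivity property that SHAP guarantees (base value plus per-feature contributions equals the prediction) can be illustrated with a plain linear model, where the Shapley value of feature i has a closed form: w_i * (x_i - mean_i). The weights and data below are made up for the demo; real SHAP handles nonlinear models too.

```python
weights = [2.0, -1.0, 0.5]
background = [[1.0, 2.0, 0.0],
              [3.0, 0.0, 4.0]]   # reference dataset for the "expected" input
x = [3.0, 0.0, 4.0]              # instance to explain

means = [sum(col) / len(col) for col in zip(*background)]
base_value = sum(w * m for w, m in zip(weights, means))           # E[f(X)]
contributions = [w * (xi - m) for w, xi, m in zip(weights, x, means)]
prediction = sum(w * xi for w, xi in zip(weights, x))

# Additivity: base value + sum of contributions reconstructs the prediction.
print(base_value + sum(contributions))
print(prediction)
```

Each contribution answers "how much did this feature move the prediction away from the average?", which is exactly the kind of intuitive per-feature insight the talk describes.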
Kaggle is a community of almost 400K data scientists who have built almost 2MM machine learning models to participate in our competitions. Data scientists come to Kaggle to learn, collaborate and develop the state of the art in machine learning. This talk will cover some of the lessons we have learned from the Kaggle community.
HackerEarth helping a startup hire developers - The Practo Case Study (HackerEarth)
A startup's hiring requirements are probably the hardest to satisfy. Find out how Practo, a healthcare startup based out of India, filled its tech hiring requirements in record time using HackerEarth.
Intra-company hackathons using HackerEarth (HackerEarth)
How to conduct an internal hackathon within your company to engage developers, find the best developers in your company, and understand its technical climate.
The workshop will present how to combine tools to quickly query, transform and model data using command line tools. The goal is to show that command line tools are efficient at handling reasonable sizes of data and can accelerate the data science process. We will show that in many instances, command line processing ends up being much faster than 'big data' solutions. The content of the workshop is derived from the book of the same name (http://datascienceatthecommandline.com/). In addition, we will cover vowpal-wabbit (https://github.com/JohnLangford/vowpal_wabbit) as a versatile command line tool for modeling large datasets.
Ethics in Data Science and Machine Learning (HJ van Veen)
Introduction and overview on ethics in data science and machine learning, variations and examples of algorithmic bias, and a call-to-action for self-regulation. Given by Thierry Silbermann as part of the Sao Paulo Machine Learning Meetup, theme: "Ethics".
https://www.linkedin.com/in/thierrysilbermann
https://twitter.com/silbermannt
https://github.com/thierry-silbermann
Doing your first Kaggle (Python for Big Data sets) (Domino Data Lab)
You love Python. You love data science. But the size of your data set keeps crashing your code. Is it time to bring in big data tools, or simply to code smarter? Lee is going to show you efficiency hacks, drawn from top Kaggle competitors, to get Python to work on large data sets. Skip the hassle of creating a big data infrastructure. Let's find out how far we can push our home laptop first.
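One family of hacks in this spirit (an illustration, not Lee's actual material) is streaming: read the file through a lazy iterator and keep only a running aggregate, so memory stays constant no matter how large the file grows. The file and column names below are invented for the demo.

```python
import csv
import os
import tempfile

def running_mean(rows, column):
    """Consume a lazy iterator of dict rows; O(1) memory regardless of size."""
    total = count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count

# Build a small stand-in for a "big" CSV.
path = os.path.join(tempfile.gettempdir(), "toy_big.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "value"])
    w.writerows([i, i * 0.5] for i in range(100_000))

with open(path, newline="") as f:
    reader = csv.DictReader(f)       # yields one row at a time
    mean_val = running_mean(reader, "value")
print(mean_val)
```

The same pattern scales to pandas via chunked reads (`read_csv(..., chunksize=...)`): aggregate each chunk, never hold the whole table.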
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick... (Spark Summit)
Feature hashing is a powerful technique for handling high-dimensional features in machine learning. It is fast, simple, memory-efficient, and well suited to online learning scenarios. While an approximation, it has surprisingly low accuracy tradeoffs in many machine learning problems.
Feature hashing has been made somewhat popular by libraries such as Vowpal Wabbit and scikit-learn. In Spark MLlib, it is mostly used for text features, however its use cases extend more broadly. Many Spark users are not familiar with the ways in which feature hashing might be applied to their problems.
In this talk, I will cover the basics of feature hashing, and how to use it for all feature types in machine learning. I will also introduce a more flexible and powerful feature hashing transformer for use within Spark ML pipelines. Finally, I will explore the performance and scalability tradeoffs of feature hashing on various datasets.
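The basics of the hashing trick can be sketched in a few lines (an illustration of the idea, not the Spark ML, Vowpal Wabbit, or scikit-learn implementation): hash each feature name into one of a fixed number of buckets, with a sign bit so that collisions tend to cancel rather than accumulate.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map arbitrary feature names into a fixed-size signed-count vector."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_buckets                         # bucket index
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0   # sign bit reduces collision bias
        vec[idx] += sign
    return vec

v1 = hash_features(["the", "cat", "sat"])
v2 = hash_features(["the", "cat", "sat"])
print(v1 == v2)                                  # deterministic: no vocabulary needed
print(len(hash_features(["completely", "new", "words"])))  # always n_buckets
```

This is why the technique suits online learning: unseen feature names need no dictionary update, and the vector size is fixed up front.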
Fairly Measuring Fairness In Machine Learning (HJ van Veen)
We look at a case and two research papers on measuring discrimination in machine learning models for extending credit. Presentation given as part of the Sao Paulo Machine Learning Meetup, theme "Ethics in Data Science".
Nick Day, Managing Director at JGA Recruitment Payroll, collaborated with HackerEarth and discussed actionable tips for recruiting and retaining the best candidates in your talent pipeline.
Most of analytics modeling work today focuses on the production of single-purpose "artisanal" models for predictions. This approach to analytics is fragile with respect to model consistency, reorganization, and resource availability. This talk will argue that instead the focus of analytics modeling should be toward the production of analytics interchangeable parts, which can be combined in creative ways to produce a wide variety of analytics results. This "nuts and bolts" approach allows analytics groups to produce results in an agile way where the time between ask and answer is determined by the right combination of analytics, rather than the modeling.
by Szilard Pafka
Chief Scientist at Epoch
Szilard studied Physics in the 90s in Budapest and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He then worked in finance quantifying and managing market risk. A decade ago he moved to California to become the Chief Scientist of a credit card processing company, doing what is now called data science (data munging, analysis, modeling, visualization, machine learning, etc.). He is the founder/organizer of several data science meetups in Santa Monica, and he is also a visiting professor at CEU in Budapest, where he teaches data science in the Masters in Business Analytics program.
While extracting business value from data has been practiced for decades, the last several years have seen an unprecedented amount of hype in this field. This hype has created not only unrealistic expectations of results, but also glamour around the newest tools, supposedly capable of extraordinary feats. In this talk I will apply the much-needed methods of critical thinking and quantitative measurement (which data scientists are supposed to use daily in solving problems for their companies) to assess the capabilities of the most widely used software tools for data science. I will discuss two such analyses in detail: one concerning the size of datasets used for analytics, and the other regarding the performance of machine learning software used for supervised learning.
In this presentation, I talk about data science competitions. After an introduction to data science competitions, I go through their benefits, common misconceptions, and best practices.
How hackathons can drive top line revenue growth (HackerEarth)
Innovation management overview
What is a hackathon?
Why hackathons?
Role of Hackathon in enterprise innovation
Leveraging hackathon-based innovation campaign for growth
Keys to conducting a successful hackathon
Vowpal Wabbit is both an open-source machine learning toolkit and an active research platform. In this talk I introduce Vowpal Wabbit, discuss some of the design decisions, and the types of problems for which VW is (or is not) a good fit. The talk includes live demonstrations of some of the latest features for recommendation, contextual bandit, and structured prediction problems.
Introduction to machine learning: basics and overview of machine learning, linear regression, logistic regression, cost functions, gradient descent, sensitivity and specificity, and model selection.
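Several of those topics fit in one tiny example: gradient descent on the mean-squared-error cost for one-variable linear regression (a minimal sketch with invented toy data, not material from the slides).

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1 plus noise

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # Gradients of the cost J = (1/n) * sum((w*x + b - y)^2)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * dw            # step downhill in each parameter
    b -= lr * db

print(w, b)                 # approaches the least-squares fit of the data
```

The learning rate `lr` trades off speed against stability: too large and the updates diverge, too small and convergence crawls.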
AI at scale requires a perfect storm of data, algorithms and cloud infrastructure. Modern deep learning requires large amounts of training data. We develop methods that improve data collection, aggregation and augmentation. This involves active learning, partial feedback, crowdsourcing, and generative models. We analyze large-scale machine learning methods for distributed training. We show that gradient quantization can yield best of both the worlds: accuracy and communication efficiency. We extend matrix methods to higher dimensions using tensor algebraic techniques and show superior performance. Finally, at AWS, we are developing robust software frameworks and AI cloud services at all levels of the stack.
The ten commandments of Apex are the fundamental rules that good Force.com developers always use when developing. Join us to learn the common pitfalls that often cause difficulties for new developers to the platform so you can architect and develop top-notch Force.com applications.
Heuristic design of experiments w/ meta gradient search (Greg Makowski)
Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* How to strategically search over the model training parameters - use a gradient descent approach
* One way to describe an arbitrarily complex predictive system is by using sensitivity analysis
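The bullet points above can be sketched as a greedy, gradient-descent-style search over training parameters, with every trial recorded in a "model notebook" (a toy illustration with an invented scoring function, not the talk's actual procedure):

```python
def tune(score_fn, params, step_sizes, rounds=10):
    """Nudge one hyperparameter at a time in whichever direction improves
    the validation score; log every trial in a notebook list."""
    best = dict(params)
    best_score = score_fn(best)
    notebook = [(dict(best), best_score)]
    for _ in range(rounds):
        improved = False
        for name, step in step_sizes.items():
            for direction in (1, -1):
                trial = dict(best)
                trial[name] += direction * step
                s = score_fn(trial)
                notebook.append((dict(trial), s))
                if s > best_score:
                    best, best_score, improved = trial, s, True
        if not improved:          # local optimum for these step sizes
            break
    return best, best_score, notebook

# Toy "validation score" with a known optimum at depth=6, lr=0.1.
def fake_score(p):
    return 1.0 - (p["depth"] - 6) ** 2 * 0.01 - (p["lr"] - 0.1) ** 2

best, score, notebook = tune(fake_score,
                             {"depth": 3, "lr": 0.5},
                             {"depth": 1, "lr": 0.1})
print(best, score)
```

In practice `score_fn` would train a model and return a metric that combines accuracy and generalization, as the slides suggest, and the notebook is what lets you reason about the search afterwards.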
"A Framework for Developing Trading Models Based on Machine Learning" by Kris...Quantopian
Presented at QuantCon Singapore 2016, Quantopian's quantitative finance and algorithmic trading conference, November 11th.
Machine learning is improving facets of our lives as diverse as health screening, transportation and even our entertainment choices. It stands to reason that machine learning can also improve trading performance, however the practical application is fraught with pitfalls and obstacles that nullify the benefits and present a high barrier to entry. Building on background information and introductory material, Kris will propose a framework for efficient and robust experimentation with machine learning methods for algorithmic trading. The framework's objective is to arrive at parsimonious models whose positive past performance is unlikely to be due to chance. The framework is demonstrated via practical examples of various machine learning models for algorithmic trading.
The Power of Auto ML and How Does It Work (Ivo Andreev)
Automated ML is an approach to minimize the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics or programming. The mechanism works by letting end users simply provide data, with the system automatically doing the rest by determining the approach to a particular ML task. At first this may sound discouraging to those aiming at the "sexiest job of the 21st century" - the data scientists. However, Auto ML should be considered a democratization of ML, rather than automatic data science.
In this session we will talk about how Auto ML works, how is it implemented by Microsoft and how it could improve the productivity of even professional data scientists.
Making Machine Learning Work in Practice - StampedeCon 2014 (StampedeCon)
At StampedeCon 2014, Kilian Q. Weinberger (Washington University) presented "Making Machine Learning work in Practice."
Here, Kilian will go over common pitfalls and tricks on how to make machine learning work.
This presentation covers NLP classifiers and different types of models: TF-IDF, word2vec, and deep learning models such as feed-forward NNs, CNNs and Siamese networks. Details on important metrics such as precision, recall and AUC are also given.
Machine learning for IoT - unpacking the black box (Ivo Andreev)
Have you ever considered machine learning a black box? It can seem like a kind of magic. Although it is one among many available solutions, Azure ML has proved to be a great balance between flexibility, usability and affordable price. But how does Azure ML compare with the other ML providers? How do you choose the appropriate algorithm? Do you understand the key performance indicators and how to improve the quality of your models? The session is about understanding the black box and using it for IoT workloads and beyond.
Mariia Havrylovych, "Active learning and weak supervision in NLP projects" (Fwdays)
Successful artificial intelligence solutions always require a massive amount of high-quality labeled data. In most cases, we don't have a large, high-quality labeled set from the start. Weak supervision and active learning tools may help you optimize the labeling process and address the shortage of data labels.
First, we will review how active learning can significantly reduce the amount of labeled data for training with classic approaches. We will show how active learning methods can be customized for a specific (NLP) task by using text embedding.
With weak supervision, we will see how simple rules can automatically produce a big training dataset and high model performance without any manual labeling.
In the end, we will combine active learning and weak supervision by taking advantage of both techniques and achieving the best metrics.
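The classic active-learning loop described above, uncertainty sampling, can be sketched in pure Python (a toy illustration with an invented 1-D threshold classifier, not the speaker's actual setup): repeatedly ask the oracle to label whichever pool example the current model is least sure about.

```python
def fit_threshold(labeled):
    """Pick the threshold that best separates labeled (x, y) pairs."""
    candidates = sorted(x for x, _ in labeled)
    def accuracy(t):
        return sum((x > t) == y for x, y in labeled) / len(labeled)
    return max(candidates, key=accuracy)

def uncertainty(x, threshold):
    return -abs(x - threshold)       # closest to the boundary = most uncertain

pool = [0.1, 0.2, 0.4, 0.45, 0.55, 0.6, 0.8, 0.9]
oracle = lambda x: x > 0.5           # ground-truth labeler (the "human")

labeled = [(0.1, False), (0.9, True)]  # start with two seed labels
for _ in range(4):
    t = fit_threshold(labeled)
    unlabeled = [x for x in pool if x not in [p for p, _ in labeled]]
    query = max(unlabeled, key=lambda x: uncertainty(x, t))
    labeled.append((query, oracle(query)))  # ask the oracle only when needed

t = fit_threshold(labeled)
print(t, sorted(x for x, _ in labeled))
```

Note how the queries cluster near the decision boundary: the loop spends its labeling budget exactly where the model is confused, which is why active learning can cut labeling cost so sharply.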
Similar to Tips and tricks to win kaggle data science competitions (20)
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
3. • Real-time standings – the competitive spirit of a game
• Cooperate with anyone to learn from each other
• Improve or learn new skills from shared scripts in forums
• Overall profile rankings based on past competition performance – added value to data science profiles
• Friendly and helpful community
• And of course nice prizes for top finishers
Kaggle – What’s so fun about it?
4. 1. Build model using training samples with known labels
2. Make predictions on test samples
3. Submit model predictions on “unseen” data and get a quick response via public leaderboard standings
4. Final standings based on private test predictions
TRAINING SET | TEST SET – PUBLIC | TEST SET – PRIVATE
How does it all work?
5. TRAINING SET | TEST SET
1. Make random training set splits (CV1, CV2, CV3)
2. Run models on all-but-one subsamples:
CV1 + CV2 => validate CV3
CV1 + CV3 => validate CV2
CV2 + CV3 => validate CV1
3. Infer information from CV models
Cross-validation – the basics
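The fold scheme above can be sketched in a few lines; this is a minimal stdlib-only illustration (real work would use a library splitter), with made-up sizes:

```python
import random

def kfold_indices(n, k=3, seed=42):
    """Randomly split row indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Each fold takes a turn as the validation set; the rest is used for training.
folds = kfold_indices(9, k=3)
for i, valid in enumerate(folds):
    train = [j for f_num, f in enumerate(folds) if f_num != i for j in f]
    print(f"CV{i + 1}: train on {sorted(train)}, validate on {sorted(valid)}")
```

Every row appears in exactly one validation fold, so the out-of-fold predictions cover the whole training set.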
6. Two options:
#1 – train one model on the full training set (CV1+CV2+CV3) and predict the test set
#2 – average the fold models’ test predictions: mean(p(CV1+CV2), p(CV1+CV3), p(CV2+CV3))
NO GOLDEN RULE WHICH IS BETTER!
#1 better on smaller datasets, simpler code
#2 preferred on larger datasets, faster iterations
#2 > #1 for many top kagglers – but situational
Using CV results
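Option #2 (the mean of the fold models’ test predictions) can be sketched as follows; the prediction vectors are made-up placeholders for real model output:

```python
from statistics import mean

# Hypothetical test-set predictions from the three fold models
# (each model was trained on two of the three CV folds).
p_cv12 = [0.2, 0.8, 0.6]   # model trained on CV1 + CV2
p_cv13 = [0.3, 0.7, 0.5]   # model trained on CV1 + CV3
p_cv23 = [0.1, 0.9, 0.7]   # model trained on CV2 + CV3

# Final test prediction = mean over the fold models, row by row.
final = [mean(ps) for ps in zip(p_cv12, p_cv13, p_cv23)]
print(final)
```

No refit on the full training set is needed, which is why this option iterates faster on large data.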
7. TIP #1 – DO NOT OVERFIT PUBLIC TEST SET
Random forest example – tuning the number of trees with no proper validation:
Model              Public test score   Private test score
Number trees 200   0.812               0.815
Number trees 300   0.814               0.819
Number trees 400   0.816               0.820
Number trees 500   0.814               0.821
Number trees 450   0.817               0.819
Number trees 430   0.818               0.816
… many attempts later …
Number trees 437   0.820               0.814
437 gives the maximum public LB score – seems good, right? Often NO!
#500 < #400 => test if #450 is good? => public LB score confirms it! => spurious results
Public leaderboard – don’t be fooled
Rule of thumb: the fewer submissions you make, the less likely you are to have overfitted to the LB!
8. [Diagrams: four ways the TRAIN / PUBLIC TEST / PRIVATE TEST sets can be split over time, labeled NO, MAYBE, MUST and YES – whether the public leaderboard is trustworthy depends on how the public/private split was made]
Should I even trust the public leaderboard?
9. TIP #2 – TRUST YOUR CV
Typical question on smaller datasets:
“I’m doing proper cross-validation and see improvements on my CV score, but the public leaderboard is so random and does not correlate at all!”
What to trust:
1) CV?
2) Public LB score?
Both seem valid choices?!
1) is better 90%+ of the time (the other 10% is due to luck)
CV or LB?
10. TIP #2 – TRUST YOUR CV
The same question, with a third option:
3) X*CV + (1-X)*LB – typically X=0.5 is ok
Top kagglers’ pick most of the time: 1) or 3)
CV or LB?
12. Santander competition:
• “Small” data – many fast modelling iterations for everyone
• Many people over-tuned parameters to fit the public leaderboard in order to maximize their standings
• CV and public LB scores had a scissors-like relationship
• Those who trusted CV-based submissions had much lower ranking variation in the final standings
• Teams who discarded their best public LB submissions stayed in the top (my team included)
• Trusting CV was a hard thing to do – we repeated it 20 times
Trusting CV – the ultimate example
13. TIP #3 – BAGGING IS IMPORTANT
A model may suffer from the initial parameters of your classifier (it may overestimate or underestimate your potential score).
[Diagram: averaging the same model trained with Seed1, Seed2 and Seed3 => the same model, but with less variance!]
Bagging ~= averaging similar models. Bias-variance tradeoff:
• reduced variance penalty of a classifier
• bias is not very important – competitors suffer from bias too (to a similar degree)
Bootstrap aggregating – in short: bagging
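Seed averaging can be illustrated with a toy stand-in for model training; the noise model below is made up purely to show the variance reduction:

```python
import random
from statistics import mean

def noisy_model_prediction(seed):
    """Stand-in for training the same model with a different random seed:
    the 'true' score is 0.5, but each seed adds initialization noise."""
    rng = random.Random(seed)
    return 0.5 + rng.uniform(-0.1, 0.1)

single = noisy_model_prediction(seed=1)
bagged = mean(noisy_model_prediction(seed=s) for s in range(50))

# Averaging over seeds keeps the model's bias but shrinks its variance,
# so the bagged estimate sits much closer to the true 0.5.
print(f"single seed: {single:.3f}, bagged over 50 seeds: {bagged:.3f}")
```

In practice the same pattern applies to real classifiers: train N copies with different seeds and average their predictions.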
14. • One-hot encoding – create a bag-of-words (a bag the size of the number of unique categories) and assign 0/1’s to each representing column. Will fail on new categories.
• Label encoding – map each category to an integer. Will fail on new categories.
For features with many categories (rows:categories ratio 20:1 or less):
• Count encoding – count the occurrences of each category and use the counter as the value. Iterate the counter for each CV fold.
• Hash encoding – map categories to a reduced category space and one-hot encode. Produces good results but may be sub-optimal.
• Category embedding – use a function (e.g. an autoencoder) to map each category to Euclidean space.
• Likelihood encoding – apply the label average (likelihood) of each category.
Ways to deal with categorical features
Encode only categories appearing 3+ times – reduces the training feature space with no loss of info.
Dealing with NA’s depends on the situation. NA itself is an information unit! Usually a separate category is enough.
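Three of the encodings above (label, one-hot, count) in a minimal stdlib sketch, on a made-up feature:

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]

# Label encoding: map each category to an arbitrary integer.
labels = {c: i for i, c in enumerate(sorted(set(colors)))}
label_encoded = [labels[c] for c in colors]

# One-hot encoding: one 0/1 column per unique category.
cats = sorted(set(colors))
one_hot = [[int(c == cat) for cat in cats] for c in colors]

# Count encoding: replace each category by its frequency.
counts = Counter(colors)
count_encoded = [counts[c] for c in colors]

print(label_encoded)   # [2, 0, 2, 1, 2, 0]
print(count_encoded)   # [3, 2, 3, 1, 3, 2]
```

Note how count encoding collapses categories with the same frequency ("blue" and a hypothetical second 2-count category would collide), which is sometimes exactly the regularization you want.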
15. TIP #4 – AVOID LABEL LEAKAGE TO TRAINING SET
Likelihood encoding
X1        Label   X1m
“A1123”   0       0.5
“A1123”   1       0.5
“A1123”   1       0.5
“A1123”   0       0.5
“A1123”   NA      0.5
How not to do it! Label information of each row was transferred into the independent variable X1m – the label now has an indirect dependence on itself: label leakage introduced!
Imagine having a categorical feature with 100k distinct values in a 1M-row dataset – one-hot encoding each category => dimensionality curse for any model.
16. X1        Label   X1m
“A1123”   0       0.666 + rand(0, sigma)
“A1123”   1       0.333 + rand(0, sigma)
“A1123”   1       0.333 + rand(0, sigma)
“A1123”   0       0.666 + rand(0, sigma)
“A1123”   NA      0.5
How to do it? Might work, but risky.
Problem: selecting the optimum sigma value is painful and depends on how much you regularize your model. A holdout set might help, but not always.
Likelihood encoding
TIP #4 – AVOID LABEL LEAKAGE TO TRAINING SET
17. X1        Label   X1m
“A1123”   0       CV out-of-fold prediction
“A1123”   1       CV out-of-fold prediction
“A1123”   1       CV out-of-fold prediction
“A1123”   0       CV out-of-fold prediction
“A1123”   NA      CV predictions average
How to do it? Works well, but may underperform.
Nested cross-validation and a high number of CV folds are highly recommended; the additional CV calculations are done within each model CV fold. A tricky and error-prone method.
Problem: some degree of information loss due to the CV scheme.
A good option for larger datasets.
Likelihood encoding
TIP #4 – AVOID LABEL LEAKAGE TO TRAINING SET
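The out-of-fold scheme in the table can be sketched on toy data: each row’s encoded value is the category’s label mean computed only from the *other* folds, so a row never sees its own label. Data and fold split below are made up:

```python
from statistics import mean

# Toy data: one categorical feature and a binary label, split into 2 CV folds.
rows = [("A", 0), ("A", 1), ("B", 1), ("A", 1), ("B", 0), ("A", 0)]
folds = [[0, 1, 2], [3, 4, 5]]  # row indices per fold

encoded = [None] * len(rows)
for i, fold in enumerate(folds):
    # Compute each category's label mean using only the other folds...
    train = [rows[j] for k, f in enumerate(folds) if k != i for j in f]
    cat_means = {}
    for cat in {c for c, _ in train}:
        cat_means[cat] = mean(y for c, y in train if c == cat)
    # ...and apply it to the held-out fold.
    for j in fold:
        encoded[j] = cat_means.get(rows[j][0], 0.5)  # fallback: global prior

print(encoded)
```

Unseen categories fall back to a prior (0.5 here, an arbitrary choice) – the same trick handles the NA row from the table.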
18. Categorical features – interactions!
• If interactions are natural for a problem, ML only approximates them => sub-optimal
• Always test your method with all explicitly created possible 2-way interactions
• Feature selection on interactions may be important due to the exploding feature space (20 categorical features => 20*19/2 interactions!)
• If 2-way interactions help – go even further (3-way, 4-way, …)
• Applying the previous category transformations may be necessary
Interaction example:
X1   X2   X1*X2
A    C    AC
B    R    BR
A    R    AR
A    C    AC
B    R    BR
B    R    BR
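Explicit 2-way interactions are just pairwise concatenation of the category values, as in the X1*X2 column above; a minimal sketch with a made-up row:

```python
from itertools import combinations

row = {"X1": "A", "X2": "C", "X3": "R"}

# Explicit 2-way interactions: concatenate category values pairwise,
# turning e.g. (X1, X2) into a single new categorical feature "X1*X2".
interactions = {f"{a}*{b}": row[a] + row[b] for a, b in combinations(row, 2)}
print(interactions)  # {'X1*X2': 'AC', 'X1*X3': 'AR', 'X2*X3': 'CR'}
```

With k features this produces k*(k-1)/2 new columns, which is why feature selection on interactions matters.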
19. Numeric features
Feature transformations to consider:
• Scaling – min/max, N(0,1), root/power scaling, log scaling, Box-Cox, quantiles
• Rounding (too much precision might be noise!)
• Interactions {+, -, *, /}
• Row counters – #0’s, #NA’s, #negatives
• Row stats (if they make sense) – min, max, median, skewness…
• Row similarity scoring – cosine similarity
• Clustering
• …
Sometimes numeric features are categorical or even ordinal by nature – always consider that!
Tree methods are almost invariant to scaling; linear models (including NN’s) need careful raw-data treatment!
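Two of the transformations above (log scaling and rounding) in a minimal sketch, with made-up values:

```python
import math

prices = [3.14159, 27.18281, 99.99999]

# Log scaling compresses long-tailed numerics; log1p keeps 0 valid input.
logged = [math.log1p(p) for p in prices]

# Rounding drops precision that may just be noise to the model.
rounded = [round(p, 1) for p in prices]
print(rounded)  # [3.1, 27.2, 100.0]
```

For tree methods these transforms often change little; for linear models and NN’s they can matter a lot.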
20. • Regularized GLMs (underperforming, but good ensembling material)
• XGBoost (top pick for traditional data)
• LightGBM (top pick for traditional data)
• Keras (NN’s are always good with good pre-processing)
• LinearSVM (non-linear SVM is resource hungry and usually not worth it)
• Vowpal Wabbit (extremely fast online learning algorithms)
• Random forests (used to be popular; now underperform GBMs)
To get the most out of each method, you need a good understanding of how it works and how it should be tuned.
Methods
21. Tuning parameters = experience + intuition + resources at hand
Tuning parameters – expert approach (xgboost)
Typical routine:
1. Set the learning rate (eta) to 0.1-0.2 (based on dataset size and available resources); leave all other parameters at default. Personal experience: eta does not influence other parameter tuning!
2. Test the maximum tree depth parameter, rule: 6-8-10-12-14; pick the best performer on CV.
3. Introduce regularization on leaf splits (alpha/lambda), rule: 2^k; IF it helps on CV, tune it as much as possible!
4. Tune the min leaf node size (min_child_weight), rule: 0-1-5-10-20-50; IF it helps on CV, tune it as much as possible; IF it does not help, use value 0 (the default of 1 is worse most of the time!). If steps 3-4 help – repeat step 2.
5. Tune the randomness of each iteration (column/row sampling); usually 0.7/0.7 is best and rarely needs tuning.
6. Decrease eta to a value you are comfortable with on your hardware; 0.025 is typically a good choice; values lower than 0.01 don’t provide a significant score uplift.
7. If training time is reasonable, introduce bagging (num_parallel_tree) – how many trees are averaged in each boosting iteration (random forests + gradient boosting); recommended value: up to 5.
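Steps 2 and 4 of the routine can be sketched as simple grid loops; `cv_score` below is a made-up stand-in for “train xgboost with these params and return the CV score” (its peak at depth 8 and min_child_weight 0 is fabricated for illustration):

```python
# Hypothetical CV scorer: in reality this would run a full cross-validated
# xgboost training and return the mean validation metric.
def cv_score(params):
    return 1.0 - 0.01 * abs(params["max_depth"] - 8) \
               - 0.005 * abs(params["min_child_weight"] - 0)

params = {"eta": 0.1, "max_depth": 6, "min_child_weight": 1}

# Step 2: pick the tree depth from the rule-of-thumb grid, best CV wins.
params["max_depth"] = max([6, 8, 10, 12, 14],
                          key=lambda d: cv_score({**params, "max_depth": d}))

# Step 4: same idea for the min leaf node size.
params["min_child_weight"] = max(
    [0, 1, 5, 10, 20, 50],
    key=lambda w: cv_score({**params, "min_child_weight": w}))

print(params)  # {'eta': 0.1, 'max_depth': 8, 'min_child_weight': 0}
```

The point is the workflow, not the numbers: tune one parameter at a time, always scored on CV.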
22. Naive approach: apply grid search over the whole parameter space
- Zero effort and no supervision
- Enormous parameter space
- Very time consuming
Bayesian optimization methods: a trade-off between the expert and grid-search approaches
- Zero effort and no supervision
- Grid space is reduced based on previous iterations’ results (mimics expert decisions)
- Time consuming (still)
Tuning parameters – non-expert approach
Golden rule: finding the optimal parameter configuration is rarely a good time investment!
23. • Get a good public LB score as fast as possible (if you intend to team up with experienced people)
• Fail fast & often / agile sprints / fast iterations
• Use many different methods and remember to save all your models (both CV and test runs)
• Invest in generic modelling scripts for every method – a time saver long-term
• Debug, debug, debug – nothing is worse than finding a silly preprocessing error on the last day of the competition
• Write reproducible code – set seeds before any RNG tasks
• Learn to write efficient code (vectorization, parallelization; use C-, FORTRAN-, etc. based libs)
• Learn how to look at the data and do feature engineering
• Get access to reasonable hardware (8 CPU threads, 32GB RAM minimum); AWS is always an option
TIP #5 – USE THE SAME CV SPLIT FOR ALL YOUR MODELS!
Competitive advantage
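TIP #5 boils down to fixing the RNG seed of the fold split so every model is validated on identical folds; a minimal stdlib sketch:

```python
import random

def cv_split(n_rows, k, seed):
    """Shuffle row indices with a fixed seed and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Fixing the seed guarantees every model sees the exact same folds,
# which keeps out-of-fold predictions comparable when ensembling later.
folds_model_a = cv_split(100, 5, seed=2016)
folds_model_b = cv_split(100, 5, seed=2016)
print(folds_model_a == folds_model_b)  # True
```

If each model drew its own random folds, their out-of-fold predictions would be computed on different rows and could not be stacked or blended safely.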
24. Ideally you want a framework that can be applied to any competition. It should be:
• Data friendly (sparse/dense data, missing values, larger-than-memory data)
• Problem friendly (classification, regression, clustering, ranking, etc.)
• Memory friendly (garbage collection, avoid using the swap partition, etc.)
• Automated (runs unsupervised from data reading to generating submission files)
Competitive advantage – data flow pipeline
A good framework can save hours of repetitive work when moving from one competition to another.
25. 50% – feature engineering
30% – model diversity
10% – luck
10% – proper ensembling:
• Voting
• Averaging
• Bagging
• Boosting
• Binning
• Blending
• Stacking
Although only 10%, ensembling is the key that separates the top guys from the rest.
MODEL                                        PUBLIC MAE   PRIVATE MAE
Random Forests, 500 estimators               6.156        6.546
Extremely Randomized Trees, 500 estimators   6.317        6.666
KNN-Classifier with 5 neighbors              6.828        7.460
Logistic Regression                          6.694        6.949
Stacking with Extremely Randomized Trees     4.772        4.718
Ensembling added +30% efficiency
Success formula (personal opinion)
27. Ensembling by averaging
Say we have N classifiers: X1, X2, …, XN. We want to make a single prediction using a weighted average:
B1*X1 + B2*X2 + B3*X3 + … + BN*XN
TIP #6 – apply optimization to estimate the ensemble weights
A very common mistake is to select weights based on leaderboard feedback – inefficient & prone to leaderboard overfitting.
Instead, solve the problem on CV predictions with optimization algorithms:
optim(B1*X1 + B2*X2 + B3*X3 + … + BN*XN) with starting weights Bi = 1/N
It is critical to control the convergence criteria, as default values usually lead to suboptimal solutions!
TIP #7 – apply optim on each fold and average the Bi across folds
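Weight estimation on CV predictions can be sketched with a plain grid search for the two-model case; the prediction vectors and labels below are made up, and a real setup would hand the objective to a library optimizer:

```python
# Hypothetical out-of-fold predictions from two models, plus the true labels.
x1 = [0.2, 0.6, 0.3, 0.8]
x2 = [0.4, 0.9, 0.1, 0.6]
y  = [0,   1,   0,   1]

def blend_mse(w):
    """CV error of the weighted average (w, 1 - w) of the two models."""
    blend = [w * a + (1 - w) * b for a, b in zip(x1, x2)]
    return sum((p - t) ** 2 for p, t in zip(blend, y)) / len(y)

# Search the weight on CV predictions rather than probing the leaderboard.
best_w = min((i / 100 for i in range(101)), key=blend_mse)
print(f"weight on model 1: {best_w:.2f}")
```

Because the objective is evaluated on out-of-fold predictions, the chosen weights generalize far better than weights picked from leaderboard feedback.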
28. Ensembling by averaging
TIP #8 – apply transformations before averaging
For ranking-based models, apply a rank transformation to the predicted probabilities. Rank transformation makes the average more robust to outliers in the models’ predicted probabilities.
[Plot: empirical CDFs of rank-transformed vs. raw probabilities]
TIP #9 – always try the geometric mean as an option:
root(X1*X2*…*XN, N)
It sometimes outperforms the weighted average when the number of models is not too large.
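Both transformations from TIPs #8 and #9 in one minimal sketch, on two made-up models whose probabilities are calibrated very differently:

```python
import math

def to_ranks(preds):
    """Replace each prediction by its (0-based) rank among that model's outputs."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0] * len(preds)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

m1 = [0.01, 0.55, 0.99]   # spread-out probabilities
m2 = [0.30, 0.31, 0.32]   # compressed probabilities, same ordering

# Plain averaging lets m1's extreme values dominate; rank averaging does not.
rank_avg = [(a + b) / 2 for a, b in zip(to_ranks(m1), to_ranks(m2))]

# Geometric mean of probabilities: root(X1*X2*...*XN, N), here N = 2.
geo = [math.sqrt(a * b) for a, b in zip(m1, m2)]
print(rank_avg, geo)
```

After the rank transform, both models contribute equally regardless of how their probabilities are scaled; for AUC-style (rank-based) metrics this is all that matters.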
30. Structure of stacked models
TIP #10 – a stack of non-tuned models often outperforms a smaller stack of fine-tuned models
• Weaker models often struggle to compensate for a fine-tuned model’s weaknesses
• “More is better” motivates creating monster ensembles of sub-optimal models
• Model selection for the stacker > fine-tuning
31. Model selection for the ensemble
Having more models than necessary in an ensemble may hurt.
Say we have a library of created models. Usually a greedy-forward approach works well:
• Start with an ensemble of a few well-performing models
• Loop through each other model in the library and add it to the current ensemble
• Determine the best-performing ensemble configuration
• Repeat until the metric converges
During each loop iteration it is wise to consider only a subset of the library models, which works as a regularization for model selection.
Repeating the procedure a few times and bagging the results reduces the possibility of overfitting via model selection.
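The greedy-forward loop can be sketched on a made-up model library (the out-of-fold predictions and labels below are fabricated for illustration):

```python
from statistics import mean

# Hypothetical library: each model's out-of-fold predictions, plus true labels.
library = {
    "gbm": [0.1, 0.7, 0.6, 0.9],
    "nn":  [0.5, 0.9, 0.1, 0.6],
    "glm": [0.5, 0.5, 0.5, 0.5],
}
y = [0, 1, 0, 1]

def ensemble_mse(models):
    """CV error of the equal-weight average of the chosen models."""
    blend = [mean(library[m][i] for m in models) for i in range(len(y))]
    return sum((p - t) ** 2 for p, t in zip(blend, y)) / len(y)

# Greedy forward selection: start from the best single model, then keep
# adding whichever library model (with replacement) most improves the metric.
ensemble = [min(library, key=lambda m: ensemble_mse([m]))]
while True:
    best = min(library, key=lambda m: ensemble_mse(ensemble + [m]))
    if ensemble_mse(ensemble + [best]) >= ensemble_mse(ensemble):
        break
    ensemble.append(best)

print(ensemble)
```

Selection with replacement lets a strong model receive a larger effective weight; sampling a random subset of the library in each iteration (not shown) adds the regularization mentioned above.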
32. [Chart: logloss score progress over the duration of a competition – basic models on raw data => simple feature engineering => sophisticated feature engineering => minor tweaks + ensembling]
Rule of thumb: data first, ML later
1. Create a few simple models first – having a good dataflow pipeline early is very convenient
2. Data exploration and visualization. It might be boring and frustrating, but it pays off well. Excel is underrated in this respect.
3. Think of how to make smart features; avoid plain linear combinations – {/, *} are superior
4. Winners usually find something that most people struggle to see in the data. Not many people look at the data at all!
34. Find something that no one else can find = a guaranteed top-place finish
BNP Paribas competition:
• Medium-sized dataset with 100+ anonymized features
• Numerical features scaled to fit the [0; 20] interval, all looking like noise
• Dedicated 3 weeks to data exploration – found something no one else did
• Moved from an i.i.d. dataset assumption to a time-ordered dataset
• A whole new class of time-related feature engineering
35. Feature engineering – summary
Feature engineering is hard and is the most creative part of Kaggle – not many enjoy it.
Typical feature transformations:
• Multiplication & ratios of numeric features
• Log, Box-Cox transformations
• PCA, t-SNE outputs
• Percentile ranks
• Interactions of categorical features (A, B => “AB”), multiple levels of interactions
• Lag & lead features (for time-ordered data)
• Target likelihood ratios of categorical features
• One-hot encoding / label encoding of categorical features
• …
Strong explicit features in simple models often beat monster ensembles with no feature engineering (personal experience).
36. Data leakage – competition design
Sometimes a competition is flawed by data design: you can directly infer information about the test set from the training set (leakage). For example:
• The order of rows matters because the data rows have not been randomly shuffled
• The same subject appears in both the train & test sets
• Time-based data structure with improper data shuffling, etc.
Detecting and exploiting a leakage can give a significant advantage, and many people dedicate a significant amount of time to it.
As a result, the outcome of the competition is not what the sponsor wanted to achieve.
37. Final remarks
• Kaggle is a playground for hyper-optimization and stacking – for business, any solution in the top 10% of the rankings is sufficient.
• Teaming up is important – a simple average of, e.g., the 9th and 10th place finishes can sometimes beat the 1st place solution.
• No winning solution goes straight to production, due to complexity or occasional leakages.
• Sponsors often benefit more from the forum discussions and the data visualizations provided.