This document covers model selection and tuning at scale, using a 1TB Criteo click-through dataset as a case study. Slices of increasing size are used to test and tune gradient boosted trees (GBTs) and other models. On small slices, GBT performed best; tuning on slices up to 10% of the data showed that optimal tree depth grows logarithmically with data size. Online learning with VW was also efficient, needing minimal tuning. The document cautions that genuine model selection and tuning at scale means starting from samples larger than a few GBs; otherwise you are extrapolating from small data.
2. About us
Owen Zhang
Chief Product Officer @ DataRobot
Former #1 ranked Data Scientist on
Kaggle
Former VP, Science @ AIG
Peter Prettenhofer
Software Engineer @ DataRobot
Scikit-learn core developer
4. Model Selection
● Estimating the performance of different models in order to choose the best one.
● K-Fold Cross-validation
● The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked models require nested layers
[Diagram: data split into folds 1-5, partitioned into Train / Validation / Holdout]
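The partitioning scheme on this slide can be sketched with scikit-learn; the dataset, model, and split sizes here are stand-ins, not the setup used in the talk:

```python
# Minimal sketch: k-fold CV for model selection, with a holdout
# set kept untouched until the very end.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# Carve off the holdout first so it never influences model selection.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(random_state=0)
# 5-fold CV on the remaining data estimates out-of-sample performance.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# Only after a model is chosen: fit on all training data, score the holdout once.
holdout_acc = model.fit(X_train, y_train).score(X_hold, y_hold)
```

Keeping the holdout out of every CV fold is what guards against the leakage pitfall listed above.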
8. Model Tuning
● Optimizing the performance of a model
● Example: Gradient Boosted Trees
○ Nr of trees
○ Learning rate
○ Tree depth / Nr of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate
9. Search Space
Number of candidate values per hyperparameter:

Hyperparameter         | GBRT (naive) | GBRT | RandomForest
Nr of trees            |            5 |    1 |            1
Learning rate          |            5 |    5 |            -
Tree depth             |            5 |    5 |            1
Min leaf size          |            3 |    3 |            3
Example subsample rate |            3 |    1 |            1
Feature subsample rate |            2 |    2 |            5
Total combinations     |         2250 |  150 |           15
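The totals are just the products of the per-parameter counts; a quick sanity check with the counts copied from the table:

```python
# The size of a full grid search is the product of the number of
# candidate values for each hyperparameter.
from math import prod

# Counts per row of the table: trees, learning rate, depth,
# min leaf size, example subsample rate, feature subsample rate.
naive_gbrt = [5, 5, 5, 3, 3, 2]
gbrt = [1, 5, 5, 3, 1, 2]   # trees and example rate fixed
rf = [1, 1, 3, 1, 5]        # no learning-rate dimension

print(prod(naive_gbrt), prod(gbrt), prod(rf))  # 2250 150 15
```

Fixing even one or two dimensions (e.g. choosing the number of trees by early stopping) shrinks the grid by an order of magnitude.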
11. Challenges at Scale
● Why is learning with more data harder?
○ Paradox: more data would allow more complex models, but computational constraints get in the way*
○ => we need more efficient ways of creating complex models!
● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**
* Pedro Domingos, A few useful things to know about machine learning, 2012.
** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)
12. A case study -- binary classification on 1TB of data
● Criteo click through data
● Down sampled ads impression data on 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Using day 0 - day 22 data for training, day 23 data for testing
13. Big Data?
Data size:
● ~46GB/day
● ~180,000,000 rows/day
However it is very imbalanced (even after downsampling non-events)
● ~3.5% events rate
Further downsampling of non-events to a balanced dataset will reduce the size of data to ~70GB
● Will fit into a single node under “optimal” conditions
● Loss of model accuracy is negligible in most situations
Assuming a 0.1% raw event (click-through) rate:
Raw data: 35TB @ 0.1% events → downsampled: 1TB @ 3.5% → balanced: 70GB @ 50%
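The funnel arithmetic above can be checked back-of-the-envelope, assuming roughly constant bytes per row:

```python
# Back-of-the-envelope check of the downsampling funnel:
# keep all event rows, drop non-events until the target event rate.
raw_tb = 35.0
raw_event_rate = 0.001            # 0.1% clicks in the raw logs

event_tb = raw_tb * raw_event_rate   # ~0.035 TB of event rows
# Downsample non-events until events are 3.5% of all rows:
stage1_tb = event_tb / 0.035         # ~1 TB
# Downsample further to a 50/50 balanced set (events + equal non-events):
balanced_tb = 2 * event_tb           # ~0.07 TB, i.e. ~70 GB
```

Note that downsampling non-events biases the predicted probabilities, which must be corrected if calibrated outputs are needed; the slide states the accuracy loss itself is negligible in most situations.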
14. Where to start?
● 70GB (~260,000,000 data points) is still a lot of data
● Let’s take a tiny slice of that to experiment
○ Take 0.25%, then 0.5%, then 1%, and do grid search on each
[Chart: model accuracy vs. time in seconds for RF, ASVM, regularized regression, GBM (with count features), and GBM (without count features); arrow marks the "better" direction]
15. GBM is the way to go, let’s go up to 10% data
[Chart: performance over # of trees, with curves labeled by sample size / tree depth / time to finish]
16. A “Fairer” Way of Comparing Models
A better model when time is the constraint
18. Tree Depth vs Data Size
● A natural heuristic -- increment tree depth by 1 every time data size doubles
[Chart: optimal tree depth at 1%, 2%, 4%, and 10% sample sizes]
Optimal Depth = a + b * log(DataSize)
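The heuristic and the formula agree when b = 1/log(2), since then doubling the data adds exactly one level. A minimal sketch (the offset a is a placeholder to be fit on your own tuning results):

```python
# "Increment tree depth by 1 every time data size doubles" implies
# depth(n) = a + log(n)/log(2), because log2(2n) - log2(n) = 1.
import math

def optimal_depth(n, a=1.0):
    # a is a hypothetical offset, not a value from the talk.
    return a + math.log(n) / math.log(2)

d1 = optimal_depth(1_000_000)
d2 = optimal_depth(2_000_000)
print(d2 - d1)  # doubling the data adds exactly one level
```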
19. What about VW?
● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly
● Supports "every k" validation
● The only "tuning" REQUIRED is the specification of interactions
○ Thanks to progressive validation, bad interactions can be detected immediately, so no time is wasted
20. Data pipeline for VW
[Diagram: the training set is randomly split into parts T1 … Tm; each part is randomly shuffled into T1s … Tms; the shuffled parts are then concatenated and interleaved into a single stream; the test set is kept separate]
It takes longer to prep the data than to run the model!
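The prep step in the diagram can be sketched as follows; the round-robin split and in-memory lists are simplifications of what would be file-level operations at 70GB scale:

```python
# Online learners are order-sensitive, so the training data is split
# into m parts, each part is shuffled, and the shuffled parts are
# interleaved into one stream before feeding them to the learner.
import random
from itertools import zip_longest

def shuffle_and_interleave(rows, m=4, seed=0):
    rng = random.Random(seed)
    parts = [rows[i::m] for i in range(m)]   # split into m parts
    for part in parts:
        rng.shuffle(part)                    # shuffle within each part
    # Interleave: take one row from each shuffled part in turn.
    return [r for group in zip_longest(*parts) for r in group if r is not None]

stream = shuffle_and_interleave(list(range(20)))
```

Every input row survives the reshuffle; only the order changes, which breaks up any time-based structure in the raw logs.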
23. Do We Really “Tune/Select Model @ Scale”?
● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance and hyperparameters extrapolate as expected
● If you start the model tuning/selection process with GBs (or even 100s of MBs) of data, you are doing it wrong!
24. Some Interesting Observations
● At least for some datasets, it is very hard for a "pure linear" model to outperform (accuracy-wise) a non-linear model, even with much more data
● There is meaningful structure in the hyperparameter space
● When time is limited (relative to data size), running "deeper" models on smaller data samples may actually yield better results
● To fully exploit the data: model estimation time is usually at least proportional to n*log(n), and we need models whose number of parameters can scale with the number of data points
○ GBM can have as many parameters as we want
○ So can factorization machines
● For any data and any model, we run into "diminishing returns" as the data gets bigger and bigger
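The diminishing-returns point can be illustrated with a toy learning curve; the dataset and model here are arbitrary stand-ins:

```python
# Toy learning curve: test accuracy as the training size grows 10x per step.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = []
for n in (100, 1000, 10000):
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    scores.append(clf.score(X_te, y_te))
# Typically, each 10x increase in data buys less accuracy than the last.
```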