<featured> Meetup event hosted by NYC Open Data Meetup, NYC Data Science Academy. Speaker: Owen Zhang, Event Info: http://www.meetup.com/NYC-Open-Data/events/219370251/
How to Win Machine Learning Competitions ? HackerEarth
This presentation was given by Marios Michailidis (a.k.a Kazanova), Current Kaggle Rank #3 to help community learn machine learning better. It comprises of useful ML tips and techniques to perform better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
Winning Kaggle 101: Introduction to StackingTed Xiao
An Introduction to Stacking by Erin LeDell, from H2O.ai
Presented as part of the "Winning Kaggle 101" event, hosted by Machine Learning at Berkeley and Data Science Society at Berkeley. Special thanks to the Berkeley Institute of Data Science for the venue!
H2O.ai: http://www.h2o.ai/
ML@B: ml.berkeley.edu
DSSB: http://dssberkeley.org
BIDS: http://bids.berkeley.edu/
How to Win Machine Learning Competitions ? HackerEarth
This presentation was given by Marios Michailidis (a.k.a Kazanova), Current Kaggle Rank #3 to help community learn machine learning better. It comprises of useful ML tips and techniques to perform better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
Winning Kaggle 101: Introduction to StackingTed Xiao
An Introduction to Stacking by Erin LeDell, from H2O.ai
Presented as part of the "Winning Kaggle 101" event, hosted by Machine Learning at Berkeley and Data Science Society at Berkeley. Special thanks to the Berkeley Institute of Data Science for the venue!
H2O.ai: http://www.h2o.ai/
ML@B: ml.berkeley.edu
DSSB: http://dssberkeley.org
BIDS: http://bids.berkeley.edu/
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Open Source Tools & Data Science Competitions odsc
This talk shares the presenter’s experience with open source tools in data science competitions. In the past several years Kaggle and other competitions have created a large online community of data scientists. In addition to competing with each other for fame and glory, members of this community also generously share knowledge, insights using forum and open source code. The open competition and sharing have resulted in rapid progress in the sophistication of the entire community. This presentation will briefly cover this journey from a competitor’s perspective, and share hands on tips on some open source tools proven popular and useful in recent competitions.
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/wcFdmQSX6hM
Description:
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
Speaker's Bio:
Dmitry has more than 10 years of experience in IT. Starting with data warehousing and BI, now in big data and data science. He has a lot of experience in predictive analytics software development for different domains and tasks. He is also a Kaggle Grandmaster who loves to use his machine learning and data science skills on Kaggle competitions.
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced Machine learning. In this meet-up, we will demonstrate how to understand and use Xgboost for Kaggle competition. Tong is in Canada and will do remote session with us through google hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
This is the slide from my talk at FULokoja Ingressive meetup.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured and structured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree-based algorithms are considered best-in-class right now. XGBoost model has the best combination of prediction performance and processing time compared to other algorithms.
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Data Scientists and Machine Learning practitioners, nowadays, seem to be churning out models by the dozen and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks & languages , and a typical organization may find that this results in a heterogenous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production" ? How does an organization scale inference engines out & make them available for real-time applications without significant latencies ? There needs to be different techniques for batch (offline) inferences and instant, online scoring. Data needs to be accessed from various sources and cleansing, transformations of data needs to be enabled prior to any predictions. In many cases, there maybe no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Feature Importance Analysis with XGBoost in Tax auditMichael BENESTY
Presentation of a real use case at TAJ law firm (Deloitte Paris) of applying Machine learning on accounting to help clients to prepare their tax audit.
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin who will be presenting an Introduction to Hadoop for Python Programmers a well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses, and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor, and worked at Google until taking his current position as VP of Data Engineering in NYC Data Science Academy.
--------------------------------------
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Open Source Tools & Data Science Competitions odsc
This talk shares the presenter’s experience with open source tools in data science competitions. In the past several years Kaggle and other competitions have created a large online community of data scientists. In addition to competing with each other for fame and glory, members of this community also generously share knowledge, insights using forum and open source code. The open competition and sharing have resulted in rapid progress in the sophistication of the entire community. This presentation will briefly cover this journey from a competitor’s perspective, and share hands on tips on some open source tools proven popular and useful in recent competitions.
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/wcFdmQSX6hM
Description:
In this talk, Dmitry shares his approach to feature engineering which he used successfully in various Kaggle competitions. He covers common techniques used to convert your features into numeric representation used by ML algorithms.
Speaker's Bio:
Dmitry has more than 10 years of experience in IT. Starting with data warehousing and BI, now in big data and data science. He has a lot of experience in predictive analytics software development for different domains and tasks. He is also a Kaggle Grandmaster who loves to use his machine learning and data science skills on Kaggle competitions.
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced Machine learning. In this meet-up, we will demonstrate how to understand and use Xgboost for Kaggle competition. Tong is in Canada and will do remote session with us through google hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
This is the slide from my talk at FULokoja Ingressive meetup.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured and structured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree-based algorithms are considered best-in-class right now. XGBoost model has the best combination of prediction performance and processing time compared to other algorithms.
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Data Scientists and Machine Learning practitioners, nowadays, seem to be churning out models by the dozen and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks & languages , and a typical organization may find that this results in a heterogenous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production" ? How does an organization scale inference engines out & make them available for real-time applications without significant latencies ? There needs to be different techniques for batch (offline) inferences and instant, online scoring. Data needs to be accessed from various sources and cleansing, transformations of data needs to be enabled prior to any predictions. In many cases, there maybe no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Feature Importance Analysis with XGBoost in Tax auditMichael BENESTY
Presentation of a real use case at TAJ law firm (Deloitte Paris) of applying Machine learning on accounting to help clients to prepare their tax audit.
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin who will be presenting an Introduction to Hadoop for Python Programmers a well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses, and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor, and worked at Google until taking his current position as VP of Data Engineering in NYC Data Science Academy.
--------------------------------------
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Vivian S. Zhang
This project was completed by students graduated from NYC Data Science Academy 12-week Data Science Bootcamp. Learn more about the bootcamp: http://nycdatascience.com/data-science-bootcamp/
Watch the project presentation: https://youtu.be/W530d2ZdbJE
Ranked #15 out of 3,274 teams on Kaggle Team Members - Brandy Freitas, Chase Edge and Grant Webb
Given 4 years of housing price data in a foreign market, predicting the following year’s prices should be pretty straightforward, right? But what if in that last year of data, the country’s stock market, the value of its currency and the price of its number 1 export, all dropped by nearly 50%. And on top of all that, the country was slapped with economic sanctions by the EU and the US. This was Moscow in 2014 and as you can see, it was anything but straightforward.
We were able to overcome these challenges and in the two weeks of working together, were able to achieve a top 1% ranking on Kaggle. Our success is a product of our in depth data cleaning, feature engineering and our approach to modeling. With a focus on interpretability and simplicity, we begin modeling using linear regression and decision trees which gave us a better understanding of the data. We then utilized more complicated models such as random forests and XGBoost which ultimately resulted in our top submission.
Twitter: @NycDataSci
Learn with our NYC Data Science Program (weekend courses for working professionals and 12 week full time for whom are advancing their career into Data Science)
Our next 12-Week Data Science Bootcamp starts in Jun. (Deadline to apply is May 1st, all decisions will be made by May 15th.)
====================================
Max Kuhn, Director is Nonclinical Statistics of Pfizer and also the author of Applied Predictive Modeling.
He will join us and share his experience with Data Mining with R.
Max is a nonclinical statistician who has been applying predictive models in the diagnostic and pharmaceutical industries for over 15 years. He is the author and maintainer for a number of predictive modeling packages, including: caret, C50, Cubist and AppliedPredictiveModeling. He blogs about the practice of modeling on his website at ttp://appliedpredictivemodeling.com/blog
---------------------------------------------------------
His Feb 18th course can be RSVP at NYC Data Science Academy.
Syllabus
Predictive Modeling using R
Description
This class will get attendees up to speed in predictive modeling using the R programming language. The goal of the course is to understand the general predictive modeling process and how it can be implemented in R. A selection of important models (e.g. tree-based models, support vector machines) will be described in an intuitive manner to illustrate the process of training and evaluating models.
Prerequisites:
Attendees should have a working knowledge of basic R data structures (e.g. data frames, factors etc) and language fundamentals such as functions and subsetting data. Understanding of the content contained in Appendix B sections B1 though B8 of Applied Predictive Modeling (free PDF from publisher [1]) should suffice.
Outline:
- An introduction to predictive modeling
- R and predictive modeling: the good and bad
- Illustrative example
- Measuring performance
- Data splitting and resampling
- Data pre-processing
- Classification trees
- Boosted trees
- Support vector machines
If time allows, the following topics will also be covered
- Parallel processing
- Comparing models
- Feature selection
- Common pitfalls
Materials:
Attendees will be provided with a copy of Applied Predictive Modeling[2] as well as course notes, code and raw data. Participants will be able to reproduce the examples described in the workshop.
Attendees should have a computer with a relatively recent version of R installed.
About the Instructor:
More about Max's work:
[1] http://rd.springer.com/content/pdf/bbm%3A978-1-4614-6849-3%2F1.pdf
[2] http://appliedpredictivemodeling.com
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)Vivian S. Zhang
Data Science Academy, Hack session, NY Times, Dialect Map, Data science by R, Vivian S. Zhang, see www.nycdatascience.com for more details. Joint work by Data Scientist team of SupStat Inc. a New York based data analytic and visualization consulting firm.
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
This talk was presented to NYC Open Data Meetup Group on Nov 11, 2014.
Speaker:
Daeil Kim is currently a data scientist at the Times and is finishing up his Ph.D at Brown University on work related to developing scalable inference algorithms for Bayesian Nonparametric models. His work at the Times spans a variety of problems related to the company's business interests, audience development, as well as developing tools to aid journalism.
Topic:
This talk will focus mostly on how machine learning can help problems that prop up in journalism. We'll begin first by talking about using popular supervised learning algorithms such as regularized Logistic Regression to help assist a journalist's work in uncovering insights into a story regarding the recall of Takata airbags in cars. Afterwards, we'll think about using topic modeling to deal with large document dumps generated from FOIA (Freedom of Information Act) requests and Refinery, a simple web based tool to ease the implementation of such tasks. Finally, if there is time, we will go over how topic models have been extended to assist in the problem of designing an efficient recommendation engine for text-based content.
This project was completed by Scott Dobbins and Rachel Kogan, who enrolled in the NYC Data Science Academy's 12-Week Data Science Bootcamp. Learn more about the program: http://nycdatascience.com/data-science-bootcamp/
Given that both Wikipedia and comments sections of most websites are freely open to anyone to edit at any time, how has Wikipedia managed to remain such a useful resource while most comments sections are ridden with vandalism, ads, and other counterproductive user behavior?
We believe the answer is two-fold: 1) Wikipedia has an army of bots that quickly identify and revert vandalism so that the worst edits are usually never seen by people and the site generally maintains itself in a well-kempt state, and 2) Wikipedia has a strong community of administrators and other contributors who routinely clean the site’s flagged contents.
Vandalism is relatively easy to flag, though a few clever edits manage to stay on the site for a long time. What about site content problems that are more subjective, like bias? Wikipedia users do routinely manually flag pages with point-of-view (POV) issues, though with millions of pages and no machine-based approaches, the site can only manage to confidently maintain neutrality on the more well-trafficked pages.
Here we propose a solution to solve some of the more intractable content issues for Wikipedia and other sites using Natural Language Processing (NLP) and machine learning approaches. The sheer quantity of data managed by Wikipedia and similar sites requires distributed computing approaches, so we show here how Apache Spark can upgrade common algorithms to run on massive data sets.
A Hybrid Recommender with Yelp Challenge Data Vivian S. Zhang
Developed by Chao Shi, Sam O'Mullane, Sean Kickham, Reza Rad and Andrew Rubino
Watch the project presentation: https://youtu.be/gkKGnnBenyk
This project was completed by students from NYC Data Science Academy's 12-Week Bootcamp. Learn more about the bootcamp: http://nycdatascience.com/data-science-bootcamp/
People make decisions on where to eat based on friends’ recommendations. Since they know you, their suggestions matter more than those of strangers.
For the capstone project, we built a hybrid Yelp recommendation system that can provide individualized recommendations based on your friend’s reviews on the social network. We built the machine learning models using Spark, and set up a Flask-Kafka-RDS-Databricks pipeline that allows a continuous stream of user requests.
During the presentation, we will talk about the development framework and technical implementation of the pipeline.
Read on their project posts and code:
https://blog.nycdatascience.com/student-works/capstone/yelp-recommender-part-1/
https://blog.nycdatascience.com/student-works/yelp-recommender-part-2/
Winning Data Science Competitions (Owen Zhang) - 2014 Boston Data Festivalfreshdatabos
Owen Zhang is no stranger to data science competitions. He has competed in and won several high profile challenges, and is currently ranked 1st out of a community of 200,000 data scientists on Kaggle. This is an opportunity to learn the tips, tricks and techniques Owen employs in building world-class predictive analytic solutions
My presentation about how to get started with competitive data mining at the data mining research group of Department of Computer Science and Engineering, University of Moratuwa.
My presentation about how to get started with competitive data mining at the meeting of data mining research group of Department of Computer Science and Engineering, University of Moratuwa.
Valencian Summer School in Machine Learning 2017 - Day 1
Lectures Review: Summary Day 1 Sessions. By Mercè Martín (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
VSSML16 LR1. Summary Day 1
Valencian Summer School in Machine Learning 2016
Day 1
Summary Day 1
Mercè Martin (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
This talk will focus on Techniques, metrics and different tests (code, models, infra and features/data) that help the developers of machine learning systems to achieve CD.
We live with an abundance of ML resources; from open source tools, to GPU workstations, to cloud-hosted autoML. What’s more, the lines between AI research and everyday ML have blurred; you can recreate a state-of-the-art model from arxiv papers at home. But can you afford to? In this talk, we explore ways to recession-proof your ML process without sacrificing on accuracy, explainability, or value.
"What we learned from 5 years of building a data science software that actual...Dataconomy Media
"What we learned from 5 years of building a data science software that actually works for everybody." Dr. Dennis Proppe, CTO and Chief Data Scientist at GPredictive GmbH
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
https://www.youtube.com/c/DataNatives
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2017: http://bit.ly/1WMJAqS
About the Author:
Dennis Proppe is the CTO and Chief Data Scientist at Gpredictive, where he helps building software that enables data scientists to build and deploy predictive models in a few minutes instead of weeks. He has 10 years+ of expertise in extracting business value from data. Before co-founding Gpredictive, he worked as a marketing science consultant. Dennis holds a Ph.D. in statistical marketing.
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
Data science isn't an easy task to pull of.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero, is about the work process you need to take in order to have a production ready workflow up and running.
Covering :
* Small - Medium experimentation (R)
* Big data implementation (Spark Mllib /+ pipeline)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...Dataconomy Media
Abstract of the Presentation:
When Dat Tran started his data science career in 2013, everyone was into big data. In fact, big data was at the peak of inflated expectations (according to Gartner). You had to use tools like Hadoop and Spark to be one of the cool kids. Many data prophets out there told you that data is the new oil, or even gold. Year 2018, things haven’t changed. Data is still cool and going strong. It’s eating the world- and yes, you still need big data, and now also deep deep very deep learning. There’s a lot of bullshit bingo out there.
In this talk, Dat Tran wants to demystify the buzz in machine learning by presenting some simple guidelines for successful data projects and real practical use cases. He will also share use cases from idealo, Germany’s largest price comparison service. And yes it involves deep learning, and yes it can be quite technical sometimes as well.
About the Author:
Dat Tran is currently co-heading the data team at idealo.de, where he leads a team of Data Scientists and Data Engineers. His aim is to turn idealo into a machine learning powerhouse. His research interests are diverse, from traditional machine learning to deep learning. Previously, he worked for Pivotal Labs and Accenture. He is a regular speaker and has presented at PyData and Cloud Foundry Summit. He also blogs about his work on Medium. His background is in Operations Research and Econometrics. Dat received his MSc in Economics from Humboldt University of Berlin.
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
MLOps Lifecycle
ML problem framing
ML solution architecture
Data preparation and processing
ML model development
ML pipeline automation and orchestration
ML solution monitoring, optimization, and maintenance
This document list the reasons why our past alumni chose NYC Data Science Academy over other programs.
Machine Learning Bootcamp is our flagship program and well received by our community.
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
Scikit-learn is a machine learning library in Python, that has become a valuable tool for many data science practitioners.
This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning.
Apart from metrics for model evaluation, we will cover how to evaluate model complexity, and how to tune parameters with grid search, randomized parameter search, and what their trade-offs are. We will also cover out of core text feature processing via feature hashing.
---------------------------------------------------------
Andreas is an Assistant Research Scientist at the NYU Center for Data Science, building a group to work on open source software for data science. Previously he worked as a Machine Learning Scientist at Amazon, working on computer vision and forecasting problems. He is one of the core developers of the scikit-learn machine learning library, and maintained it for several years.
Material will be posted here:
https://github.com/amueller/pydata-nyc-advanced-sklearn
Blog:
peekaboo-vision.blogspot.com
Twitter:
https://twitter.com/t3kcit
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...Vivian S. Zhang
NYC Data Science Academy, Data Science by R Intensive Beginner level, R003 student, Laila, presented on restaurant sanitation report using NYC Open Data Set, see her blog post at http://nycdatascience.com/2014/05/pizza-everyone-loves-pizza/
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...Vivian S. Zhang
NYC Data Science Academy, Data Science by R Intensive Beginner level, R003 student, Jiten presented how he scrapped dataset and did south park episode popularity analysis.
The Indian economy is classified into different sectors to simplify the analysis and understanding of economic activities. For Class 10, it's essential to grasp the sectors of the Indian economy, understand their characteristics, and recognize their importance. This guide will provide detailed notes on the Sectors of the Indian Economy Class 10, using specific long-tail keywords to enhance comprehension.
For more information, visit-www.vavaclasses.com
We all have good and bad thoughts from time to time and situation to situation. We are bombarded daily with spiraling thoughts(both negative and positive) creating all-consuming feel , making us difficult to manage with associated suffering. Good thoughts are like our Mob Signal (Positive thought) amidst noise(negative thought) in the atmosphere. Negative thoughts like noise outweigh positive thoughts. These thoughts often create unwanted confusion, trouble, stress and frustration in our mind as well as chaos in our physical world. Negative thoughts are also known as “distorted thinking”.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Ethnobotany and Ethnopharmacology:
Ethnobotany in herbal drug evaluation,
Impact of Ethnobotany in traditional medicine,
New development in herbals,
Bio-prospecting tools for drug discovery,
Role of Ethnopharmacology in drug evaluation,
Reverse Pharmacology.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
2. A plug for myself
Current
● Chief Product Officer
Previous
● VP, Science
3. Agenda
● Structure of a Data Science Competition
● Philosophical considerations
● Sources of competitive advantage
● Some technical tips
● Two cases -- Amazon Allstate
● Apply what we learn out of competitions
Technique
Strategy
Philosophy
4. Data Science Competitions remind us that the purpose of a
predictive model is to predict on data that we have NOT seen.
Training Public LB
(validation)
Private LB
(holdout)
Build model using Training Data to
predict outcomes on Private LB Data
Structure of a Data Science Competition
Quick but often misleading feedback
5. A little “philosophy”
● There are many ways to overfit
● Beware of “multiple comparison fallacy”
○ There is a cost in “peeking at the answer”,
○ Usually the first idea (if it works) is the best
“Think” more, “try” less
6. Sources of Competitive Advantage (the Secret)
● Discipline (once bitten twice shy)
○ Proper validation framework
● Effort
● (Some) Domain knowledge
● Feature engineering
● The “right” model structure
● Machine/statistical learning packages
● Coding/data manipulation efficiency
● Luck
Be Disciplined
+
Work Hard
+
Learn from
everyone
+
Luck
7. Technical Tricks -- GBM
● My confession: I (over)use GBM
○ When in doubt, use GBM
● GBM automatically approximate
○ Non-linear transformations
○ Subtle and deep interactions
● GBM gracefully treats missing values
● GBM is invariant to monotonic transformation of
features
8. Technical Tricks -- GBM Tuning
● Tuning parameters
○ Learning rate + number of trees
■ Usually small learning rate + many trees work
well. I target 1000 trees and tune learning rate
○ Number of obs in leaf
■ How many obs you need to get a good mean
estimate?
○ Interaction depth
■ Don’t be afraid to use 10+, this is (roughly) the
number of leaf nodes
9. Technical Tricks -- when GBM needs help
● High cardinality features
○ These are very commonly encounterd -- zip code, injury type,
ICD9, text, etc.
○ Convert into numerical with preprocessing -- out-of-fold average,
counts, etc.
○ Use Ridge regression (or similar) and
■ use out-of-fold prediction as input to GBM
■ or blend
○ Be brave, use N-way interactions
■ I used 7-way interaction in the Amazon competition.
● GBM with out-of-fold treatment of high-cardinality feature performs
very well
10. Technical Tricks -- Stacking
● Basic idea -- use one model’s output as the next model’s input
● It is NOT a good idea to use in sample prediction for stacking
○ The problem is over-fitting
○ The more “over-fit” prediction1 is , the more weight it will get in
Model 2
Text Features
Model 2
GBM
Prediction 1
Model 1
Ridge
Regression
Final
Prediction
Num Features
11. Technical Tricks -- Stacking -- OOS / CV
● Use out of sample predictions
○ Take half of the training data to build model 1
○ Apply model 1 on the rest of the training data,
use the output as input to model 2
● Use cross-validation partitioning when data limited
○ Partition training data into K partitions
○ For each of the K partition, compute “prediction
1” by building a model with OTHER partitions
12. Technical Tricks -- feature engineering in GBM
● GBM only APPROXIMATE interactions and non-
linear transformations
● Strong interactions benefit from being explicitly
defined
○ Especially ratios/sums/differences among
features
● GBM cannot capture complex features such as
“average sales in the previous period for this type of
product”
13. Technical Tricks -- Glmnet
● From a methodology perspective, the opposite of
GBM
● Captures (log/logistic) linear relationship
● Work with very small # of rows (a few hundred or
even less)
● Complements GBM very well in a blend
● Need a lot of more work
○ missing values, outliers, transformations (log?),
interactions
● The sparsity assumption -- L1 vs L2
14. Technical Tricks -- Text mining
● tau package in R
● Python’s sklearn
● L2 penalty a must
● N-grams work well.
● Don’t forget the “trivial features”: length of text,
number of words, etc.
● Many “text-mining” competitions on kaggle are
actually dominated by structured fields -- KDD2014
15. Technical Tricks -- Blending
● All models are wrong, but some are useful (George
Box)
○ The hope is that they are wrong in different ways
● When in doubt, use average blender
● Beware of temptation to overfit public leaderboard
○ Use public LB + training CV
● The strongest individual model does not necessarily
make the best blend
○ Sometimes intentionally built weak models are good blending
candidates -- Liberty Mutual Competition
16. Technical Tricks -- blending continued
● Try to build “diverse” models
○ Different tools -- GBM, Glmnet, RF, SVM, etc.
○ Different model specifications -- Linear,
lognormal, poisson, 2 stage, etc.
○ Different subsets of features
○ Subsampled observations
○ Weighted/unweighted
○ …
● But, do not “peek at answers” (at least not too much)
17. Apply what we learn outside of competitions
● Competitions give us really good models, but we also need to
○ Select the right problem and structure it correctly
○ Find good (at least useful) data
○ Make sure models are used the right way
Competitions help us
● Understand how much “signal” exists in the data
● Identify flaws in data or data creation process
● Build generalizable models
● Broaden our technical horizon
● …
18. Case 1 -- Amazon User Access competition
● One of the most popular competitions on Kaggle to date
○ 1687 teams
● Use anonymized features to predict if employee access
request would be granted or denied
● All categorical features
○ Resource ID / Mgr ID / User ID / Dept ID …
○ Many features have high cardinality
● But I want to use GBM
19. Case 1 -- Amazon User Access competition
● Encode categorical features using observation counts
○ This is even available for holdout data!
● Encode categorical features using average response
○ Average all but one (example on next slide)
○ Add noise to the training features
● Build different kind of trees + ENET
○ GBM + ERT + ENET + RF + GBM2 + ERT2
● I didn't know VW (or similar), otherwise might have got better
results.
● https://github.com/owenzhang/Kaggle-AmazonChallenge2013
20. Case 1 -- Amazon User Access competition
“Leave-one-out” encoding of categorical features:
Split User ID Y mean(Y) random Exp_UID
Training A1 0 .667 1.05 0.70035
Training A1 1 .333 .97 0.32301
Training A1 1 .333 .98 0.32634
Training A1 0 .667 1.02 0.68034
Test A1 - .5 1 .5
Test A1 - .5 1 .5
Training A2 0
21. Case 2 -- Allstate User Purchase Option Prediction
● Predict final purchased product options based on earlier
transactions.
○ 7 correlated targets
● This turns out to be very difficult because:
○ The evaluation criteria is all-or-nothing: all 7 predictions
need to be correct
○ The baseline “last quoted” is very hard to beat.
■ Last quoted 53.269%
■ #3 (me) : 53.713% (+0.444%)
■ #1 solution 53.743% (+0.474%)
● Key challenges -- capture correlation, and not to lose to
baseline
22. Case 2 -- Allstate User Purchase Option Prediction
● Dependency -- Chained models
○ First build stand-alone model for F
○ Then model for G, given F
○ F => G => B => A => C => E => D
○ “Free models” first, “dependent” model later
○ In training time, use actual data
○ In prediction time, use most likely predicted value
● Not to lose to baseline -- 2 stage models
○ One model to predict which one to use: chained prediction,
or baseline