NYC Open Data 2015: Advanced scikit-learn (expanded) by Vivian S. Zhang
Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners.
This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning.
Apart from metrics for model evaluation, we will cover how to evaluate model complexity, how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
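In scikit-learn these topics map to Pipeline, GridSearchCV, and RandomizedSearchCV. As a library-free illustration of the grid-versus-random trade-off the abstract mentions, here is a sketch over a made-up validation objective (the score function and parameter ranges are invented for illustration only):

```python
import itertools
import random

# Hypothetical objective: pretend validation accuracy is a smooth function of
# two hyperparameters (think C and gamma of an SVM), peaking at (1.0, 0.1).
def validation_score(c, gamma):
    return -((c - 1.0) ** 2 + (gamma - 0.1) ** 2)

# Exhaustive grid search: tries every combination, so cost grows
# multiplicatively with the number of parameters.
c_grid = [0.01, 0.1, 1.0, 10.0]
gamma_grid = [0.001, 0.01, 0.1, 1.0]
grid_best = max(itertools.product(c_grid, gamma_grid),
                key=lambda p: validation_score(*p))

# Randomized search: a fixed budget of samples from the parameter space,
# which often finds comparable optima far more cheaply in high dimensions.
random.seed(0)
random_best = max(((random.uniform(0.01, 10), random.uniform(0.001, 1))
                   for _ in range(16)),
                  key=lambda p: validation_score(*p))

print(grid_best)  # best (C, gamma) found on the grid
```

The same budget arithmetic is why randomized search scales better: 16 random draws cover a 5-parameter space as cheaply as a 2-parameter one, while a full grid does not.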
---------------------------------------------------------
Andreas is an Assistant Research Scientist at the NYU Center for Data Science, building a group to work on open source software for data science. Previously he worked as a Machine Learning Scientist at Amazon, working on computer vision and forecasting problems. He is one of the core developers of the scikit-learn machine learning library, and maintained it for several years.
Material will be posted here:
https://github.com/amueller/pydata-nyc-advanced-sklearn
Blog:
peekaboo-vision.blogspot.com
Twitter:
https://twitter.com/t3kcit
Tong is a data scientist at SupStat Inc and a master's student in Data Mining. He has been an active R programmer and developer for five years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com today.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
The caret package is a unified interface to a large number of predictive model functions in R.
First created in 2005, the home for the source code and documentation has changed several times.
In this talk, we will outline the somewhat unique aspects of the package and how it impacts the development environment (including documentation and testing). Friction points with CRAN and their resolution will also be discussed.
The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations.
An example from classification of music genres is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.
Bio: Max is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been applying models in the molecular diagnostic and pharmaceutical industries for over 15 years. He is the author of several R packages including the caret package that provides a simple and consistent interface to over 100 predictive models available in R.
Max has taught courses on modeling within Pfizer and externally. Recently, he taught modeling classes for the American Chemical Society, the Indian Ministry of Information Technology, and Predictive Analytics World. He is a co-author of the forthcoming Springer book "Applied Predictive Modeling".
Our fall 12-Week Data Science Bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meetup and learn how easily you can use R for advanced machine learning. In this meetup, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will join us remotely via Google Hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at SupStat Inc and a master's student in Data Mining. He has been an active R programmer and developer for five years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com today.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
Gradient Boosted Regression Trees in scikit-learn by DataRobot
Slides of the talk "Gradient Boosted Regression Trees in scikit-learn" by Peter Prettenhofer and Gilles Louppe held at PyData London 2014.
Abstract:
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning, and model interpretation that should significantly improve your score on Kaggle.
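As a rough, library-free illustration of the intuition behind GBRT (not the talk's own material), a boosted ensemble of depth-1 "stumps" repeatedly fits the residuals of its previous predictions; scikit-learn's GradientBoostingRegressor is the production implementation:

```python
# Minimal gradient boosting for squared error on a 1-D feature, built from
# depth-1 decision stumps. Illustrative sketch only.

def fit_stump(x, residuals):
    """Find the threshold split minimizing squared error of the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gbrt_fit(x, y, n_rounds=50, learning_rate=0.1):
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # each round models what the ensemble so far still gets wrong
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(learning_rate * s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.2]
model = gbrt_fit(x, y)
```

The learning rate and number of rounds are exactly the regularization knobs the talk discusses: a smaller rate shrinks each stump's contribution and needs more rounds, trading compute for generalization.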
Lab 2: Classification and Regression Prediction Models, training and testing ... by Yao Yao
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms:
Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance, Feature ranking with recursive feature elimination, Two dimensional Linear Discriminant Analysis
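To illustrate just the KNN step from the list above, a bare-bones nearest-neighbors classifier on toy 2-D points might look like this (scikit-learn's KDTree-backed KNeighborsClassifier does the same thing efficiently at scale; the data here is invented):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label); returns majority label of k nearest."""
    # brute-force distance sort; a KD tree replaces this O(n log n) step
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.5, 0.5)))  # "a"
print(knn_predict(train, (5.5, 5.5)))  # "b"
```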
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
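The streaming model described above can be sketched in a few lines. On a real cluster the mapper and reducer would be separate Python scripts reading stdin and writing tab-separated key/value lines to stdout, with Hadoop performing the sort (shuffle) between them; here the two stages are plain functions and the shuffle is a sort:

```python
from itertools import groupby

def mapper(lines):
    # emit (word, 1) for every word, like a streaming mapper writing stdout
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # pairs arrive sorted by key, as after Hadoop's shuffle phase
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["to be or not to be"]
shuffled = sorted(mapper(lines))  # stands in for Hadoop's sort/shuffle
print(dict(reducer(shuffled)))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The same two functions, wrapped as `mapper.py` and `reducer.py`, run unchanged under `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...`.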
NYC Data Science Academy is excited to welcome Sam Kamin, who will be presenting an Introduction to Hadoop for Python Programmers as well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses, and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor, and worked at Google until taking his current position as VP of Data Engineering in NYC Data Science Academy.
--------------------------------------
Comparison Study of Decision Tree Ensembles for Regression by Seonho Park
Nowadays, decision tree ensemble methods are widely used for solving classification and regression problems due to their rigor and robustness. Compared with classification, performance on regression problems has not yet been addressed in the same detail. In this presentation, we review the state-of-the-art decision tree ensemble methodology in scikit-learn and XGBoost for regression. Empirical study results are also presented to compare their performance and computational efficiency.
This is the slide deck from my talk at the FULokoja Ingressive meetup.
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are considered best-in-class right now. XGBoost offers one of the best combinations of prediction performance and processing time among such algorithms.
Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak
Would your decisions change if you knew that the audience of your website isn't 5M users, but rather 5,042,394,953? Unlikely. So why should we always calculate the exact solution at any cost? An approximate solution to this and many similar problems takes only a fraction of the memory and runtime of computing the exact answer.
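One classic structure in this family is a Bloom filter: a probabilistic set that may return false positives but never false negatives, in a fraction of the memory of an exact set. A minimal sketch (the sizes and hash count are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, n_hashes=3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int doubles as a growable bit array

    def _positions(self, item):
        # derive n_hashes bit positions from salted SHA-256 digests
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # all bits set -> "probably in"; any bit clear -> "definitely not in"
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
print("alice" in bf)    # True
print("mallory" in bf)  # almost certainly False (tiny false-positive chance)
```

Counting distinct users at the 5-billion scale uses the same trade: structures like HyperLogLog give an estimate within a few percent using kilobytes instead of terabytes.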
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre... by Yao Yao
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms:
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation of support vectors, Density Graph
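The core of the SGD classifier step above is a per-sample gradient update. A hand-rolled version for logistic loss on invented 1-D data (scikit-learn's SGDClassifier generalizes this with regularization, learning-rate schedules, and more losses):

```python
import math
import random

def sgd_logistic(data, epochs=200, lr=0.1):
    """Fit w, b for p(y=1|x) = sigmoid(w*x + b) by stochastic gradient descent."""
    w, b = 0.0, 0.0
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(data)          # visit samples in random order each epoch
        for x, y in data:          # y in {0, 1}
            p = 1 / (1 + math.exp(-(w * x + b)))
            # gradient of the log loss for a single sample
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
w, b = sgd_logistic(data)
predict = lambda x: 1 if w * x + b > 0 else 0
print(predict(-1.2), predict(1.2))  # 0 1
```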
As part of our team's enrollment in the Data Science Super Specialization course at UpX Academy, we submitted several projects for our final assessments; one of them was a Telecom Churn Analysis model.
The input data was provided by UpX Academy, and the language we used is R. The main objectives of the project were:
- Predict customer churn.
- Highlight the main variables/factors influencing customer churn.
- Use various ML algorithms to build prediction models, and evaluate their accuracy and performance.
- Find the best model for our business case and provide an executive summary.
To address this business problem, we followed a thorough approach, starting with a detailed exploratory data analysis consisting of various box plots, bar plots, etc.
We then built as many classification models as fit our business case (logistic regression, kNN, decision trees, random forest, SVM) and also tried a Cox proportional hazards survival model. For every model, we then tried to boost performance by applying various tuning techniques.
As we are all still learning these concepts, please feel free to provide feedback on our work. Any suggestions are most welcome... :)
Thanks!!
Machine learning key to your formulation challengesMarc Borowczak
You develop pharmaceutical, cosmetic, food, industrial, or civil engineered products, and are often confronted with the challenge of blending and formulating to meet process or performance properties. While traditional research and development does approach the problem with experimentation, it generally involves design, time, and resource constraints, and can be slow, expensive, and often redundant, quickly forgotten, or obsolete.
Consider the alternative that machine learning tools offer today. We will show that this approach is not only quick and efficient but arguably how the front end of innovation should proceed, and that it is particularly well suited to formulation and classification.
Today, we will explain how machine learning can shed new light on this generic and very persistent formulation challenge. We will discuss the related topics of classification and clustering, often associated with these formulation challenges, in a forthcoming communication.
Scaling AutoML-Driven Anomaly Detection With Luminaire by Databricks
Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major problems before they become pervasive. This is a difficult machine learning and systems problem because temporal patterns are complex, ever changing, and often very noisy, traditionally requiring significant manual configuration and model maintenance.
At Zillow, we have built an orchestration framework around Luminaire, our open-source Python library for hands-off time-series anomaly detection. Luminaire provides a suite of models and built-in AutoML capabilities, which we process with Spark for distributed training and scoring of thousands of metrics. In this talk, we will cover the architecture of this framework and the performance of the Luminaire package in terms of detection and prediction accuracy as well as runtime efficiency.
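Luminaire's models are far richer than this, but the core idea of flagging points that break with recent history can be sketched as a rolling z-score check (the window and threshold values below are arbitrary, and the metric series is invented):

```python
import statistics

def detect_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value sits more than `threshold` standard
    deviations from the mean of the trailing `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1e-9  # guard a flat window
        if abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

metric = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 10, 50, 11, 10]
print(detect_anomalies(metric))  # [11] -- the spike to 50
```

Production systems layer model selection, seasonality handling, and retraining on top of this; the AutoML part is choosing those layers per metric without manual configuration.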
This is a prison management system that shows all prisoner details along with warden details; the records can be updated or modified by the user within the program.
Agile Data Science 2.0 covers the theory and practice of applying agile methods to the practice of applied analytics research called data science. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" to enable agility in building applications describing the applied research process itself. Then we show how to use 'big data' tools to iteratively build, deploy and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to build applications from conception through development through deployment and then through iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, we show you how to effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, ElasticSearch and Apache Airflow.
In this talk about performance for the 2018-01-25 Dallas Oracle Users Group meeting, Cary Millsap talks about application design, ways to control Oracle sql_trace, measurement intrusion, and function interposition.
[Tutorial] building machine learning models for predictive maintenance applic... by PAPIs.io
This talk introduces the landscape and challenges of predictive maintenance applications in the industry, illustrates how to formulate (data labeling and feature engineering) the problem with three machine learning models (regression, binary classification, multi-class classification) using a publicly available aircraft engine run-to-failure data set, and showcases how the models can be conveniently trained and compared with different algorithms in Azure ML.
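The data-labeling step described above can be sketched for a single engine's run-to-failure history. The window sizes and class codes here are hypothetical illustrations, not the tutorial's actual values:

```python
def label_engine(cycles, w0=5, w1=10):
    """cycles: observed cycle numbers for one engine, the last one = failure.
    Produces all three target types from the talk:
      regression  -> remaining useful life (RUL)
      binary      -> does the engine fail within w0 cycles?
      multi-class -> 0 healthy / 1 failure approaching / 2 imminent failure
    """
    max_cycle = max(cycles)
    labels = []
    for c in cycles:
        rul = max_cycle - c                 # regression target
        binary = 1 if rul <= w0 else 0      # binary classification target
        if rul <= w0:
            multiclass = 2
        elif rul <= w1:
            multiclass = 1
        else:
            multiclass = 0
        labels.append((c, rul, binary, multiclass))
    return labels

labels = label_engine(list(range(1, 16)))  # a 15-cycle run to failure
print(labels[0])   # (1, 14, 0, 0)
print(labels[-1])  # (15, 0, 1, 2)
```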
Running Intelligent Applications inside a Database: Deep Learning with Python... by Miguel González-Fierro
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must get the data from the database to execute a routine. If the size of the data is big, there are inefficiencies due to the data movement. Stored procedures tried to solve this issue in the past, allowing simple functions to be computed inside the database. However, only simple routines can be executed.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft’s Cognitive Toolkit, also known as CNTK. We used transfer learning between ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a pretrained Convolutional Neural Network on ImageNet is used on the lung cancer dataset to generate features. Once the features are computed, a boosted tree is applied to predict whether the patient has cancer or not.
All of this is computed inside the database, so data movement is minimized. We are even able to execute the algorithm using the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than 1h, in contrast to a CPU, which would take up to 32h. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction for a patient.
Target Leakage in Machine Learning (ODSC East 2020) by Yuriy Guts
Target leakage is one of the most difficult problems in developing real-world machine learning models. Leakage occurs when the training data gets contaminated with information that will not be known at prediction time. Additionally, there can be multiple sources of leakage, from data collection and feature engineering to partitioning and model validation. As a result, even experienced data scientists can inadvertently introduce leaks and become overly optimistic about the performance of the models they deploy. In this talk, we will look through real-life examples of data leakage at different stages of the data science project lifecycle, and discuss various countermeasures and best practices for model validation.
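A minimal, self-contained illustration of preprocessing leakage (invented for this summary, not one of the talk's own examples): standardizing a feature with statistics computed on the full data set quietly lets information about the test rows into training, while the safe version fits the scaler on the training split only:

```python
import random

random.seed(1)
data = [random.gauss(0, 1) for _ in range(100)]
train, test = data[:80], data[80:]

def standardize(values, mean, sd):
    return [(v - mean) / sd for v in values]

# Leaky: mean/sd are computed over ALL rows, test rows included
full_mean = sum(data) / len(data)
full_sd = (sum((v - full_mean) ** 2 for v in data) / len(data)) ** 0.5
leaky_train = standardize(train, full_mean, full_sd)

# Correct: mean/sd come from the training rows only, then are REUSED on test
train_mean = sum(train) / len(train)
train_sd = (sum((v - train_mean) ** 2 for v in train) / len(train)) ** 0.5
safe_train = standardize(train, train_mean, train_sd)
safe_test = standardize(test, train_mean, train_sd)

print(full_mean != train_mean)  # True: the two scalers genuinely differ
```

The same discipline applies to imputation, target encoding, and feature selection: every fitted transformation belongs inside the training fold, which is why pipeline abstractions that refit per fold are the standard countermeasure.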
This document lists the reasons why our past alumni chose NYC Data Science Academy over other programs.
The Machine Learning Bootcamp is our flagship program and is well received by our community.
This project was completed by Scott Dobbins and Rachel Kogan, who enrolled in the NYC Data Science Academy's 12-Week Data Science Bootcamp. Learn more about the program: http://nycdatascience.com/data-science-bootcamp/
Given that both Wikipedia and comments sections of most websites are freely open to anyone to edit at any time, how has Wikipedia managed to remain such a useful resource while most comments sections are ridden with vandalism, ads, and other counterproductive user behavior?
We believe the answer is two-fold: 1) Wikipedia has an army of bots that quickly identify and revert vandalism so that the worst edits are usually never seen by people and the site generally maintains itself in a well-kempt state, and 2) Wikipedia has a strong community of administrators and other contributors who routinely clean the site’s flagged contents.
Vandalism is relatively easy to flag, though a few clever edits manage to stay on the site for a long time. What about site content problems that are more subjective, like bias? Wikipedia users do routinely manually flag pages with point-of-view (POV) issues, though with millions of pages and no machine-based approaches, the site can only manage to confidently maintain neutrality on the more well-trafficked pages.
Here we propose a solution to solve some of the more intractable content issues for Wikipedia and other sites using Natural Language Processing (NLP) and machine learning approaches. The sheer quantity of data managed by Wikipedia and similar sites requires distributed computing approaches, so we show here how Apache Spark can upgrade common algorithms to run on massive data sets.
A Hybrid Recommender with Yelp Challenge Data by Vivian S. Zhang
Developed by Chao Shi, Sam O'Mullane, Sean Kickham, Reza Rad and Andrew Rubino
Watch the project presentation: https://youtu.be/gkKGnnBenyk
This project was completed by students from NYC Data Science Academy's 12-Week Bootcamp. Learn more about the bootcamp: http://nycdatascience.com/data-science-bootcamp/
People make decisions on where to eat based on friends’ recommendations. Since they know you, their suggestions matter more than those of strangers.
For the capstone project, we built a hybrid Yelp recommendation system that can provide individualized recommendations based on your friends' reviews on the social network. We built the machine learning models using Spark, and set up a Flask-Kafka-RDS-Databricks pipeline that allows a continuous stream of user requests.
During the presentation, we will talk about the development framework and technical implementation of the pipeline.
Read their project posts and code:
https://blog.nycdatascience.com/student-works/capstone/yelp-recommender-part-1/
https://blog.nycdatascience.com/student-works/yelp-recommender-part-2/
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Vivian S. Zhang
This project was completed by students who graduated from NYC Data Science Academy's 12-week Data Science Bootcamp. Learn more about the bootcamp: http://nycdatascience.com/data-science-bootcamp/
Watch the project presentation: https://youtu.be/W530d2ZdbJE
Ranked #15 out of 3,274 teams on Kaggle. Team members: Brandy Freitas, Chase Edge, and Grant Webb.
Given 4 years of housing price data in a foreign market, predicting the following year's prices should be pretty straightforward, right? But what if, in that last year of data, the country's stock market, the value of its currency, and the price of its number-one export all dropped by nearly 50%? And on top of all that, the country was slapped with economic sanctions by the EU and the US. This was Moscow in 2014, and as you can see, it was anything but straightforward.
We overcame these challenges and, in two weeks of working together, achieved a top-1% ranking on Kaggle. Our success is a product of in-depth data cleaning, feature engineering, and our approach to modeling. With a focus on interpretability and simplicity, we began modeling with linear regression and decision trees, which gave us a better understanding of the data. We then moved to more sophisticated models such as random forests and XGBoost, which ultimately produced our top submission.
Twitter: @NycDataSci
Learn with our NYC Data Science Program (weekend courses for working professionals, and a 12-week full-time program for those advancing their careers into Data Science).
Our next 12-Week Data Science Bootcamp starts in June. (The deadline to apply is May 1st; all decisions will be made by May 15th.)
====================================
Max Kuhn, Director of Nonclinical Statistics at Pfizer, is also the author of Applied Predictive Modeling.
He will join us and share his experience with Data Mining with R.
Max is a nonclinical statistician who has been applying predictive models in the diagnostic and pharmaceutical industries for over 15 years. He is the author and maintainer of a number of predictive modeling packages, including caret, C50, Cubist, and AppliedPredictiveModeling. He blogs about the practice of modeling at http://appliedpredictivemodeling.com/blog
---------------------------------------------------------
His Feb 18th course is open for RSVP at NYC Data Science Academy.
Syllabus
Predictive Modeling using R
Description
This class will get attendees up to speed in predictive modeling using the R programming language. The goal of the course is to understand the general predictive modeling process and how it can be implemented in R. A selection of important models (e.g. tree-based models, support vector machines) will be described in an intuitive manner to illustrate the process of training and evaluating models.
Prerequisites:
Attendees should have a working knowledge of basic R data structures (e.g. data frames, factors) and language fundamentals such as functions and subsetting data. Understanding the content of Appendix B, sections B.1 through B.8, of Applied Predictive Modeling (free PDF from the publisher [1]) should suffice.
Outline:
- An introduction to predictive modeling
- R and predictive modeling: the good and bad
- Illustrative example
- Measuring performance
- Data splitting and resampling
- Data pre-processing
- Classification trees
- Boosted trees
- Support vector machines
If time allows, the following topics will also be covered
- Parallel processing
- Comparing models
- Feature selection
- Common pitfalls
Materials:
Attendees will be provided with a copy of Applied Predictive Modeling[2] as well as course notes, code and raw data. Participants will be able to reproduce the examples described in the workshop.
Attendees should have a computer with a relatively recent version of R installed.
About the Instructor:
More about Max's work:
[1] http://rd.springer.com/content/pdf/bbm%3A978-1-4614-6849-3%2F1.pdf
[2] http://appliedpredictivemodeling.com
Winning data science competitions, presented by Owen Zhang Vivian S. Zhang
<featured> Meetup event hosted by NYC Open Data Meetup, NYC Data Science Academy. Speaker: Owen Zhang, Event Info: http://www.meetup.com/NYC-Open-Data/events/219370251/
Using Machine Learning to aid Journalism at the New York Times Vivian S. Zhang
This talk was presented to NYC Open Data Meetup Group on Nov 11, 2014.
Speaker:
Daeil Kim is currently a data scientist at the Times and is finishing his Ph.D. at Brown University on work related to developing scalable inference algorithms for Bayesian nonparametric models. His work at the Times spans a variety of problems related to the company's business interests and audience development, as well as developing tools to aid journalism.
Topic:
This talk focuses mostly on how machine learning can help with problems that crop up in journalism. We'll begin by talking about using popular supervised learning algorithms, such as regularized logistic regression, to assist a journalist's work in uncovering insights for a story on the recall of Takata airbags in cars. Afterwards, we'll look at using topic modeling to deal with large document dumps generated from FOIA (Freedom of Information Act) requests, and at Refinery, a simple web-based tool that eases the implementation of such tasks. Finally, if there is time, we will go over how topic models have been extended to the problem of designing an efficient recommendation engine for text-based content.
Hack session for NYTimes Dialect Map Visualization (developed with R Shiny) Vivian S. Zhang
Data Science Academy hack session on the NY Times Dialect Map, data science with R, by Vivian S. Zhang; see www.nycdatascience.com for more details. Joint work by the data scientist team of SupStat Inc., a New York-based data analytics and visualization consulting firm.
2. Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
2 of 65 6/13/14, 2:01 PM
4. Introduction to the Challenge
the mission
· build an auto-access model based on the historical data to determine the access privilege according to the employee's job role and the resource applied for
5. Introduction to the Challenge
the data
· The data consists of real historical data collected from 2010 & 2011.
· Employees are manually allowed or denied access to resources over time.
the files
· train.csv - the training set; each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval
· test.csv - the test set for which predictions should be made; each row asks whether an employee having the listed characteristics should have access to the listed resource
6. Introduction to the Challenge
the variables
COLUMN NAME DESCRIPTION
ACTION ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE An ID for each resource
MGR_ID The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1 Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2 Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME Company role department description (e.g. Retail)
ROLE_TITLE Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY Company role family description (e.g. Retail Manager)
ROLE_CODE Company role code; this code is unique to each role (e.g. Manager)
7. Introduction to the Challenge
the metric
· AUC (area under the ROC curve)
· a metric used to judge predictions in binary response (0/1) problems
· only sensitive to the order determined by the predictions, not their magnitudes
· R packages: verification or ROCR
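To make the order-sensitivity concrete, here is a base-R sketch on made-up data, using the Mann-Whitney rank formula for AUC: a monotone transform of the predictions changes their magnitudes but not the AUC.

```r
# AUC via the Mann-Whitney/rank formula: it depends only on how the
# predictions order the observations, not on their magnitudes.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y  <- c(0, 0, 1, 1)
p1 <- c(0.10, 0.40, 0.35, 0.80)
p2 <- p1^10            # same ordering, very different magnitudes
auc(y, p1)             # 0.75
auc(y, p2)             # 0.75 -- identical
```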
22. Look into the Data
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
23. Look into the Data
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
24. Look into the Data
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
25. Look into the Data
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
· 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
· 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leaderboard result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
26. Look into the Data
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
29. Model Building
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in a sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
30. Model Building
1. the raw features (as numerical)
31. Model Building
2. the raw features (as categorical) with level reduction
2.1 choose the top frequency categories
VAR_RAW FREQUENCY VAR_WITH_LEVEL_REDUCTION
a 3 a
a 3 a
a 3 a
b 2 b
b 2 b
c 1 other
d 1 other
# keep the two most frequent levels of each column; lump everything else into "other"
for (i in 1:ncol(x)) {
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}
32. Model Building
2. the raw features (as categorical) with level reduction
2.2 use Pearson's Chi-squared Test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
## mgr_770 mgr_not_770
## 0 5 1892
## 1 147 30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
33. Model Building
3. the dummies (in a sparse Matrix)
ID VAR VAR_A VAR_B VAR_C
1 a 1 0 0
2 a 1 0 0
3 a 1 0 0
4 b 0 1 0
5 c 0 0 1
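An encoding like the table above can be built directly as a sparse matrix with the Matrix package. A sketch on toy data (sparse.model.matrix's column naming differs from the slide's VAR_A style):

```r
library(Matrix)

VAR <- factor(c("a", "a", "a", "b", "c"))
# one indicator column per level; "-1" drops the intercept
m <- sparse.model.matrix(~ VAR - 1)
dim(m)        # 5 rows x 3 columns (VARa, VARb, VARc)
rowSums(m)    # each row has exactly one 1
```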
35. Model Building
4. the dummies including the interaction
ID M N MN_AP MN_AQ MN_BP MN_BQ
1 a p 1 0 0 0
2 a p 1 0 0 0
3 a q 0 1 0 0
4 b p 0 0 1 0
5 b q 0 0 0 1
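The same machinery produces interaction dummies. A sketch with the toy M/N columns from the table (the generated column names, not the slide's MN_* labels):

```r
library(Matrix)

M <- factor(c("a", "a", "a", "b", "b"))
N <- factor(c("p", "p", "q", "p", "q"))
# interaction-only dummies: one column per (M, N) combination
mm <- sparse.model.matrix(~ M:N - 1)
dim(mm)       # 5 rows x 4 columns
rowSums(mm)   # each row is 1 in exactly one combination column
```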
36. Model Building
5. some derived variables (count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion
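A minimal base-R sketch of such count/ratio features (toy columns standing in for fields like mgr_id and resource):

```r
# toy stand-ins for, e.g., mgr_id and resource
mgr <- c(10, 10, 10, 20, 30)
res <- c( 1,  1,  2,  1,  2)

cnt_mgr  <- ave(rep(1, length(mgr)), mgr, FUN = sum)                    # frequency of every category
cnt_pair <- ave(rep(1, length(mgr)), interaction(mgr, res), FUN = sum)  # frequency of the interaction
ratio    <- cnt_pair / cnt_mgr                                          # the proportion
cbind(mgr, res, cnt_mgr, cnt_pair, ratio)
```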
38. Model Building
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
40. Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold cv holdout predictions
· algorithms in level-1: Regularized Generalized Linear Model & Gradient Boosting Machine
· algorithms in level-2: Regularized Generalized Linear Model
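The 5-fold holdout scheme can be sketched as follows (synthetic data; plain glm stands in for the regularized GLM and GBM used in the actual solution):

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- as.integer(d$x1 + 0.5 * d$x2 + rnorm(n) > 0)

folds   <- sample(rep(1:5, length.out = n))
holdout <- numeric(n)
for (k in 1:5) {
  # level-1: fit on 4 folds, predict the held-out fold
  fit <- glm(y ~ x1 + x2, data = d[folds != k, ], family = binomial)
  holdout[folds == k] <- predict(fit, newdata = d[folds == k, ], type = "response")
}
# level-2: the holdout predictions become features for the stacker
stacker <- glm(y ~ holdout, data = d, family = binomial)
```

Because each holdout prediction comes from a model that never saw that row, the level-2 model is trained on honest out-of-fold scores rather than overfit in-sample fits.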
41. Model Building
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
42. Model Building
1. Regularized Generalized Linear Model
· logistic regression
set.seed(114)
x <- sort(rnorm(100))
# simulate a binary response whose probability of 1 increases with x
y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
       sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
m1 <- lm(y ~ x)                                    # linear fit, for contrast
m2 <- glm(y ~ x, family = binomial(link = logit))  # logistic regression
y2 <- predict(m2, type = "response")               # fitted probabilities on the training x
par(mar = c(5, 4, 0, 0))
plot(y ~ x); abline(m1, lwd = 3, col = 2)          # red: linear fit
points(x, y2, type = "l", lwd = 3, col = 3)        # green: logistic curve
43. Model Building
1. Regularized Generalized Linear Model
· logistic regression
· convex penalties
44. Model Building
1. Regularized Generalized Linear Model
· convex penalties
- L1 (lasso)
- L2 (ridge regression)
- mixture of L1 & L2 (elastic net)
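These penalties combine into one objective; in glmnet's parameterization, for logistic regression:

```latex
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] \;+\; \lambda\left[ \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1 \right]
```

Here alpha = 1 gives the lasso (L1), alpha = 0 gives ridge (L2), and 0 < alpha < 1 gives the elastic net.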
45. Model Building
1. Regularized Generalized Linear Model
· the dummies (in a sparse Matrix)
· the dummies including the interaction
· R package: glmnet
46. Model Building
2. Support Vector Machine (just for Diversity)
49. Model Building
2. Support Vector Machine (just for Diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R packages: kernlab, e1071
52. Model Building
3. Random Forest
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
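A hedged sketch of this step on synthetic stand-in data (the randomForest package is assumed installed; the real models used the numeric-coded and level-reduced feature sets):

```r
library(randomForest)

set.seed(1)
d <- data.frame(resource = sample(1:50, 300, replace = TRUE),
                mgr_id   = sample(1:30, 300, replace = TRUE))
d$action <- factor(rbinom(300, 1, 0.94))   # imbalanced 0/1 target, as in the challenge

rf <- randomForest(action ~ ., data = d, ntree = 200)
p  <- predict(rf, type = "prob")[, "1"]    # out-of-bag probability of approval
```

With no newdata argument, predict returns out-of-bag predictions, which are the honest analogue of holdout scores for a forest.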
54. Model Building
4. Gradient Boosting Machine
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
64. Summary
useful discussions
· Python code to achieve 0.90 AUC with Logistic Regression
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
· Starter code in python with scikit-learn (AUC .885)
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
· Patterns in Training data set
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set