This document discusses predicting matches in speed dating using machine learning models. It summarizes the steps taken, which include:
1) Cleaning and exploring the speed dating dataset to engineer relevant features like combined interests and age differences.
2) Initial modeling with random forest, logistic regression, SVC, and gradient boosting classifiers, which barely beat the 83.38% no-match baseline.
3) Feature selection using random forest importances, which improved results by focusing on the combined interest features.
4) Parameter tuning of the random forest and SVC classifiers, which further optimized performance, with the best model achieving over 95% test accuracy.
3. Introduction
The Problem: The dating process is inefficient, and dates are unsuccessful far too often.
The Solution: If we can successfully predict the likelihood that two people will be a match for each other, then we can improve the success rate of dates.
How We Do It: We analyze the Speed Dating Experiment dataset from Kaggle.com to find out what makes two people a match for each other. Then, we create a machine learning model that can predict match likelihood.
4. Data Cleaning
The data set started out with 195 columns and 8,378 rows. We reduced this to 38 columns and 8,038 rows. We also reformatted some of the variables. The actions taken and the reasons why are described in this section.
5. Data Cleaning: Filtering Columns
Remove fields with significant missing data - Any field that was missing 10% or more of its values was removed (a sketch of this filter follows the list). This step removed 108 fields.
Remove repetitive fields - Some fields were the same questions being asked over and over again. This step removed 6 fields.
Remove varied fields - Some fields were simply too varied to give any insight. This step removed 6 fields.
Remove after-date information - Some fields contained information gathered after the date, and we wanted to keep our analysis to information gathered pre-date. This step removed 16 fields.
Remove irrelevant fields - Fields where nothing of importance could be found were removed. This step removed 11 fields.
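A minimal pandas sketch of the 10% missing-value filter described above. The file name, encoding, and DataFrame name are assumptions; the deck does not show its code.

```python
import pandas as pd

# Load the Kaggle Speed Dating Experiment data (file name is an assumption).
df = pd.read_csv("Speed Dating Data.csv", encoding="latin-1")

# Keep only columns where fewer than 10% of the values are missing.
keep = df.columns[df.isna().mean() < 0.10]
df = df[keep]
```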
6. Data Cleaning: Filtering Rows
Remove dates with null values - We had to remove rows with null values because we could not make educated guesses as to what their values were. This step removed 263 rows.
Remove dates with a 55-year-old person - This person was an outlier in terms of age. This step removed 6 rows.
Remove dates whose partner was never a primary - Every date consisted of two people, a “primary” and a “partner”. Later, we engineer variables that contain information from both daters, so we had to remove anyone who was never a primary. This step removed 71 rows.
7. Data Cleaning: Reformat Variables
Fix “Date” and “Go Out” Variables - The rating scale for these variables was 1 through 7, with 1 being “very often” and 7 being “never”. We found these variables easier to interpret with the scale reversed, so we subtracted each value from 8 (sketched below).
Fix Race Variable Errors - There was an error where some participants were recorded as the wrong race. We corrected these by inputting the race that the person was listed as most frequently.
Change “Other” Race from 6 to 5 - Race was listed as an integer. Since there was no race with the integer 5, we changed “Other” from 6 to 5.
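A sketch of the scale reversal and race recode, assuming pandas and the Kaggle codebook's column names (date, go_out, race); the deck's actual code is not shown.

```python
# Reverse the 1-7 frequency scales so that higher values mean "more often".
for col in ["date", "go_out"]:  # column names assumed from the Kaggle codebook
    df[col] = 8 - df[col]

# Recode the "Other" race from 6 to 5 so the integer codes run 1-5.
df["race"] = df["race"].replace(6, 5)
```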
8. Data Exploration
Our dataset consists of 538 people, who went on a combined total of 4,019 dates.
All dates were of heterosexual nature, with one male and one female.
Our variable of interest is match - a date is a match if both daters say yes to wanting to see their date again.
The average match rate for all dates was 16.62%.
9. Data Exploration: Gender
Males were more accepting than females:
Males said yes to females 47.87% of the time.
Females said yes to males 36.63% of the time.
11. Data Exploration: Age
Daters preferred to date those who were closer to their own age
We saw a decrease in match rate as the age gap between daters increased
12. Data Exploration: Race
The race distribution was not balanced:
Over half were Caucasian/European.
About one fourth were Asian/Pacific Islander/Asian-American.
1- African American
2- Caucasian/European
3- Latino/Hispanic
4- Asian/Pacific Islander/Asian-American
5- Other
13. Data Exploration: Same Race
We looked at match rates where both daters were the same race.
We saw that African Americans had the highest match rate increase. However, there were only 8 dates where both daters were African American, so our sample size was too small; we would need a more racially diverse dataset to confirm this trend.
1- African American
2- Caucasian/European
3- Latino/Hispanic
4- Asian/Pacific Islander/Asian-American
5- Other
14. Data Exploration: Interests
Participants were asked to rate their level of interest in a variety of activities such as sports, movies, art, etc.
To get a look at each interest’s relationship with match rate, we performed a correlation test (sketched below).
Those who rated clubbing and yoga highly had a higher match rate (on average).
Those who rated movies and TV highly had a lower match rate (on average).
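A hedged sketch of that correlation test; the interest column names follow the Kaggle codebook but are assumptions here, and only a subset is listed for brevity.

```python
# Correlate each interest rating with the binary match outcome.
interests = ["sports", "exercise", "dining", "art", "hiking",
             "clubbing", "tv", "movies", "concerts", "yoga"]  # assumed names
corr = df[interests + ["match"]].corr()["match"].drop("match")
print(corr.sort_values(ascending=False))
```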
15. Data Exploration: Interests
We engineered a new variable that combines both participants’ interest ratings.
We found that for some of the variables, match rate was higher when the combined interest variable was high.
This makes sense if we reason that people with shared interests are more likely to be a match.
Here is an example with our combined interest variable for clubbing, labeled clubbing_com.
16. Data Exploration: Desires and Preferences
Desires were ratings of attributes in response to the question “What do you look for in a date?”
Preferences were ratings of a diverse, mixed bag of questions.
We performed correlation tests to see the relationship between each variable and match.
Those who desired their partner to be fun had a higher match rate (on average).
Those who preferred going out and dating also had a higher match rate (on average).
*See Appendix 1 for the full list of the variables’ definitions.
17. Modeling
For modeling, we go through the following steps
Recap all of our engineered features
Pre-process the data
Model the data
Perform feature selection and re-model the data
Tune our models’ parameters
18. Modeling: Feature Engineering
The following is a list of the steps taken for feature engineering (a sketch follows the list):
Create interaction terms (primary rating × partner rating) for all variables in the interests, desires, and preferences categories
Create age difference variable (male age – female age)
Create age group variable that separates age into bins of [18-24, 25-30, 31-42]
Create combined age variable (primary age + partner age)
Create category variable that contains both the male’s and female’s race
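A minimal sketch of these steps, assuming pandas and the Kaggle convention that a partner's value carries an "_o" suffix; the per-gender age columns are hypothetical helpers, not columns the deck names.

```python
import pandas as pd

# Interaction terms: primary rating * partner rating ("_o" = partner's value).
for col in ["clubbing", "yoga", "exercise", "shopping"]:  # subset for brevity
    df[col + "_com"] = df[col] * df[col + "_o"]

# Age difference (male age minus female age); these per-gender columns are
# hypothetical helpers derived beforehand from gender and age.
df["age_diff"] = df["age_male"] - df["age_female"]

# Combined age and the age-group bins used in the deck.
df["age_com"] = df["age"] + df["age_o"]
df["age_group"] = pd.cut(df["age"], bins=[18, 24, 30, 42],
                         labels=["18-24", "25-30", "31-42"],
                         include_lowest=True)
```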
19. Modeling: Data Pre-Processing
We split the dataset into two parts, X and y:
X - all variables except for dec, dec_o, and match
y - match
We then split each of these into training and testing sets. Our training set uses 75% of the data and our test set the remaining 25%.
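A sketch of that 75/25 split with scikit-learn; the random seed is an assumption.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["dec", "dec_o", "match"])
y = df["match"]

# 75% train / 25% test, as described above; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```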
20. Modeling: First Try
We use four different supervised learning algorithms to model our data (a sketch follows the table).
Since the overall match rate is 16.62%, a model that predicts “no match” for every date would be 83.38% accurate. This means we want our models to strive for higher than this.
Our first run is a disappointment, as even our highest test accuracy score barely beats 83.38%.

Model                 Train Acc   Test Acc   Precision   Recall   ROC AUC   PR AUC
Random Forest           98.47%     84.03%      74.51%    10.98%    75.32%   47.60%
Logistic Regression     83.69%     82.74%      40.00%     0.58%    63.44%   29.22%
SVC                    100.00%     82.79%       0.00%     0.00%    93.54%   88.45%
Gradient Boosting       85.95%     84.13%      93.55%     8.38%    77.17%   47.58%
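A sketch of this first modeling pass with scikit-learn defaults, reusing the split above; anything beyond the default hyperparameters is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, average_precision_score)

models = {
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(probability=True),  # probability=True enables predict_proba
    "Gradient Boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          f"train={model.score(X_train, y_train):.2%}",
          f"test={accuracy_score(y_test, pred):.2%}",
          f"precision={precision_score(y_test, pred, zero_division=0):.2%}",
          f"recall={recall_score(y_test, pred):.2%}",
          f"roc_auc={roc_auc_score(y_test, proba):.2%}",
          f"pr_auc={average_precision_score(y_test, proba):.2%}")
```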
21. Modeling: Feature Selection
Using the feature importance attribute in our Random Forest model, we take a look to see which features are performing well (see the sketch after the table).
We see that many of the top-performing features are the combined interest variables that we created.
We will reduce the number of features in X to only the combined interest variables and re-model the data.

Feature        Importance
clubbing_com     0.035848
yoga_com         0.027607
date_com         0.026963
exercise_com     0.025950
shopping_com     0.025670
concerts_com     0.025075
hiking_com       0.025022
sinc1_com        0.024160
pid              0.023971
exphappy_com     0.023931
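A sketch of the importance ranking and the column reduction, continuing from the models fit above; the "_com" suffix is the deck's own naming convention.

```python
import pandas as pd

# Rank features by the fitted Random Forest's importance attribute.
rf = models["Random Forest"]
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))

# Keep only the combined interest features for the second modeling pass.
com_cols = [c for c in X_train.columns if c.endswith("_com")]
X_train, X_test = X_train[com_cols], X_test[com_cols]
```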
22. Modeling- Second Try
We see tremendous improvement with the Random Forest Classifier and the SVC.
The next slide shows the AUC for the ROC and PR curves.

Model                 Train Acc   Test Acc   Precision   Recall   ROC AUC   PR AUC
Random Forest           99.45%     94.08%     100.00%    63.38%    93.15%   87.06%
Logistic Regression     83.23%     83.83%       0.00%     0.00%    58.40%   22.17%
SVC                    100.00%     95.42%     100.00%    71.69%    96.67%   92.23%
Gradient Boosting       85.50%     84.88%      86.21%     7.69%    69.65%   38.54%
24. Modeling- RF Parameter Tuning
We move forward with only the Random Forest and SVC, and tune the parameters using GridSearchCV (a sketch follows the parameter lists).
We’ll start with Random Forest. Below are the ranges of values we tried for each parameter, followed by the optimal parameters.
Ranges Attempted
n_estimators – [10, 25, 50, 75, 100]
min_samples_leaf – [1, 25, 50, 75, 100]
max_features – [.1, .25, .5, .75]
Optimal
n_estimators – 100
min_samples_leaf – 1
max_features – .1
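A sketch of the Random Forest grid search; the grid matches the ranges above, while the cross-validation fold count is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 25, 50, 75, 100],
    "min_samples_leaf": [1, 25, 50, 75, 100],
    "max_features": [0.1, 0.25, 0.5, 0.75],
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)  # cv assumed
grid.fit(X_train, y_train)
print(grid.best_params_)  # the deck reports 100 / 1 / 0.1 as optimal
```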
25. Modeling- RF Parameter Tuning
We re-model the Random Forest with the tuned parameters.
We see a strong improvement, especially in recall.
The RF model now looks to be about as accurate as the SVC.

Model             Train Acc   Test Acc   Precision   Recall   ROC AUC   PR AUC
RF (default)        99.45%     94.08%     100.00%    63.38%    93.15%   87.06%
RF (new params)    100.00%     95.52%     100.00%    73.13%    96.15%   90.30%
Improvement          0.55%      1.44%       0.00%     9.75%     3.00%    3.24%
26. Modeling- SVC Parameter Tuning
The following is the parameter tuning for the SVC (sketched below).
The gamma parameter proved to be irrelevant, so we removed it.
The optimal parameters end up being the same as the SVC’s default parameters.
We don’t need to re-model, since we know our previous model was already optimal.
Ranges Attempted
C – [.0001, .001, .01, .1, 1]
kernel – ['linear', 'rbf', 'sigmoid', 'poly']
gamma – [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.2, 0.3, 0.4, 0.5]
Optimal
C – 1
kernel – 'rbf'
gamma – irrelevant
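The same grid-search pattern sketched for the SVC (gamma omitted, as noted above); the cross-validation fold count is again an assumption.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.0001, 0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
}
grid = GridSearchCV(SVC(), param_grid, cv=5)  # cv assumed
grid.fit(X_train, y_train)
print(grid.best_params_)  # the deck reports C=1, kernel='rbf'
```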
27. Conclusions
The most important features in predicting a match were the combined interest variables.
We exceeded our goal for our models, with our highest accuracy score reaching 95.52%.
The best performing models were the Random Forest Classifier with tuned parameters and the SVC with default parameters.
Both models performed well in test accuracy and AUC of the PR curve, so we are comfortable using either one.