Capstone Project - IS 6596
Project Supervisor:
Dr. Rohit Aggarwal
Project Contributors:
Mayank Badjatya - u1085897
Sagar Singh - u1088202
MARKETING ANALYTICS USING
R/PYTHON
Contents
Executive Summary
Book Description
Why Data Science?
Skill Sets Required for a Data Scientist
7 Steps to effective Predictive Modelling
Marketing Analysis
  Fraud Detection
  Market Segmentation
  Advertising
Lessons Learned
Next Steps
Executive Summary
The objective of this project is to discuss the importance of machine learning in different sectors and how
it solves problems in the marketing analytics field. We cover market segmentation, advertising, and fraud
detection. We used several machine learning algorithms, implemented with R and Python libraries, to model
these problems. After building the models and running test data through them, we got the following results:
• We trained Decision Tree and Random Forest classifiers that predict, with 73% accuracy, whether a
person will be a defaulter or not, based on credit history, income, job type, dependents, etc.
• We segmented social networking profiles based on the likes and dislikes of each person using
K-Means clustering.
• We built a predictive model on the messages a customer receives that determines whether a
message is spam or not, with an accuracy of 97%. We used a Naïve Bayes classifier for this model.
• We created several other models using different algorithms, but these are beyond the scope of
this report.
Book Description
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning,
an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging
from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of
the most important modeling and prediction techniques, along with relevant applications. Topics include
linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support
vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the
methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning
techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on
implementing the analyses and methods presented in R, an extremely popular open source statistical
software platform. An Introduction to Statistical Learning covers many of the same topics as The Elements
of Statistical Learning, but at a level
accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike
who wish to use innovative statistical learning techniques to analyze their data. The text assumes only a
previous course in linear regression and no knowledge of matrix algebra.
Machine Learning with R: This book is intended for anybody hoping to use data for action. Perhaps you
already know a bit about machine learning, but have never used R; or perhaps you know a little about R,
but are new to machine learning. In any case, this book will get you up and running quickly. It would be
helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is
required. All you need is curiosity.
Machine learning, at its core, is concerned with the algorithms that transform information into actionable
intelligence. This fact makes machine learning well-suited to the present-day era of big data. Without
machine learning, it would be nearly impossible to keep up with the massive stream of information. Given
the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there
has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of
tools that can assist you with finding data insights. By combining hands-on case studies with the essential
theory that you need to understand how things work under the hood, this book provides all the knowledge
that you will need to start applying machine learning to your own projects.
Marketing Analytics Data Driven Techniques: This book helps tech-savvy marketers and data analysts
solve real-world business problems with Excel.
Using data-driven business analytics to understand customers and improve results is a great idea in
theory, but in today's busy offices, marketers and analysts need simple, low-cost ways to process and
make the most of all that data. This expert book offers the perfect solution. Written by data analysis expert
Wayne L. Winston, this practical resource shows you how to tap a simple and cost-effective tool, Microsoft
Excel, to solve specific business problems using powerful analytic techniques—and achieve optimum
results. Practical exercises in each chapter help you apply and reinforce techniques as you learn.
• Shows you how to perform sophisticated business analyses using the cost-effective and widely
available Microsoft Excel instead of expensive, proprietary analytical tools
• Reveals how to target and retain profitable customers and avoid high-risk customers
• Helps you forecast sales and improve response rates for marketing campaigns
• Explores how to optimize price points for products and services, optimize store layouts, and
improve online advertising
• Covers social media, viral marketing, and how to exploit both effectively.
Why Data Science?
Data science is a field that can be applied almost anywhere. Here is a list of fields outside IT that use
data science as a tool.
• Politics: We may have heard how statistical wizard Nate Silver predicted the electoral votes for
each state in the 2012 presidential election, showing that raw data crunching of polls is much
more reliable than traditional punditry.
• Healthcare: The role of big data in medicine is one where we can build better health profiles and
better predictive models around individual patients so that we can better diagnose and treat
disease. Big data comes into play in aggregating more and more information across multiple
scales for what constitutes a disease—from the DNA, proteins, and metabolites to cells, tissues,
organs, organisms, and ecosystems.
• Automotive Industry: Areas in the automotive industry impacted by Big Data include:
a. Conceptual Design: Real-world data collected from billions of miles driven will undoubtedly
influence safety, aerodynamics, power algorithms and other fundamental elements of the vehicle.
b. Drawing Boards: Efficiency gained in design, production volumes and manufacturing through
Big Data in the auto industry will make it economically feasible to make today’s options
tomorrow’s standard equipment.
c. Procurement: Supply chain management optimized by Big Data will help manufacturers
continue to wring new efficiency from the procurement process.
d. Manufacturing: On the assembly line, data gathered throughout the building process will be
used in predictive analytics to improve manufacturing simulations and watch machine
performance, making the next assembly line even more efficient and flexible.
• Marketing: Big Data is already having a major influence on vehicle marketing. Social sentiment
will play a growing role in manufacturers’ plans to design new vehicles. Customer feedback on
current models also helps marketing experts identify key themes and messages for new
campaigns.
• Finance: Understanding consumer habits, preferences and buying power across market segments
gives manufacturers insights needed to develop more-effective financing programs. But that’s just
the first step. New insights from Big Data analyses of sales and in-field use data will help captive
financing companies develop new services and new revenue streams.
• Services: Like performance, service will benefit as both a contributor and a user of Big Data in the
automotive industry. Information gathered through millions of service events will provide
feedback to designers.
Skill Sets Required for a Data Scientist
Technical Skills:
Python Coding – Python is the most common coding language we typically see required in data science roles,
along with Java, Perl, or C/C++.
Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having
experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3
can also be beneficial.
SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science,
it is still expected that a candidate will be able to write and execute complex queries in SQL.
Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is
from social media, video feeds or audio.
Non-Technical Skills
Intellectual curiosity – No doubt we have seen this phrase everywhere lately, especially as it relates to
data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest
blog posted a few months ago.
Business acumen – To be a data scientist, we need a solid understanding of the industry we are working
in, and we need to know what business problems our company is trying to solve. In terms of data science, being able
to discern which problems are important to solve for the business is critical, in addition to identifying new
ways the business should be leveraging its data.
Communication skills – Companies searching for a strong data scientist are looking for someone who can
clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or
Sales departments. A data scientist must enable the business to make decisions by arming them with
quantified insights, in addition to understanding the needs of their non-technical colleagues to wrangle
the data appropriately.
7 Steps to effective Predictive Modelling
Step 1: Defining the Objective
The first step in any modeling process is defining the objective. We identify which field the problem falls
into. There are many such fields, including Target Marketing, Risk & Fraud Management, Strategy
Implementation and Change Management, Operational Efficiency, Increase Customer Experience, Manage
Marketing Campaigns, Forecast Revenue or Loss, Workforce Management, Financial Modeling, Churn
Management, and Social Media Influencers.
Step 2: Gathering the Data
Accurate, actionable, accessible data is the lifeblood of any successful model. So we collect enough data
to make a predictive model on it.
Step 3: Preparing the Data for Modeling
The average modeler spends 70% of his or her time preparing data. In this step we prepare the data in
the right format for the analysis and for the tool we want to use.
1. Do initial cleaning up
2. Define Variables and Create Data Dictionary
3. Joining/Appending multiple datasets
4. Validate for correctness
5. Produce Basic Summary Reports
Step 4: Selecting and Transforming the Variables
Determining the best fit is essential to good model performance. The underlying structure of the
independent variables in relation to the dependent variable, determines the power and longevity of a
model.
Special consideration is given to the fact that marketing data can have hundreds or even thousands of
variables. We apply methods for identifying the best candidate variables. Programs are introduced that
automatically segment and transform the most powerful variables, to ensure the best fit.
Step 5: Processing and Evaluating the Model
All the preparation work up to this point makes this next step run smoothly. Weights of Evidence and
Information Values are calculated. For our main case study, we used various options within PROC LOGISTIC
to determine the model with the best fit. Validation data are scored, tabulated, and compared using both
SAS® and MS Excel®.
Step 6: Validating the Model
Models should perform well on the development data. Plus, if the hold-out sample is randomly selected,
the model performance should score the validation data with similar results. A true test of model
performance is how well it performs on data from a different time or market area. So, we used three
powerful methods for ensuring model fit. 1) Scoring alternate data is the best way to tell if our model will
perform in a real campaign; 2) Bootstrapping uses simple resampling techniques to find confidence
intervals around our estimates; 3) Key Variable Analysis calculates important market factors as they are
affected by the model, thus ensuring reasonable results.
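As an illustration of the bootstrapping idea in Step 6, here is a minimal sketch of a percentile bootstrap confidence interval in plain Python. The function name `bootstrap_ci`, the number of resamples, and the example outcomes are assumptions made for illustration; they are not taken from the project.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-record outcomes (1 = correct prediction, 0 = wrong).
outcomes = [1] * 73 + [0] * 27
low, high = bootstrap_ci(outcomes)
print(f"accuracy = {statistics.mean(outcomes):.2f}, 95% CI from {low:.2f} to {high:.2f}")
```

The interval gives a rough sense of how much an estimate such as accuracy could move if the validation sample were drawn again.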
Step 7: Implementing and Maintaining the Model
Effective implementation is a combination of business intelligence and well-designed procedures. So, we
score a new data set with the new model. Several auditing procedures are carried out, and tracking and
model maintenance are emphasized as best practices.
Figure 1: 7 Steps of Predictive Modelling
Marketing Analysis
Figure 2 : Facets of Marketing Analysis
An accurate customer risk assessment will help us acquire the most profitable consumers while
minimizing risk. For business-to-consumer companies, Experian offers consumer credit information,
advanced scoring software, prescreening systems, and application decisioning tools. For companies
looking to acquire business customers, our business reports and public records, portfolio data and risk
modeling tools allow clients to create comprehensive profiles of business prospects. Determine which
businesses are well-capitalized and financially suited for customer acquisition.
Fraud Detection
Fraud is a billion-dollar business, and it is increasing every year. The PwC Global Economic Crime Survey of
2016 reports that more than one in three organizations (36%) experienced economic crime.
Traditional methods of data analysis have long been used to detect fraud. They require complex and
time-consuming investigations that draw on different domains of knowledge such as finance, economics,
business practices, and law.
To learn how machine learning algorithms address the fraud detection problem, we used the credit data set
from the book “Machine Learning with R”.
The idea behind our credit model is to identify factors that make an applicant at higher risk of default.
Therefore, we need to obtain data on many past bank loans and whether the loan went into default, as
well as information about the applicant.
We can see that “job”, “phone”, “checking_balance”, “credit_history”, “purpose”, “savings_balance”,
“employment_duration”, “other_credit”, and “housing” are categorical, so in Python we use one-hot
encoding (OneHotEncoder) to convert these columns into 0s and 1s. After applying the encoder to all
categorical columns, we got 36 columns. The credit dataset includes 1,000 examples of loans, plus a
combination of numeric and nominal features indicating characteristics of the loan and the loan applicant.
A class variable indicates whether the loan went into default.
Figure 3 Conversion of categorical data into 0s and 1s
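To make the conversion concrete, the following is a small pure-Python sketch of one-hot encoding. It is not the code used in the project (which relied on a library encoder); the helper `one_hot_encode` and the sample records are hypothetical, with column names borrowed from the credit data.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per observed level."""
    # Collect the sorted set of levels seen in each categorical column.
    levels = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in categorical_columns}
        for col in categorical_columns:
            for level in levels[col]:
                new_row[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(new_row)
    return encoded

# Hypothetical loan records; the values are made up for illustration.
loans = [
    {"amount": 1200, "housing": "own", "phone": "yes"},
    {"amount": 5400, "housing": "rent", "phone": "no"},
    {"amount": 800, "housing": "own", "phone": "no"},
]
encoded_loans = one_hot_encode(loans, ["housing", "phone"])
for row in encoded_loans:
    print(row)
```

Each categorical column expands into as many 0/1 columns as it has levels, which is why the encoded credit data ends up with more columns than the original.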
We did the initial data exploration and plotted that using matplotlib library.
Figure 4 Exploratory Data Analysis
We used a decision tree to determine whether a person is a defaulter or not, depending on the features.
The core algorithm for building decision trees is called ID3. Decision tree classifiers use a greedy
approach: the attribute chosen at one step is not reconsidered, even if using it at a later step would have
given a better classification. Decision trees also tend to overfit the training data, which can give poor
results on unseen data. ID3 uses two concepts to decide which feature to split the dataset on.
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding attribute that returns the highest information gain (i.e.,
the most homogeneous branches).
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that
contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero, and if the sample
is equally divided between two classes, the entropy is one.
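The two quantities can be sketched in a few lines of Python. This is an illustrative implementation using base-2 logarithms; the attribute names and the four example applicants are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Decrease in entropy after splitting the rows on `attribute`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Entropy of each branch, weighted by the share of rows in that branch.
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical applicants: credit history separates the classes, phone does not.
applicants = [
    {"credit_history": "good", "phone": "yes"},
    {"credit_history": "good", "phone": "no"},
    {"credit_history": "poor", "phone": "yes"},
    {"credit_history": "poor", "phone": "no"},
]
defaulted = ["no", "no", "yes", "yes"]
for attribute in ("credit_history", "phone"):
    print(attribute, information_gain(applicants, defaulted, attribute))
```

A tree builder of this kind would split on the attribute with the larger gain, here `credit_history`.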
After applying the Decision tree model, we got the following classification report.
Figure 5 F1 Score for Decision Tree
The F1 score is a measure of a test's accuracy. It is the harmonic mean of precision and recall,
reaching its best value at 1 (perfect precision and recall) and its worst at 0.
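As a worked example of the definition, the sketch below computes precision, recall, and F1 for one class from lists of actual and predicted labels. The labels and predictions are invented for illustration and are not the project's results.

```python
def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1 for the class named `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical loan outcomes and model predictions.
actual = ["default", "default", "default", "paid", "paid", "paid", "paid", "paid"]
predicted = ["default", "default", "paid", "default", "paid", "paid", "paid", "paid"]
precision, recall, f1 = precision_recall_f1(actual, predicted, positive="default")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```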
A single decision tree tends to overfit, which gives it high variance; to overcome this drawback we use bagging.
Bagging is a way to decrease the variance of our prediction by generating additional data for training from
our original dataset using combinations with repetitions to produce multisets of the same cardinality/size
as our original data.
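The bagging idea can be sketched as follows: draw bootstrap samples of the same size as the data, train one model per sample, and combine the models by majority vote. The base learner here is a toy nearest-neighbour rule chosen only to keep the example short (it is an assumption, not the decision tree used in the project), and the data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (a multiset of the same size)."""
    return [rng.choice(data) for _ in data]

def train_nearest_neighbour(sample):
    """Toy base learner: predict the label of the closest training point."""
    def predict(x):
        return min(sample, key=lambda item: abs(item[0] - x))[1]
    return predict

def bagged_predict(models, x):
    """Majority vote over the predictions of all models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
# Hypothetical (income in $1000s, outcome) pairs.
data = [(20, "default"), (25, "default"), (30, "default"),
        (60, "paid"), (70, "paid"), (80, "paid")]
models = [train_nearest_neighbour(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(models, 22), bagged_predict(models, 75))
```

Individual models may disagree because each sees a different sample, but the vote averages those differences out.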
Random Forests is an ensemble classifier which uses many decision tree models to predict the result. A
different subset of training data is selected, with replacement to train each tree. A collection of trees is a
forest, and the trees are being trained on subsets which are being selected at random, hence random
forests. After applying Random Forest classifier, we got the following result.
Figure 6 F1 Score for Random Forest
We can clearly see the increase in the F1-score.
The next step in building the model, as discussed earlier, is to fine-tune it. For this we use the grid
search cross-validation technique. After applying GridSearchCV, we got the following classification
report.
Figure 7 F1 Score after GridSearchCV
From this report we see that the tuned model correctly predicts whether a person will be a defaulter or
not about 73% of the time.
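To show what grid search with cross-validation does under the hood, here is a plain-Python sketch. It does not reproduce GridSearchCV or the project's random forest: the model is a toy k-nearest-neighbour rule, and the parameter name `k_neighbours`, the fold scheme, and the data are all assumptions for illustration.

```python
from collections import Counter
from itertools import product

def knn_predict(train, x, k_neighbours):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k_neighbours]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, n_folds, **params):
    """Mean accuracy over n_folds train/validation splits."""
    scores = []
    for fold in range(n_folds):
        validation = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, **params) == y for x, y in validation)
        scores.append(correct / len(validation))
    return sum(scores) / n_folds

def grid_search(data, grid, n_folds=3):
    """Score every parameter combination; return the best one and its score."""
    best_params, best_score = None, -1.0
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(data, n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical (feature, label) pairs with a little label noise.
data = [(1, "low"), (2, "low"), (3, "low"), (4, "low"), (5, "high"), (6, "low"),
        (7, "high"), (8, "high"), (9, "high"), (10, "high"), (11, "high"), (12, "low")]
best_params, best_score = grid_search(data, {"k_neighbours": [1, 3, 5]})
print(best_params, round(best_score, 2))
```

Every candidate setting is scored only on data held out from its training folds, so the chosen setting is less likely to be one that merely memorised the training set.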
Market Segmentation
One of the most fundamental marketing activities is market segmentation. As companies cannot
connect with all their potential customers, they must divide markets into groups (segments) of consumers,
customers, or clients with similar needs and wants. Firms can then target each of these segments by
positioning themselves in a unique segment (such as Ferrari in the high-end sports car market).
While market researchers often form market segments based on practical grounds, industry practice, and
wisdom, cluster analysis allows segments to be formed that are based on data that are less dependent on
subjectivity.
Cluster analysis is a convenient method for identifying homogeneous
groups of objects called clusters. Objects (or cases, observations) in a
specific cluster share many characteristics, but are very dissimilar to
objects not belonging to that cluster.
Below we work through this process from start to finish.
For this analysis, we used a dataset representing a random sample of 30,000 U.S. high school students
who had profiles on a well-known SNS in 2006. To protect the users' anonymity, the SNS will remain
unnamed. However, at the time the data was collected, the SNS was a popular web destination for US
teenagers. Therefore, it is reasonable to assume that the profiles represent a wide cross section of
American adolescents in 2006.
Let's take a quick look at the specifics of the data.
Figure 8 Description of the data set
Figure 9 Min-Max of the Age
Figure 10 Gender and Age anomaly
There is something strange around the gender row. On looking carefully, we noticed the NA value. We
see that 2,724 records (9 percent) have missing gender data.
Besides gender, only age has missing values. A total of 5,086 records (17 percent) have missing ages. Also
concerning is the fact that the minimum and maximum values seem to be unreasonable; it is unlikely that
a 3-year-old or a 106-year-old is attending high school. To ensure that these extreme values don't cause
problems for the analysis, we cleaned them up before moving on.
Figure 11 Box Plot for the age distribution
A more reasonable range of ages for the high school students includes those who are at least 13 years old
and not yet 20 years old. Any age value falling outside this range we treated the same as missing data.
An easy solution for handling missing values is to exclude any record with a missing value, but that
would discard many otherwise usable records. Instead, for gender we created dummy variables for female
and unknown gender. We assigned teens$female the value 1 if gender is equal to F and is not NA;
otherwise, it is assigned the value 0.
Next, we dealt with the 5,523 missing age values (the originally missing ages plus the out-of-range ages
recoded as missing). Here we used a different strategy known as data imputation, which involves filling in
the missing data with a guess as to the true value. Most people in a graduation cohort were born within a
single calendar year, so once we had identified the typical age for each cohort, we had a reasonable
estimate of the age of a student in that graduation year.
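The two cleaning steps above can be sketched in plain Python. This is only an illustration of the logic: the field names `gradyear` and `no_gender`, the use of the cohort mean as the "typical age", and the sample records are assumptions, not the project's actual code.

```python
from statistics import mean

def add_gender_dummies(record):
    """Add 0/1 indicators for female and for unknown gender."""
    gender = record["gender"]
    record["female"] = 1 if gender == "F" else 0
    record["no_gender"] = 1 if gender is None else 0
    return record

def impute_age_by_gradyear(records, low=13, high=20):
    """Recode implausible ages as missing, then fill gaps with the cohort mean."""
    for r in records:
        if r["age"] is not None and not (low <= r["age"] < high):
            r["age"] = None
    # Mean of the known ages in each graduation cohort.
    known = {}
    for r in records:
        if r["age"] is not None:
            known.setdefault(r["gradyear"], []).append(r["age"])
    cohort_mean = {year: mean(ages) for year, ages in known.items()}
    for r in records:
        if r["age"] is None:
            r["age"] = cohort_mean.get(r["gradyear"])
    return records

# Hypothetical profiles; None stands for a missing (NA) value.
teens = [
    {"gradyear": 2006, "gender": "F", "age": 18.5},
    {"gradyear": 2006, "gender": None, "age": None},
    {"gradyear": 2006, "gender": "M", "age": 19.5},
    {"gradyear": 2009, "gender": "F", "age": 106.0},
    {"gradyear": 2009, "gender": "M", "age": 15.0},
]
teens = impute_age_by_gradyear([add_gender_dummies(t) for t in teens])
for t in teens:
    print(t)
```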
To cluster the teenagers into marketing segments, we used an implementation of k-means clustering. We
started our cluster analysis by considering only the 36 features that represent the number of times various
interests appeared on the teen SNS profiles.
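For readers unfamiliar with the algorithm, here is a compact sketch of k-means in plain Python: assign each point to its nearest centre, move each centre to the mean of its points, and repeat. It is not the library implementation used in the project; the two-feature interest counts and the fixed iteration count are assumptions for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the centre closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(n_iter):
        # Assignment step: group points by their nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical profiles: (sports mentions, music mentions).
profiles = [(5, 0), (6, 1), (4, 0), (0, 5), (1, 6), (0, 4)]
centers, labels = kmeans(profiles, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary; what matters is which profiles end up grouped together.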
Evaluating clustering results can be somewhat subjective. Ultimately, the success or failure of the model
hinges on whether the clusters are useful for their intended purpose. As the goal of this analysis was to
identify clusters of teenagers with similar interests for marketing purposes, we largely measured our
success in qualitative terms. For other clustering applications, more quantitative measures of success may
be needed. By examining whether the clusters fall above or below the mean level for each interest
category, we can notice patterns that distinguish the clusters from each other. Cluster 3 is substantially
above the mean interest level on all the sports. This suggests that this may be a group of Athletes per The
Breakfast Club stereotype.
Figure 12 Cluster segmentation
Cluster 0 includes the most mentions of "cheerleading," the word "hot," and is above the average level of
football interest. Hence, these are the so-called Princesses. Similarly, we tried to cluster the different
groups, and this is what we found.
We now focused our effort on turning these insights into action. We applied the clusters back onto the
full dataset.
We looked at the demographic characteristics of the clusters. The mean age does not vary much by
cluster, which is not too surprising as these teen identities are often determined before high school. On
the other hand, there are some substantial differences in the proportion of females by cluster. This is a
very interesting finding as we didn't use gender data to create the clusters, yet the clusters are still
predictive of gender. Given our success in predicting gender, we also suspected that the clusters are
predictive of the number of friends the users have. This hypothesis seems to be supported by the data.
Cluster summary:
• Cluster 0 (N = 872), Princess: cute, hair, shopping, clothes, dance
• Cluster 1 (N = 21308), Basket Cases: ???
• Cluster 2 (N = 1041), Criminals: drunk, deaths, drugs, die, music
• Cluster 3 (N = 5971), Athletes: basketball, soccer, football, volleyball, soccer
• Cluster 4 (N = 808), Brains: band, marching, music, rock
Our findings support the popular adage that "birds of a feather flock together." By using machine learning
methods to cluster teenagers with others who have similar interests, we were able to develop a typology
of teen identities that was predictive of personal characteristics, such as gender and the number of
friends. These same methods can be applied to other contexts with similar results.
Advertising
Compared to all the marketing techniques, email marketing is the cheapest way of sending a marketing
message to millions of people. Being so cheap, it is the tool of choice for marketing teams with a small
budget trying to sell cheap products. Most of the time, such products do not deliver what they promise.
Unfortunately, with email marketing, we run the risk of being exposed to malware and fraudulent emails.
Worms and viruses often make use of email and spam techniques to propagate. Phishing emails and
Nigerian 419 scams are examples of fraudulent emails which try to harvest either our money or our
personal information including credit card details. So, while email marketing is the tool of choice for most
marketing teams, it does require stringent regulations to ensure that it does not get abused. Below we
tried to build a model which predicts whether a composed message is spam or not.
The dataset included the text of SMS messages along with a label indicating whether the message is
unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham. Since Naive
Bayes has been used successfully for e-mail spam filtering, it seems likely that it could also be applied to
SMS spam. However, relative to e-mail spam, SMS spam poses additional challenges for automated filters.
SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify
whether a message is junk.
Figure 13 Description of the data set
The first step towards constructing our classifier involves processing the raw data for analysis. SMS
messages are strings of text composed of words, spaces, numbers, and punctuation. Handling this type of
complex data takes a lot of thought and effort. One needs to consider how to remove numbers and
punctuation; handle uninteresting words such as and, but, and or; and how to break apart sentences into
individual words.
Figure 14 Description of length of the Ham messages
Figure 15 Description of length of the Spam messages
Our first order of business was to standardize the messages to use only lowercase characters. To this end,
we used the tolower() function, which returns a lowercase version of text strings. Continuing with our
cleanup process, we also eliminated any punctuation from the text messages. Our next task was to remove
filler words such as to, and, but, and or from our SMS messages. These terms are known as stop words and
are typically removed prior to text mining because, although they appear very frequently, they do not
provide much useful information for machine learning.
Another common standardization for text data involves reducing words to their root form in a process
called stemming. The stemming process takes words like learned, learning, and learns, and strips the suffix
to transform them into the base form, learn. After punctuation and stop words are removed, the messages
are left with the blank spaces that previously separated the now-missing pieces. The final step in our text
cleanup process was to remove this additional whitespace.
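The cleanup steps described above can be sketched as one small Python function. This is a simplified stand-in, not the project's pipeline: the stop-word list is a tiny illustrative subset, and `naive_stem` is a crude suffix stripper rather than a real stemming algorithm.

```python
import string

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"to", "and", "but", "or", "a", "the", "is", "you"}
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_message(text):
    text = text.lower()
    # Drop punctuation and digits.
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # split() also discards the extra whitespace left by the removals.
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(naive_stem(w) for w in words)

cleaned = clean_message("WINNER!! You learned  to claim 2 prizes, and learns learning")
print(cleaned)
```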
A word cloud is a way to visually depict the frequency at which words appear in text data. The cloud is
composed of words scattered somewhat randomly around the figure. The resulting word clouds are
shown in the following diagram:
Figure 16 Spam Word cloud
Figure 17 Ham Word cloud
Now that the data are processed to our liking, the next step is to split the messages into individual
components through a process called vectorization. We took the corpus and created a data structure in
which rows indicate documents (SMS messages) and columns indicate terms (words). The final step in the
data preparation process was to transform the sparse matrix into a data structure that can be used to
train a Naive Bayes classifier. The sparse matrix included over 6,500 features, one for every word that
appears in at least one SMS message. It's unlikely that all of these are useful for classification. To
reduce the number of features, we eliminated any word that appears in fewer than five SMS messages, or in
less than about 0.1 percent of the records in the training data.
Figure 18 Vectorization
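A document-term matrix with a minimum-frequency filter can be sketched as follows. The function names and the toy corpus are hypothetical, and the threshold is lowered from five to two so that the tiny example keeps some words.

```python
from collections import Counter

def build_vocabulary(messages, min_messages=5):
    """Keep the words that appear in at least `min_messages` messages."""
    doc_freq = Counter()
    for message in messages:
        doc_freq.update(set(message.split()))  # count each word once per message
    return sorted(w for w, n in doc_freq.items() if n >= min_messages)

def document_term_matrix(messages, vocabulary):
    """Rows are messages, columns are vocabulary words, cells are counts."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for message in messages:
        row = [0] * len(vocabulary)
        for word in message.split():
            if word in index:
                row[index[word]] += 1
        matrix.append(row)
    return matrix

# Hypothetical cleaned messages.
corpus = ["free prize call now", "call me later", "free entry call now", "see you later"]
vocabulary = build_vocabulary(corpus, min_messages=2)
dtm = document_term_matrix(corpus, vocabulary)
print(vocabulary)
for row in dtm:
    print(row)
```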
To evaluate the SMS classifier, we need to test its predictions on unseen messages in the test data. The
process of evaluating machine learning algorithms is very similar to the process of evaluating students.
Since algorithms have varying strengths and weaknesses, tests should distinguish among the learners.
Figure 19 Classification report
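For context on the classifier being evaluated, here is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is not the library model used in the project; the function names, the smoothing value, and the four training messages are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(messages, labels):
    """Count classes and per-class word occurrences."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for message, label in zip(messages, labels):
        word_counts[label].update(message.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def predict(model, message, laplace=1):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        denominator = sum(word_counts[label].values()) + laplace * len(vocabulary)
        score = log(n / total)  # log prior
        for word in message.split():
            if word in vocabulary:  # words never seen in training are ignored
                score += log((word_counts[label][word] + laplace) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical cleaned training messages.
train_messages = ["free prize call now", "win free cash now",
                  "call me later", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_messages, train_labels)
print(predict(model, "free cash prize"), predict(model, "see you later"))
```

Working in log space avoids multiplying many small probabilities together, and the Laplace term keeps a word that was never seen in one class from forcing that class's score to minus infinity.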
A confusion matrix is a table that categorizes predictions according to whether they match the actual
value. One of the table's dimensions indicates the possible categories of predicted values, while the other
dimension indicates the same for actual values. Although we have only seen 2 x 2 confusion matrices so
far, a matrix can be created for models that predict any number of class values.
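A confusion matrix is straightforward to build by counting (actual, predicted) pairs. The sketch below works for any number of classes; the labels and predictions are made up for illustration and are not the project's results.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

def accuracy(matrix):
    """Share of predictions on the diagonal (actual == predicted)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical test labels and predictions.
actual = ["ham", "ham", "ham", "spam", "spam", "ham"]
predicted = ["ham", "ham", "spam", "spam", "ham", "ham"]
labels, matrix = confusion_matrix(actual, predicted)
print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(f"accuracy = {accuracy(matrix):.2f}")
```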
Lessons Learned
Lesson 1: Marketing research is fun. We get to work with a wide variety of datasets, dive in and learn all
about the market they're operating in, and relay valuable insights back to stakeholders. We dig up
everything from why consumers make certain purchase decisions to what they’re passionate about and what
makes them tick.
Lesson 2: Collaboration is key. While doing this project we found that, however strong individuals might
be as innovators, collaboration is very important.
Lesson 3: Check, re-check and then check again. Projects move quickly, which means we don’t have time
to go back and re-collect data or make corrections to a report. Questionnaires, surveys, and reports must
be checked, checked by a coworker, and checked again.
Next Steps
The next step would be to explore other facets of Marketing Analysis, such as “Upsell and Cross Sell” and
“Recommendation System”. We can use algorithms like Principal Component Analysis (PCA), QDA, and LDA
to reduce the number of features. We can also analyze time series data using the ARIMA algorithm.

Marketing Analytics using R/Python

  • 1. Capstone Project - IS 6596 Project Supervisor: Dr. Rohit Aggarwal Project Contributors: Mayank Badjatya - u1085897 Sagar Singh - u1088202 MARKETING ANALYTICS USING R/PYTHON
  • 2. 1 Capstone Project – IS 6596 Contents Executive Summary.......................................................................................................................................2 Book Description...........................................................................................................................................3 Why Data Science?........................................................................................................................................5 Skill sets required for a Data Science............................................................................................................6 7 Steps to effective Predictive Modelling.....................................................................................................7 Marketing Analysis........................................................................................................................................9 Fraud Detection ......................................................................................................................................10 Market Segmentation.............................................................................................................................13 Advertising..............................................................................................................................................16 Lessons Learned..........................................................................................................................................19 Next Steps...................................................................................................................................................19
  • 3. 2 Capstone Project – IS 6596 Executive Summary The objective of this project is to discuss the importance of Machine Learning in different sectors and how does it solve the problems in the Marketing Analytics field. We have discussed Marketing Segmentation, Advertisement, and Fraud detection in our project. We used different Machine Learning algorithms and used R and Python library to predict and solve these problems. After making models and running test data on those models we got following results: • We trained a Decision tree and Random Forest classifier model which has 73% accuracy to predict whether a person will be a defaulter or not based on credit history, income, job type, dependents etc. • We segmented the Social networking profiles based on the likes and dislikes of a person using K- Means Clustering. • We made a predictive model on the messages a customer receives and determined whether a message will be a Spam or not a spam with an accuracy of 97%. We used Naïve Bayes classifier for this model. • We created several other models using different algorithms, but these are beyond the scope of this report.
  • 4. 3 Capstone Project – IS 6596 Book Description An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use innovative statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra. Machine Learning with R: This book is intended for anybody hoping to use data for action. Perhaps you already know a bit about machine learning, but have never used R; or perhaps you know a little about R, but are new to machine learning. In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. All you need is curiosity. Machine learning, at its core, is concerned with the algorithms that transform information into actionable intelligence. 
This fact makes machine learning well-suited to the present-day era of big data. Without machine learning, it would be nearly impossible to keep up with the massive stream of information. Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of tools that can assist you with finding data insights. By combining hands-on case studies with the essential
  • 5. 4 Capstone Project – IS 6596 theory that you need to understand how things work under the hood, this book provides all the knowledge that you will need to start applying machine learning to your own projects. Marketing Analytics Data Driven Techniques: This book helps tech-savvy marketers and data analysts solve real-world business problems with Excel. Using data-driven business analytics to understand customers and improve results is a great idea in theory, but in today's busy offices, marketers and analysts need simple, low-cost ways to process and make the most of all that data. This expert book offers the perfect solution. Written by data analysis expert Wayne L. Winston, this practical resource shows you how to tap a simple and cost-effective tool, Microsoft Excel, to solve specific business problems using powerful analytic techniques—and achieve optimum results. Practical exercises in each chapter helped us to apply and reinforce techniques as you learn. Shows you how to perform sophisticated business analyses using the cost-effective and widely available Microsoft Excel instead of expensive, proprietary analytical tools • Reveals how to target and retain profitable customers and avoid high-risk customers • Helps you forecast sales and improve response rates for marketing campaigns • Explores how to optimize price points for products and services, optimize store layouts, and improve online advertising • Covers social media, viral marketing, and how to exploit both effectively.
  • 6. 5 Capstone Project – IS 6596 Why Data Science? Data Science is a field, which can be implemented anywhere. Here is the list of people who uses data science as a tool in their field and are not from IT background. • Politics: We may have heard how statistical wizard Nate Silver predicted the electoral votes for each state in the 2012 presidential election, showing that raw data crunching of polls is much more reliable than traditional punditry. • Healthcare: The role of big data in medicine is one where we can build better health profiles and better predictive models around individual patients so that we can better diagnose and treat disease. Big data comes into play around aggregating increasingly information around multiple scales for what constitutes a disease—from the DNA, proteins, and metabolites to cells, tissues, organs, organisms, and ecosystems. • Automotive Industry: Areas in the automotive industry impacted by Big Data include: a. Conceptual Design: Real-world data collected from billions of miles driven will undoubtedly influence safety, aerodynamics, power algorithms and other fundamental elements of the vehicle. b. Drawing Boards: Efficiency gained in design, production volumes and manufacturing through Big Data in the auto industry will make it economically feasible to make today’s options tomorrow’s standard equipment. c. Procurement: Supply chain management optimized by Big Data will help manufacturers continue to wring new efficiency from the procurement process. d. Manufacturing: On the assembly line, data gathered throughout the building process will be used in predictive analytics to improve manufacturing simulations and watch machine performance, making the next assembly line even more efficient and flexible. • Marketing: Big Data is already having a major influence on vehicle marketing. Social sentiment will play a growing role in manufacturers’ plans to design new vehicles. 
Customer feedback on current models also helps marketing experts identify key themes and messages for new campaigns. • Finance: Understanding consumer habits, preferences and buying power across market segments gives manufacturers insights needed to develop more-effective financing programs. But that’s just the first step. New insights from Big Data analyses of sales and in-field use data will help captive financing companies develop new services and new revenue streams. • Services: Like performance, service will benefit as both a contributor and a user of Big Data in the automotive industry. Information gathered through millions of service events will provide feedback to designers.
  • 7. 6 Capstone Project – IS 6596 Skill sets required for a Data Science Technical Skills: Python Coding – Python is the most common coding language I typically see required in data science roles, along with Java, Perl, or C/C++. Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can also be beneficial. SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science, it is still expected that a candidate will be able to write and execute complex queries in SQL. Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is from social media, video feeds or audio. Non-Technical Skills Intellectual curiosity – No doubt we have seen this phrase everywhere lately, especially as it relates to data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest blog posted a few months ago. Business acumen – To be a data scientist we’ll need a solid understanding of the industry we’re working in, and know what business problems your company is trying to solve. In terms of data science, being able to discern which problems are important to solve for the business is critical, in addition to identifying new ways the business should be leveraging its data. Communication skills – Companies searching for a strong data scientist are looking for someone who can clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales departments. A data scientist must enable the business to make decisions by arming them with quantified insights, in addition to understanding the needs of their non-technical colleagues to wrangle the data appropriately.
  • 8. 7 Steps to Effective Predictive Modelling
Step 1: Defining the Objective
The first step in any modeling process is defining the objective, which starts with identifying the field the problem falls in. There are many such fields: Target Marketing, Risk & Fraud Management, Strategy Implementation and Change Management, Operational Efficiency, Increase Customer Experience, Manage Marketing Campaigns, Forecast Revenue or Loss, Workforce Management, Financial Modeling, Churn Management, and Social Media Influencers.
Step 2: Gathering the Data
Accurate, actionable, accessible data is the lifeblood of any successful model, so we collect enough data to build a predictive model on it.
Step 3: Preparing the Data for Modeling
The average modeler spends 70% of his or her time preparing data. In this step we put the data into the right format for the analysis and for the tool we want to use:
1. Do initial cleanup
2. Define variables and create a data dictionary
3. Join/append multiple datasets
4. Validate for correctness
5. Produce basic summary reports
Step 4: Selecting and Transforming the Variables
Determining the best fit is essential to good model performance. The underlying structure of the independent variables in relation to the dependent variable determines the power and longevity of a model. Special consideration is given to the fact that marketing data can have hundreds or even thousands of variables. We apply methods for identifying the best candidate variables, and use programs that automatically segment and transform the most powerful variables to ensure the best fit.
Step 5: Processing and Evaluating the Model
All the preparation work up to this point makes this next step run smoothly. Weights of Evidence and Information Values are calculated. For our main case study, we used various options within PROC LOGISTIC to determine the model with the best fit.
Validation data are scored, tabulated, and compared using both SAS® and MS Excel®.
Step 6: Validating the Model
Models should perform well on the development data. In addition, if the hold-out sample is randomly selected, the model should score the validation data with similar results. A true test of model performance is how well it performs on data from a different time or market area. So, we used three methods for ensuring model fit: 1) Scoring alternate data is the best way to tell if our model will perform in a real campaign; 2) Bootstrapping uses simple resampling techniques to find confidence intervals around our estimates; 3) Key Variable Analysis calculates important market factors as they are affected by the model, thus ensuring reasonable results.
  • 9. Step 7: Implementing and Maintaining the Model
Effective implementation is a combination of business intelligence and well-designed procedures. So, we score a new data set with the new model. Several auditing procedures are carried out, and tracking and model maintenance are emphasized as best practices.
Figure 1: 7 Steps of Predictive Model
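The bootstrapping idea from Step 6 can be illustrated with a minimal Python sketch. It is not the SAS procedure used in the case study: the function name `bootstrap_ci`, the percentile method, and the response flags are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical response flags (1 = responded) from a validation sample.
responses = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
low, high = bootstrap_ci(responses)
print(f"observed rate={statistics.mean(responses):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

A wide interval on a small validation sample is a signal that the estimate should not be trusted for campaign planning without more data.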
  • 10. Marketing Analysis
Figure 2: Facets of Marketing Analysis
An accurate customer risk assessment helps us acquire the most profitable consumers while minimizing risk. For business-to-consumer companies, Experian offers consumer credit information, advanced scoring software, prescreening systems, and application decisioning tools. For companies looking to acquire business customers, its business reports and public records, portfolio data and risk modeling tools allow clients to create comprehensive profiles of business prospects and to determine which businesses are well-capitalized and financially suited for customer acquisition.
  • 11. Fraud Detection
Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of 2016 suggests that more than one in three organizations (36%) experienced economic crime. Traditional methods of data analysis have long been used to detect fraud. They require complex and time-consuming investigations that deal with different domains of knowledge such as finance, economics, business practices and law.
To learn how machine learning algorithms address the fraud detection problem, we took the credit dataset from “Machine Learning using R”. The idea behind our credit model is to identify factors that put an applicant at higher risk of default. Therefore, we need data on many past bank loans and whether each loan went into default, as well as information about the applicant. The credit dataset includes 1,000 examples of loans, plus a combination of numeric and nominal features indicating characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default.
The columns “job”, “phone”, “checking_balance”, “credit_history”, “purpose”, “savings_balance”, “employment_duration”, “other_credit” and “housing” are categorical, so in Python we use OneHotEncoder to convert them into 0s and 1s. After applying it to all the categorical columns we got 36 columns.
Figure 3: Conversion of categorical data into 0s and 1s
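To make the conversion of categorical data into 0s and 1s concrete, here is a minimal sketch in plain Python. It only illustrates the idea and is not the encoder used in the project; the function `one_hot_encode`, the sample rows, and the level values ("own", "rent", "yes", "no") are hypothetical.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per observed level."""
    # Collect the sorted set of levels seen in each categorical column.
    levels = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        out = {k: v for k, v in row.items() if k not in categorical_columns}
        for col in categorical_columns:
            for level in levels[col]:
                out[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


# Hypothetical rows using two of the column names from the credit data.
loans = [
    {"amount": 1169, "housing": "own", "phone": "yes"},
    {"amount": 5951, "housing": "rent", "phone": "no"},
    {"amount": 2096, "housing": "own", "phone": "no"},
]
encoded = one_hot_encode(loans, ["housing", "phone"])
print(encoded[0])
```

Each categorical column expands into as many indicator columns as it has levels, which is why a handful of categorical features can grow into dozens of columns.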
  • 12. We did the initial data exploration and plotted the results using the matplotlib library.
Figure 4: Exploratory Data Analysis
We used a decision tree to determine whether a person is a defaulter or not, depending on the features. The core algorithm for building decision trees is called ID3. Decision tree classifiers use a greedy approach: an attribute chosen at an early step cannot be used again, even where it would give a better classification in a later step. A decision tree also tends to overfit the training data, which can give poor results on unseen data. ID3 uses two concepts to decide which feature to split the dataset on.
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between two classes the entropy is one.
After applying the decision tree model, we got the following classification report.
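The entropy and information gain definitions above can be sketched in a few lines of Python. This is an illustration of the two quantities only, not a full ID3 implementation; the function names and the four sample applicants are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical applicants: "housing" separates the classes, "phone" does not.
applicants = [
    {"housing": "own", "phone": "yes"},
    {"housing": "own", "phone": "no"},
    {"housing": "rent", "phone": "yes"},
    {"housing": "rent", "phone": "no"},
]
default = ["no", "no", "yes", "yes"]

for attribute in ("housing", "phone"):
    print(attribute, information_gain(applicants, default, attribute))
```

In this toy data a tree builder would split on "housing" first, because it yields perfectly homogeneous branches while "phone" leaves the classes evenly mixed.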
  • 13. Figure 5: F1 Score for Decision Tree
The F1 score is a measure of a test's accuracy. It is the harmonic mean of precision and recall, reaching its best value at 1 (perfect precision and recall) and its worst at 0.
A single decision tree is prone to overfitting (high variance), so to overcome this drawback we use Bagging. Bagging is a way to decrease the variance of our prediction by generating additional training data from our original dataset, using combinations with repetitions to produce multisets of the same cardinality/size as the original data. Random Forests is an ensemble classifier which uses many decision tree models to predict the result. A different subset of the training data is selected, with replacement, to train each tree. A collection of trees is a forest, and the trees are trained on subsets selected at random, hence random forests. After applying the Random Forest classifier, we got the following result.
Figure 6: F1 Score for Random Forest
We can clearly see the increase in the F1 score. The next step in building the model, as discussed earlier, is to fine-tune it. For this we use the Grid Search Cross Validation technique. After applying GridSearchCV we got the following classification report.
Figure 7: F1 Score after GridSearchCV
From this report we understand that the model correctly predicts whether a person will be a defaulter or not 73% of the time.
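As a small worked example of how the F1 score combines precision and recall, the sketch below computes all three from actual and predicted labels. The function name and the ten sample loans are hypothetical and are not taken from the classification reports in the figures.

```python
def precision_recall_f1(actual, predicted, positive="yes"):
    """Precision, recall and F1 for one class, computed from label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 true defaulters, 6 non-defaulters.
actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no"] + ["yes", "yes", "yes", "no", "no", "no"]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here precision is 0.50 and recall is 0.75, and the harmonic mean (0.60) sits closer to the weaker of the two, which is why F1 penalizes a model that does well on only one of them.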
  • 14. Market Segmentation
One of the most fundamental marketing activities is market segmentation. As companies cannot connect with all their potential customers, they must divide markets into groups (segments) of consumers, customers, or clients with similar needs and wants. Firms can then target each of these segments by positioning themselves in a unique segment (such as Ferrari in the high-end sports car market). While market researchers often form market segments based on practical grounds, industry practice and wisdom, cluster analysis allows segments to be formed from data, making them less dependent on subjectivity.
Cluster analysis is a convenient method for identifying homogeneous groups of objects called clusters. Objects (or cases, observations) in a specific cluster share many characteristics, but are very dissimilar to objects not belonging to that cluster. Below we walk through this process from start to finish.
For this analysis, we used a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known social networking service (SNS) in 2006. To protect the users' anonymity, the SNS will remain unnamed. However, at the time the data was collected, the SNS was a popular web destination for US teenagers. Therefore, it is reasonable to assume that the profiles represent a wide cross section of American adolescents in 2006. Let's take a quick look at the specifics of the data.
Figure 8: Description of the data set
  • 15. Figure 9: Min-Max of the Age
Figure 10: Gender and Age anomaly
There is something strange in the gender row: on looking carefully, we noticed the NA value. We see that 2,724 records (9 percent) have missing gender data. Besides gender, only age has missing values. A total of 5,086 records (17 percent) have missing ages. Also concerning is the fact that the minimum and maximum values seem to be unreasonable; it is unlikely that a 3-year-old or a 106-year-old is attending high school. To ensure that these extreme values don't cause problems for the analysis, we cleaned them up before moving on.
Figure 11: Box Plot for the age distribution
A more reasonable range of ages for high school students includes those who are at least 13 years old and not yet 20 years old. Any age value falling outside this range we treated the same as missing data. An easy solution for handling missing values is to exclude any record with a missing value. Instead, for gender we created dummy variables for female and unknown gender. We assigned teens$female the value 1 if gender is equal to F and is not NA; otherwise, it is assigned the value 0. Next, we addressed the 5,523 missing age values with a different strategy known as data imputation, which involves filling in the missing data with a guess as to the true value. Most people in a graduation cohort were born within a single calendar year. Having identified the typical age for each cohort, we had a reasonable estimate of the age of a student in that graduation year.
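The age cleanup and imputation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the field names `gradyear` and `age`, the use of the cohort mean as the "typical age", and the sample records are all hypothetical rather than taken from the project code.

```python
from statistics import mean


def impute_age_by_cohort(records, min_age=13, max_age=20):
    """Fill missing or implausible ages with the mean age of the same graduation year."""

    def valid(age):
        # Ages must be present, at least 13 and not yet 20.
        return age is not None and min_age <= age < max_age

    # Collect the plausible ages observed in each graduation cohort.
    by_year = {}
    for r in records:
        if valid(r["age"]):
            by_year.setdefault(r["gradyear"], []).append(r["age"])
    cohort_mean = {year: mean(ages) for year, ages in by_year.items()}

    imputed = []
    for r in records:
        age = r["age"] if valid(r["age"]) else cohort_mean.get(r["gradyear"])
        imputed.append({**r, "age": age})
    return imputed


# Hypothetical profiles: one missing age and one implausible age.
teens = [
    {"gradyear": 2006, "age": 18.5},
    {"gradyear": 2006, "age": 19.0},
    {"gradyear": 2006, "age": None},
    {"gradyear": 2009, "age": 15.0},
    {"gradyear": 2009, "age": 106.0},
    {"gradyear": 2009, "age": 16.0},
]
cleaned = impute_age_by_cohort(teens)
print([r["age"] for r in cleaned])
```

Treating out-of-range ages as missing before computing the cohort averages keeps a value such as 106 from distorting the estimate it is later replaced with.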
  • 16. To cluster the teenagers into marketing segments, we used an implementation of k-means clustering. We started our cluster analysis by considering only the 36 features that represent the number of times various interests appeared on the teen SNS profiles.
Evaluating clustering results can be somewhat subjective. Ultimately, the success or failure of the model hinges on whether the clusters are useful for their intended purpose. As the goal of this analysis was to identify clusters of teenagers with similar interests for marketing purposes, we largely measured our success in qualitative terms. For other clustering applications, more quantitative measures of success may be needed.
By examining whether the clusters fall above or below the mean level for each interest category, we can notice patterns that distinguish the clusters from each other. Cluster 3 is substantially above the mean interest level on all the sports. This suggests that this may be a group of Athletes per The Breakfast Club stereotype. Cluster 0 includes the most mentions of "cheerleading" and the word "hot," and is above the average level of football interest. Hence, these are the so-called Princesses. Similarly, we characterized the remaining clusters, and this is what we found.
Figure 12: Cluster segmentation
Cluster 0 (N = 872), Princess: cute, hair, shopping, clothes, dance
Cluster 1 (N = 21308), Basket Cases: ???
Cluster 2 (N = 1041), Criminals: drunk, deaths, drugs, die, music
Cluster 3 (N = 5971), Athletes: basketball, soccer, football, volleyball, soccer
Cluster 4 (N = 808), Brains: band, marching, music, rock
We then focused our effort on turning these insights into action. We applied the clusters back onto the full dataset and looked at the demographic characteristics of the clusters. The mean age does not vary much by cluster, which is not too surprising as these teen identities are often determined before high school. On the other hand, there are some substantial differences in the proportion of females by cluster. This is a very interesting finding, as we didn't use gender data to create the clusters, yet the clusters are still predictive of gender.
  • 17. Given our success in predicting gender, we also suspected that the clusters are predictive of the number of friends the users have. This hypothesis seems to be supported by the data. Our findings support the popular adage that "birds of a feather flock together." By using machine learning methods to cluster teenagers with others who have similar interests, we were able to develop a typology of teen identities that was predictive of personal characteristics, such as gender and the number of friends. These same methods can be applied to other contexts with similar results.
Advertising
Compared to other marketing techniques, email marketing is the cheapest way of sending a marketing message to millions of people. Being so cheap, it is the tool of choice for marketing teams with a small budget trying to sell cheap products. Most of the time, such products do not deliver what they promise. Unfortunately, with email marketing, we run the risk of being exposed to malware and fraudulent emails. Worms and viruses often make use of email and spam techniques to propagate. Phishing emails and Nigerian 419 scams are examples of fraudulent emails which try to harvest either our money or our personal information, including credit card details. So, while email marketing is the tool of choice for most marketing teams, it does require stringent regulation to ensure that it does not get abused.
Below we tried to build a model which predicts whether a composed message is spam or not. The dataset included the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham. Since Naive Bayes has been used successfully for e-mail spam filtering, it seems likely that it could also be applied to SMS spam. However, relative to e-mail spam, SMS spam poses additional challenges for automated filters.
SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify whether a message is junk.
Figure 13: Description of the data set
The first step towards constructing our classifier involves processing the raw data for analysis. SMS messages are strings of text composed of words, spaces, numbers, and punctuation. Handling this type of complex data takes a lot of thought and effort. One needs to consider how to remove numbers and punctuation; how to handle uninteresting words such as and, but, and or; and how to break apart sentences into individual words.
  • 18. Figure 14: Description of length of the Ham messages
Figure 15: Description of length of the Spam messages
Our first order of business was to standardize the messages to use only lowercase characters. To this end, we used the tolower() function, which returns a lowercase version of text strings. Continuing with our cleanup process, we also eliminated any punctuation from the text messages. Our next task was to remove filler words such as to, and, but, and or from our SMS messages. These terms are known as stop words and are typically removed prior to text mining, because although they appear very frequently, they do not provide much useful information for machine learning.
Another common standardization for text data involves reducing words to their root form in a process called stemming. The stemming process takes words like learned, learning, and learns, and strips the suffix to transform them into the base form, learn. After these removals, the messages are left with the blank spaces that previously separated the now-missing pieces, so the final step in our text cleanup process was to remove the additional whitespace.
A word cloud is a way to visually depict the frequency at which words appear in text data. The cloud is composed of words scattered somewhat randomly around the figure. The resulting word clouds are shown in the following diagram:
  • 19. Figure 16: Spam Word cloud
Figure 17: Ham Word cloud
Now that the data are processed to our liking, the next step is to split the messages into individual components through a process called vectorization. We took the corpus and created a data structure in which rows indicate documents (SMS messages) and columns indicate terms (words). The final step in the data preparation process was to transform the sparse matrix into a data structure that can be used to train a Naive Bayes classifier. The sparse matrix included over 6,500 features, one for every word that appears in at least one SMS message. It's unlikely that all of these are useful for classification. To reduce the number of features, we eliminated any word that appears in fewer than five SMS messages, or in less than about 0.1 percent of the records in the training data.
Figure 18: Vectorization
To evaluate the SMS classifier, we need to test its predictions on unseen messages in the test data. The process of evaluating machine learning algorithms is very similar to the process of evaluating students. Since algorithms have varying strengths and weaknesses, tests should distinguish among the learners.
Figure 19: Classification report
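The text cleanup and vectorization steps described for the SMS data can be sketched end to end in plain Python. This is a simplified illustration, not the project's pipeline: stemming is omitted, the stop-word list is a tiny stand-in, the three messages are invented, and the frequency threshold is lowered from five messages to two so that the toy example keeps some terms. The names `clean_message` and `build_term_matrix` are hypothetical.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"to", "and", "but", "or"}


def clean_message(text):
    """Lowercase, drop numbers and punctuation, remove stop words, tokenize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    # Replace every punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # split() also discards the extra whitespace left behind by the removals.
    return [word for word in text.split() if word not in STOP_WORDS]


def build_term_matrix(messages, min_doc_freq=2):
    """Return (vocabulary, rows) where rows[i][j] counts term j in message i."""
    tokenized = [clean_message(m) for m in messages]
    # Document frequency: number of messages that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        rows.append(row)
    return vocabulary, rows


# Hypothetical messages for illustration only.
messages = [
    "WINNER!! Call 09061701461 to claim your prize.",
    "Call me when you are free",
    "Claim your FREE prize now, call now!",
]
vocabulary, rows = build_term_matrix(messages)
print(vocabulary)
for row in rows:
    print(row)
```

Rows correspond to messages and columns to the retained terms, so the result has the same shape as the document-term structure described above and could be passed to a Naive Bayes classifier.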
  • 20. A confusion matrix is a table that categorizes predictions according to whether they match the actual value. One of the table's dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values. Although we have only seen 2 x 2 confusion matrices so far, a matrix can be created for models that predict any number of class values.
Lessons Learned
Lesson 1: Marketing research is fun - We get to work with a wide variety of datasets, dive in and learn all about the market our stakeholders are operating in, and relay valuable insights back to them. We dig up everything from why consumers make certain purchase decisions to what they’re passionate about and what makes them tick.
Lesson 2: Collaboration is key - While doing this project we found that, however strong individuals may be as innovators, collaboration is very important.
Lesson 3: Check, re-check and then check again - Projects move quickly, which means we don’t have time to go back and re-collect data or make corrections to a report. Questionnaires, surveys, and reports must be checked, checked by a coworker, and checked again.
Next Steps
The next step would be to explore the other facets of Marketing Analysis, such as “Upsell and Cross Sell” and “Recommendation System”. We can use algorithms like Principal Component Analysis (PCA), QDA and LDA to reduce the number of features. We can also analyze time series data using ARIMA models.