This Data Science presentation will help you understand what Data Science is, who a Data Scientist is, what a Data Scientist does, and how Python is used for Data Science. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge or insights from data in various forms, both structured and unstructured, similar to data mining. This Data Science tutorial will help you build your skills in analytical techniques using Python. With this Data Science video, you’ll learn the essential concepts of Data Science with Python programming and also understand how data acquisition, data preparation, data mining, model building & testing, and data visualization are done. This Data Science tutorial is ideal for beginners who aspire to become Data Scientists.
This Data Science presentation will cover the following topics:
1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. A data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes: data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries
3. Understand the essential concepts of Python programming such as data types, tuples, lists, dicts, basic operators, and functions
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions.
Learn more at: https://www.simplilearn.com
2. What’s in it for you?
What is Data Science? Who is a Data Scientist?
What are the skills of a Data Scientist?
What does a Data Scientist do?
Data Acquisition
Data Preparation
Data Mining with Tableau
Model Building & Testing
Model Maintenance
Summary
3. What is Data Science?
Data Science is the area of study which involves extracting knowledge
from all the data you can gather
4. What is Data Science?
Hence, translating a business problem into a research project, and then translating the results
back into a practical solution requires a deep understanding of the business domain and
creativity
5. What is Data Science?
Fraudulent transactions are rampant on the internet and the results can be devastating
6. What is Data Science?
Data Science can help in fraud detection using advanced machine learning algorithms and prevent great
monetary losses
7. What is Data Science?
And people doing all this work are called Data Scientists
8. What does a Data Scientist do?
Data Acquisition
Data Preparation
Data Mining
Model Building
Model Maintenance
9. What does a Data Scientist do?
Data Acquisition
Data Preparation
Data Mining
Model Building
Model Maintenance
10. Data Acquisition
Data Acquisition involves:
Extracting data from multiple source systems (databases, flat files, etc.)
Integrating and transforming the data into a homogeneous format
Loading the transformed data into a data warehouse
14. Data Acquisition
Data flows from the multiple source systems, through a transformation step, into the data warehouse. This process is commonly referred to as ETL (Extract, Transform, Load).
16. What does a Data Scientist do?
Data Acquisition
Data Preparation
Data Mining
Model Building
Model Maintenance
17. Data Preparation
The most essential part of any Data Science project is Data Preparation; it typically consumes about 60% of the time spent on the project.
There are many things we can do to ensure
data is used in the most productive and
meaningful manner
18. Data Preparation covers: Data Cleaning, Data Transformation, Handling Outliers, Data Integration, and Data Reduction.
DATA CLEANING:
• Data cleaning is important because bad data may lead to a bad model
• It handles missing values, which may occur for various reasons
• NULL and other unwanted values are also handled in this step
• It improves business decisions and increases productivity
19. DATA TRANSFORMATION:
• Data Transformation turns raw data formats into desired outputs
• It involves normalization of data:
• Min-max normalization
• Z-score normalization
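As a minimal sketch of the two normalizations above (toy values; the `CreditScore` column name is borrowed from the bank dataset used later in the deck):

```python
import pandas as pd

# Toy CreditScore column (hypothetical values for illustration)
df = pd.DataFrame({"CreditScore": [350.0, 500.0, 650.0, 800.0]})

# Min-max normalization: rescales values into the [0, 1] range
mn, mx = df.CreditScore.min(), df.CreditScore.max()
df["score_minmax"] = (df.CreditScore - mn) / (mx - mn)

# Z-score normalization: zero mean, unit standard deviation
df["score_z"] = (df.CreditScore - df.CreditScore.mean()) / df.CreditScore.std()

print(df)
```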
20. HANDLING OUTLIERS:
• Outliers are observations that are distant from the rest of the data
• They can carry useful signal or distort the analysis
• Univariate and multivariate methods are used for detection
• Outliers can be used for fraud detection
• Plots used: scatter plots, box plots
21. DATA INTEGRATION:
• Data integration combines data from multiple sources into a consistent, reliable view
• Data integrity, meaning the data is accurate and reliable, must be preserved in the process
• It is important where business decisions are made on the basis of data regularly
22. DATA REDUCTION:
• This step reduces multitudinous amounts of data into meaningful parts
• It increases storage efficiency
• It helps reduce costs
• It involves data deduplication, which eliminates redundant data
23. Data Preparation – Data Cleaning
Data cleaning is one of the most common tasks of data preparation which involves ensuring that the data is:
Valid
Consistent
Uniform
Accurate
24. Data Preparation – Data Cleaning
Let’s have a look at a bank’s data with its customer details
25. Data Preparation – Data Cleaning
We can import and read the data using pandas library of Python
26. Let’s look at the Geography column which has some missing
values
In this case, we might not want to assume the location of the
customers
Data Preparation – Data Cleaning
27. So, let’s fill the missing values with an empty string
Data Preparation – Data Cleaning
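A minimal pandas sketch of that fill, assuming the column is named `Geography` as in the screenshots:

```python
import pandas as pd

# Toy customer data with missing Geography values
data = pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "Geography": ["France", None, "Spain"],
})

# Fill missing locations with an empty string rather than guessing them
data["Geography"] = data["Geography"].fillna("")
print(data.Geography.tolist())
```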
28. In numerical columns, filling the missing values with the column mean can help even out the data set:
Before: After:
Data Preparation – Data Cleaning
data.CreditScore = data.CreditScore.fillna(data.CreditScore.mean())
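A runnable sketch of the same mean fill on toy data (the numbers are made up for illustration):

```python
import pandas as pd

# Toy CreditScore column with one missing value
data = pd.DataFrame({"CreditScore": [600.0, None, 700.0]})

# Replace missing scores with the column mean (650.0 here)
data.CreditScore = data.CreditScore.fillna(data.CreditScore.mean())
print(data.CreditScore.tolist())
```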
29. We can remove all rows with unwanted values; however, this is a very aggressive step
Data Preparation – Data Cleaning
Dropping all rows with any NA values is easy:
data.dropna()
We can also drop only the rows where all values are NA:
data.dropna(how='all')
We can also set a limit, for example that a row needs at least 5 non-null values to be kept:
data.dropna(thresh=5)
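The three `dropna` variants above can be compared on a small toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 is partially missing, row 2 is all-NA
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

dropped_any = data.dropna()            # drops rows with ANY NA value
dropped_all = data.dropna(how="all")   # drops only rows where ALL values are NA
kept_thresh = data.dropna(thresh=2)    # keeps rows with at least 2 non-NA values
print(len(dropped_any), len(dropped_all), len(kept_thresh))
```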
33. Data Mining
After acquiring the data, preparing and cleaning it, a Data Scientist discovers patterns and relationships in the data to make better business decisions
34. Data Mining
It’s a discovery process, to get
hidden and useful knowledge.
It is commonly referred to as ‘Data
Mining’
36. Data Mining using Tableau
You can easily download Tableau software from
https://www.tableau.com/
37. Data Mining using Tableau
You can connect your csv file to Tableau desktop to start
mining the data!
You can connect to various other sources also
38. Data Mining using Tableau
Tableau will automatically identify the fields as Dimensions
(descriptive) and Measures (numeric fields)
39. Problem Statement
It is important to understand the problem statement first!
In this problem, we have a dataset of a bank’s customers.
We want to understand the exit behavior of the customers based on different variables, for instance:
• Gender
• Credit card holding
• Geography
Data Mining using Tableau
40. Data Mining using Tableau
Now, we have a dataset of about 10,000 rows as shown below and we need to find out potential
customers who will exit first
41. Let’s first evaluate on the basis of gender, to understand if this variable will affect the model
Data Mining using Tableau
42. We have moved the variable ‘exited’ from Measures to Dimensions
Data Mining using Tableau
43. Here, Tableau recognizes the ‘exited’ column as a measure because of its
numerical datatype. However, ‘exited’ is categorical for us, as it tells
whether a person has left or not.
So, we have placed ‘exited’ in dimensions as seen in the screenshot:
Data Mining using Tableau
44. Data Mining using Tableau
If we put ‘gender’ on Columns and ‘exited’ on Color, we get the view below
Where:
0 – people who did not exit
1 – people who did exit
45. For better understanding, let’s
prepare a bar chart which
depicts male/female who
stayed/exited
Data Mining using Tableau
46. Data Mining using Tableau
Stayed
Exited
You can give aliases describing that 0
means stayed and 1 means exited
47. To compare the exit rates, you can also add a
reference line by right clicking on the right
axis and selecting ‘Add Reference Line’
Data Mining using Tableau
49. Data Mining using Tableau
So, we have added the reference line and we can say
that female customers exit more than average as
compared to male customers!
50. We can also evaluate on the basis of whether or not customers hold a credit card. Again, whether a person
has a credit card is categorical, so we put it into Dimensions and then evaluate:
Data Mining using Tableau
As explained earlier, we have added the
alias and a reference line.
51. Data Mining using Tableau
Here, we can see that we cannot differentiate the exit behavior between people who have a credit card and those who do not
Hence, a customer leaving the bank does not depend on whether the person holds a credit card
52. Data Mining using Tableau
Similarly, we can evaluate on the basis of geography:
Here, we can see that more customers leave the bank in Germany than in France or Spain. So, we treat this as an anomaly, and it can impact the model
53. Data Mining using Tableau
So, we can ignore the variable ‘has
credit card’ because it will not impact the
model
54. Data Mining using Tableau
And focus on other variables like
‘gender’, ‘geography’ which can
actually have an impact
55. Data Mining using Tableau
So, we can make business
decisions based on these findings.
Let’s see some advantages of data
mining
56. Advantages of Data Mining
Predicts future trends
Signifies customer patterns
Helps decision making
Quick fraud detection
Helps in choosing the right algorithm
58. What is Model Building?
The model is built by selecting a
machine learning algorithm that suits
the data, use case, and available
resources
59. What is Model Building?
Many choose to build more than
one model and go ahead only with
the best fit
60. What is Model Building?
There are many machine learning
algorithms, chosen based on the
data and the problem at hand
61. Machine Learning Algorithms used by Data Scientists (Continuous vs Categorical)
Supervised – Continuous:
• Regression (Linear, Multiple)
• Decision Trees
• Random Forest
Supervised – Categorical:
• Classification (KNN, Logistic Regression, SVM, Naïve Bayes)
Unsupervised – Continuous:
• Clustering (K-Means)
• PCA
Unsupervised – Categorical:
• Association Analysis
• Hidden Markov Model
65. What is Model Building?
Let me take you through a use
case for better understanding
66. Predicting World Happiness
Model Building - Multiple Linear Regression
Questions to ask:
1) How do we describe the data?
2) Can we make a predictive model to calculate the happiness score?
67. Variables used in the model
Country
Region
Happiness Rank
Happiness Score
Economy
Family
Health
Freedom
Trust
Generosity
Dystopia Residual
68. Variables Used in Data:
Happiness Rank: determined by a country’s happiness score
Happiness Score: a score given to a country, computed by adding up the (normalized) rankings that its population has given to each category
Country: the country in question
Region: the region that the country belongs to (different from continent)
Economy: GDP per capita of the country
Family: quality of family life, nuclear and joint family
Health: healthcare availability and average life expectancy in the country
Freedom: how much an individual is able to conduct themselves based on their free will
Trust: trust that the government is not corrupt
Generosity: how much the country is involved in peacekeeping and global aid
Dystopia Residual: the Dystopia happiness score (1.85), i.e. the score of a hypothetical country ranked lower than the lowest-ranking country on the report, plus the residual value of each country (a number left over from the normalization of the variables which cannot be explained)
Model Building - Multiple Linear Regression
70. Preparing and Describing the Data:
• Read csv file using pandas “read_csv” and put it in a dataframe ‘happiness_2015’
• Similarly for 2016 and 2017, store the data in happiness_2016 and happiness_2017 respectively
Model Building - Multiple Linear Regression
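A runnable sketch of that read step; since the actual csv files are not available here, an in-memory stand-in is used and the column names are assumptions:

```python
import io
import pandas as pd

# Stand-in for the real 2015 csv file (column names are assumptions)
csv_2015 = io.StringIO(
    "Country,Happiness Rank,Happiness Score\n"
    "Switzerland,1,7.587\n"
    "Iceland,2,7.561\n"
)

# In the tutorial this would be pd.read_csv("2015.csv"),
# and similarly for happiness_2016 and happiness_2017
happiness_2015 = pd.read_csv(csv_2015)
print(happiness_2015.shape)
```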
73. We can concatenate the three
datasets, or we can build a model
separately for one csv
head() shows the top countries with the
highest happiness score
Model Building - Multiple Linear Regression
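A minimal sketch of the concatenation and `head()` step on toy yearly frames (values are illustrative, not the real report figures):

```python
import pandas as pd

# Toy yearly frames standing in for happiness_2015/2016/2017
happiness_2015 = pd.DataFrame({"Country": ["Switzerland"], "Happiness Score": [7.587]})
happiness_2016 = pd.DataFrame({"Country": ["Denmark"], "Happiness Score": [7.526]})
happiness_2017 = pd.DataFrame({"Country": ["Norway"], "Happiness Score": [7.537]})

# Concatenate the three years into one frame
happiness = pd.concat([happiness_2015, happiness_2016, happiness_2017],
                      ignore_index=True)

# head() shows the first rows; sort first to see the happiest countries on top
top = happiness.sort_values("Happiness Score", ascending=False).head()
print(top.Country.tolist())
```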
74. We can create a visual which gives us a more appealing view of where each country is placed in the World ranking
report
Model Building - Multiple Linear Regression
75. Model Building - Multiple Linear Regression
The lighter colored countries
have a lower ranking.
The darker colored countries have
a higher ranking on the report (i.e. are
the “happiest”)
76. Model Building - Multiple Linear Regression
We can find out the correlation between Happiness Score and Happiness Rank by plotting a scatterplot
77. Model Building - Multiple Linear Regression
And the code is right
here!
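The code on the slide survives only as an image; a minimal matplotlib sketch of such a scatterplot, on toy values with assumed column names, might look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy scores and ranks: rank 1 goes to the highest score
happiness = pd.DataFrame({
    "Happiness Score": [7.6, 7.5, 6.9, 5.4, 3.0],
    "Happiness Rank": [1, 2, 3, 4, 5],
})

# Scatterplot of Happiness Score against Happiness Rank
plt.scatter(happiness["Happiness Score"], happiness["Happiness Rank"])
plt.xlabel("Happiness Score")
plt.ylabel("Happiness Rank")
plt.savefig("score_vs_rank.png")

# The correlation is strongly negative: higher score -> lower (better) rank
corr = happiness["Happiness Score"].corr(happiness["Happiness Rank"])
print(corr)
```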
78. Model Building - Multiple Linear Regression
Let’s dissect the graph:
• Happiness score determines how the country is ranked, so we take happiness score as the predictor and happiness rank as the dependent variable
• The higher the score, the lower the numerical rank, and the higher the happiness rating
• Therefore, happiness score and rank are negatively correlated (as score increases, rank decreases)
79. However, we see that rank is directly
dependent on Happiness Score. So, we can
drop rank from our data frame
Model Building - Multiple Linear Regression
80. Model Building - Multiple Linear Regression
We can draw a heat map and
see the correlation between the
variables
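A minimal sketch of such a correlation heat map on toy columns (seaborn’s `heatmap` is a common alternative to the plain matplotlib version below):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Toy numeric columns standing in for the happiness variables
happiness = pd.DataFrame({
    "Happiness Score": [7.6, 7.0, 5.5, 4.0],
    "Economy":         [1.4, 1.3, 0.9, 0.4],
    "Health":          [0.9, 0.9, 0.6, 0.3],
})

# Pairwise correlations between the variables
corr = happiness.corr()

# Render the correlation matrix as a heat map
# (seaborn's sns.heatmap(corr, annot=True) is a common alternative)
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.savefig("correlation_heatmap.png")

print(corr.shape)
```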
81. Model Building - Multiple Linear Regression
We can see that happiness score is
strongly correlated with economy and
health, followed by family and
freedom.
82. Thus, in our model, we should see strong correlation between variables when finding the
coefficients.
Model Building - Multiple Linear Regression
83. Now, we can start to use sklearn (scikit-learn) to construct the model
We drop the categorical columns and happiness rank because we will not explore them in this report
Model Building - Multiple Linear Regression
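A minimal sketch of that column-dropping step, with assumed column names:

```python
import pandas as pd

# Toy frame with categorical columns and the rank (column names are assumptions)
happiness = pd.DataFrame({
    "Country": ["Switzerland", "Iceland"],
    "Region": ["Western Europe", "Western Europe"],
    "Happiness Rank": [1, 2],
    "Happiness Score": [7.587, 7.561],
    "Economy": [1.397, 1.302],
})

# Drop the categorical columns and the rank before modelling;
# the target is the happiness score itself
X = happiness.drop(columns=["Country", "Region", "Happiness Rank",
                            "Happiness Score"])
y = happiness["Happiness Score"]
print(list(X.columns))
```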
84. Let’s divide the dataset into train and test for further model building and testing:
Model Building - Multiple Linear Regression
So, the data is split in an 80:20 ratio between the train and test datasets respectively
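The 80:20 split can be sketched with scikit-learn’s `train_test_split` on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 samples, 2 features) and target
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80:20 split between train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))
```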
85. Then, we import sklearn’s linear regression to fit the regression to the training set
Model Building - Multiple Linear Regression
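A minimal sketch of fitting sklearn’s linear regression on toy training data (an exact linear relation, so the fit recovers it):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data following y = 2*x0 + 3*x1 exactly
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_train = 2 * X_train[:, 0] + 3 * X_train[:, 1]

# Fit the regression to the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on a new sample
pred = regressor.predict(np.array([[3.0, 2.0]]))
print(pred)
```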
86. Model Building - Multiple Linear Regression
Evaluate the predicted values and calculate the difference:
87. Model Building - Multiple Linear Regression
We can print the intercept and coefficients for the train dataset:
88. Model Building - Multiple Linear Regression
Trick: right after you have trained your model, you know the order of the coefficients
This will print the coefficients
and the corresponding
features
Tip: you can also use a dictionary to reuse the coefficients later
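The coefficient/feature pairing and the dictionary tip can be sketched as follows (feature names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with named features (hypothetical names matching the dataset)
features = ["Economy", "Health"]
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_train = 2 * X_train[:, 0] + 3 * X_train[:, 1]

regressor = LinearRegression().fit(X_train, y_train)

# Pair each feature name with its coefficient, kept in a dict for later reuse
coefficients = dict(zip(features, regressor.coef_))
for name, coef in coefficients.items():
    print(name, round(coef, 4))
```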
89. Model Building - Multiple Linear Regression
For testing the model, let’s find out the mean error:
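A minimal sketch of computing mean errors with sklearn’s metrics on toy actual/predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy actual vs predicted happiness scores
y_test = np.array([7.5, 6.0, 5.0, 4.5])
y_pred = np.array([7.4, 6.2, 5.1, 4.4])

# Mean absolute error and root mean squared error
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(mae, rmse)
```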
90. Model Building - Multiple Linear Regression
Using the model’s predict method, we can predict the happiness scores for the first 100 countries in our dataset
How do these predictions compare
to the actual values in our data?
91. Model Building - Multiple Linear Regression
Let’s create a plot of our actual
happiness score versus the
predicted happiness score.
92. Model Building - Multiple Linear Regression
Hence, we can say that our model is a pretty good indicator of the
actual happiness score as there is a strong positive correlation
between the actual and the predicted values
93. Model Building - Multiple Linear Regression
The Multiple Linear Regression Model for Happiness Score:
HappinessScore = 0.0001289
+ (1.000048 × Economy)
+ (0.999997 × Family)
+ (0.999865 × Health)
+ (0.999920 × Freedom)
+ (0.999993 × Trust)
+ (1.000029 × Generosity)
+ (0.999971 × DystopiaResidual)
95. But the battle is not over yet!!
Communicate Results
Before you present your data model and insights, first understand:
What is your goal? Who is your target audience?
97. Communicate Results
After you are clear about your objective, think about:
• What’s the question?
• What’s your
methodology?
• What’s the answer?
98. Communicate Results
Then, you’re finally ready to communicate your results to the
business teams so that the work easily moves into the execution phase
100. Model Maintenance
After we have deployed
the model, it is also
important to prevent the
model from deteriorating.
How can we do that?
101. ASSESS: once in a while, we have to run a fresh
sample of data through the model to assess it
RETRAIN: after assessment, if you are not happy with the
performance of the model, you retrain it; the variables remain
the same, but training on a fresh sample changes the coefficients
REBUILD: if retraining fails too, then you have to rebuild:
find all the variables again and build a new model
Model Maintenance
Model Maintenance
102. Summary
What is Data Science?
What does a Data Scientist do?
Data Acquisition
Data Mining using Tableau
Model Building
Model Maintenance
103. Model Building Using ARIMA
Plot the data:
As we can observe, the mean and variance are
changing with time; hence, this is a non-stationary
series
The important deciding variable seems to be score (as it determines rank).
The darker red the square, the stronger the positive correlation; naturally, each variable has a correlation of 1 with itself.