Lecture3.pptx

What is Data Science?
Data Science is a combination of multiple disciplines that
uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from
it.

By using Data Science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns or hidden
information in the data)

Where is Data Science Needed?
• Data Science is used in many industries in the world
today, e.g. banking, consultancy, healthcare, and
manufacturing.
• Examples of where Data Science is needed:

Data Science can be applied in nearly
every part of a business where data is
available. Examples are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistics companies
• E-commerce

How Does a Data Scientist Work?
• A Data Scientist needs expertise in several
areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases

Here is how a Data Scientist works:
1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer
feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is
shorter than 1.8 m, but the number 140 is larger than 1.8, so values
must be brought to a common scale before they are compared).
7. Analyze data - Find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a
way the company can understand.
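Steps 5 and 6 can be sketched in a few lines. This is only an illustration: the height values are made up, and min-max scaling is one common way to normalize, not the only one.

```python
import pandas as pd

# Hypothetical measurements in cm; None marks a missing value.
heights_cm = pd.Series([140.0, 180.0, None, 160.0])

# Step 5: replace the missing value with the average of the known values.
filled = heights_cm.fillna(heights_cm.mean())

# Step 6: min-max scaling maps every value into the range 0 to 1.
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(filled.tolist())
print(scaled.tolist())
```

After scaling, the smallest value becomes 0 and the largest becomes 1, regardless of the original unit.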

What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data,
making it interpretable and easy to work with.

Data can be categorized into two groups:
• Structured data
• Unstructured data

Unstructured Data
• Unstructured data is not organized. We must organize
the data for analysis purposes.

Structured Data
• Structured data is organized and easier to work with.

How to Structure Data?
• We can use an array or a database table to structure or
present data.
• Example of an array:
• [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

Try this:
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)

Database Table
• A database table is a table with structured data.
• The following table shows a database table with health
data extracted from a sports watch:

Database Table Structure
A database table consists of column(s)
and row(s):

Variables
• A variable is defined as something that can be
measured or counted.
• Examples can be characters, numbers or time.
• In a database table, each column
represents a variable.
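A small sketch of this idea, using a Pandas DataFrame (covered later in this lecture) as a stand-in for a database table. The column names are assumed for illustration; the values reuse arrays that appear elsewhere in this lecture.

```python
import pandas as pd

# Each key becomes a column (a variable); each position in the lists becomes a row.
# Column names are assumed, not taken from a real database.
health_data = pd.DataFrame({
    "Average_pulse": [80, 85, 90, 95, 100],
    "Calorie_burnage": [240, 250, 260, 270, 280],
})

print(health_data.columns.tolist())  # the variables (columns)
print(len(health_data))              # the number of rows
```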

Data Science & Python
• Python is a programming language widely used by Data
Scientists.
• Python has built-in mathematical libraries and
functions, making it easier to solve mathematical
problems and to perform data analysis.

Python Libraries
• Python has libraries with large collections of mathematical
functions and analytical tools.
• In this course, we will use the following libraries:
• Pandas - Used for structured data operations, such as
importing CSV files, creating DataFrames, and preparing data.
• NumPy - A mathematical library. It provides a powerful
N-dimensional array object, linear algebra, Fourier transforms, etc.
• Matplotlib - Used for visualizing data.
• SciPy - A scientific computing library that includes linear algebra modules.

Data Science - Python DataFrame
• Create a DataFrame with Pandas
• A data frame is a structured representation of data.
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3':
[7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
We write pd. in front of DataFrame() to tell Python that we want to use
DataFrame() from the Pandas library.
Be aware of the capital D and F in DataFrame!

Example 1
Count the number of columns:
count_column = df.shape[1]
print(count_column)
Example 2
Count the number of rows:
count_row = df.shape[0]
print(count_row)

Data Science Functions
• This chapter shows three commonly used functions
when working with Data Science: max(), min(), and
mean().

The max() function
• The Python max() function is used to find the highest value among its
arguments or in an array.
• Example:
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
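The min() function listed at the start of this chapter works the same way and returns the lowest value. A minimal sketch, reusing the pulse values from the max() example (the variable names are chosen here for illustration):

```python
# Same pulse values as in the max() example, stored in a list.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# min() accepts a list as well as separate arguments.
average_pulse_min = min(average_pulse)
print(average_pulse_min)  # 80
```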

The mean() function
The NumPy mean() function is used to find the average value of an
array.
Example:
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)

Data Science - Data Preparation
• Extract and Read Data With Pandas
import pandas as pd
sample_data = pd.read_csv('music.csv')
print(sample_data)

Data Cleaning
import pandas as pd
sample_data = pd.read_csv("music.csv")
X = sample_data.drop(columns=['genre'])
print(X)
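To try the same steps without the music.csv file, the sketch below reads a small in-memory CSV instead. The rows and the columns other than 'genre' are assumptions made up for this illustration.

```python
import io
import pandas as pd

# Stand-in for music.csv; every column except 'genre' is assumed.
csv_text = """age,gender,genre
20,1,HipHop
25,0,Jazz
31,1,Classical
"""

sample_data = pd.read_csv(io.StringIO(csv_text))

# drop() returns a new DataFrame; sample_data itself is unchanged.
X = sample_data.drop(columns=["genre"])

print(X.columns.tolist())
print(sample_data.columns.tolist())
```

Because drop() does not modify the original DataFrame, the 'genre' column is still available in sample_data if it is needed later.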

Data Categories
• Data can be split into three main categories:
1. Numerical - Contains numerical values. Can be divided
into two categories:
   1. Discrete: Numbers are counted as "whole". Example: You cannot
   have trained 2.5 sessions, it is either 2 or 3
   2. Continuous: Numbers can be of infinite precision. For example, you
   can sleep for 7 hours, 30 minutes and 20 seconds, or about 7.506 hours
2. Categorical - Contains values that cannot be measured up
against each other. Example: A color or a type of training
3. Ordinal - Contains categorical data that can be measured
up against each other. Example: School grades where A is
better than B and so on
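The difference between categorical and ordinal data can be sketched with Pandas. This is an illustration only: the grades and colors are made up, and pd.Categorical is one possible way to represent such data.

```python
import pandas as pd

# Ordinal: hypothetical grades; the ordering C < B < A is stated explicitly.
grades = pd.Categorical(
    ["B", "A", "C", "A"],
    categories=["C", "B", "A"],
    ordered=True,
)
print(grades.min(), grades.max())

# Categorical: colors have no order, so they cannot be ranked.
colors = pd.Categorical(["red", "blue", "red"])
print(colors.ordered)
```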

Data Types
We can use the Pandas info() method to list the data types
within our data set. info() prints its summary directly, so it
does not need to be wrapped in print():
Example: sample_data.info()
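A self-contained sketch of inspecting data types. The small DataFrame is made up for illustration and only stands in for the real data set.

```python
import pandas as pd

# Hypothetical data with one numerical and one text column.
sample_data = pd.DataFrame({
    "age": [20, 25, 31],
    "genre": ["HipHop", "Jazz", "Classical"],
})

# info() prints its summary directly and returns None.
sample_data.info()

# dtypes holds the data type of each column as a Series.
print(sample_data.dtypes)
```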

Analyze the Data
When we have cleaned the data set, we can start
analyzing the data.
We can use the Pandas describe() method to
summarize data.
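A minimal sketch of describe(), assuming a DataFrame with a single numerical column. The values are the calorie figures from the mean() example; the DataFrame itself is built here only for illustration.

```python
import pandas as pd

# Calorie values from the mean() example in this lecture.
df = pd.DataFrame(
    {"Calorie_burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]}
)

# describe() returns count, mean, std, min, quartiles, and max per numerical column.
summary = df.describe()
print(summary)
```

The result is itself a DataFrame, so a single statistic can be read with, for example, summary.loc["mean", "Calorie_burnage"].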
