What is Data Science?
Data Science is a combination of multiple disciplines that
uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from
it.
By using Data Science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analyses (what will happen next?)
• Pattern discoveries (find patterns, or hidden
information, in the data)
Where is Data Science Needed?
• Data Science is used in many industries in the world
today, e.g. banking, consultancy, healthcare, and
manufacturing.
• Examples of where Data Science is needed:
Data Science can be applied in nearly
every part of a business where data is
available. Examples are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistics companies
• E-commerce
How Does a Data Scientist Work?
• A Data Scientist requires expertise in several
fields:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
Here is how a Data Scientist works:
1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer
feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is
smaller than 1.8 m, yet the number 140 is larger than 1.8, so
scaling is important).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a
way the "company" can understand.
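Steps 5 and 6 can be sketched in a few lines of Pandas. This is only an illustration under stated assumptions: the height values, the column names (height, unit, height_m) and the choice of the column average as the replacement value are hypothetical, not taken from a real data set. The sketch scales first and fills afterwards, because an average over mixed units would be meaningless.

```python
import pandas as pd

# Hypothetical measurements: some in centimetres, one in metres,
# and one missing value (None becomes NaN).
raw = pd.DataFrame({
    "height": [140.0, 1.8, None, 165.0],
    "unit": ["cm", "m", "cm", "cm"],
})

# Step 6 (scaling): express every height in metres so values are comparable.
factor = raw["unit"].map({"cm": 0.01, "m": 1.0})
raw["height_m"] = raw["height"] * factor

# Step 5 (missing values): replace the gap with the column average.
mean_height = raw["height_m"].mean()
raw["height_m"] = raw["height_m"].fillna(mean_height)

print(raw)
```

After scaling, the 140 cm row is correctly smaller than the 1.8 m row, and the missing entry holds the average of the three known heights.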
What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data,
making it interpretable and easy to work with.
Data can be categorized into two groups:
• Structured data
• Unstructured data
Unstructured Data
• Unstructured data is not organized. We must organize
the data for analysis purposes.
Structured Data
• Structured data is organized and easier to work with.
How to Structure Data?
• We can use an array or a database table to structure or
present data.
• Example of an array:
• [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Try this:
Array=[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
Database Table
• A database table is a table with structured data.
• The following table shows a database table with health
data extracted from a sports watch:
Database Table Structure
A database table consists of column(s)
and row(s):
Variables
• A variable is defined as something that can be
measured or counted.
• Examples can be characters, numbers or time.
• In the example below, we can observe that each column
represents a variable.
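As a rough sketch of "each column is a variable", the two arrays used elsewhere in this lecture can be placed side by side in a Pandas data frame. Pairing these two arrays row by row is an assumption made for illustration, and the names health, variables and observations are hypothetical.

```python
import pandas as pd

# Two variables, using the pulse and calorie values that appear in this lecture.
health = pd.DataFrame({
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

variables = list(health.columns)  # one entry per column (variable)
observations = len(health)        # one row per observation

print(variables)
print(observations)
```

The column labels are stored separately from the data rows, so len() counts only the observations.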
Data Science & Python
• Python is a programming language widely used by Data
Scientists.
• Python has built-in mathematical functions and a rich set of
libraries, making it easier to solve mathematical
problems and to perform data analysis.
Python Libraries
• Python has libraries with large collections of mathematical
functions and analytical tools.
• In this course, we will use the following libraries:
• Pandas - This library is used for structured data operations, such as
importing CSV files, creating data frames, and preparing data.
• NumPy - This is a mathematical library. It has a powerful
N-dimensional array object, linear algebra, Fourier transforms, etc.
• Matplotlib - This library is used for visualization of data.
• SciPy - This library has linear algebra modules.
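To give a feel for what the NumPy entry means, here is a minimal sketch of an N-dimensional array and one linear algebra call. The numbers and the variable names (a, coeffs, rhs, solution) are made up for illustration.

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)          # number of rows and columns
print(a.mean(axis=0))   # average of each column

# Linear algebra: solve the system
#   2x +  y = 5
#    x + 3y = 10
coeffs = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([5.0, 10.0])
solution = np.linalg.solve(coeffs, rhs)
print(solution)         # x and y
```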
Data Science - Python DataFrame
• Create a DataFrame with Pandas
• A data frame is a structured representation of data.
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3':
[7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
We write pd. in front of DataFrame() to tell Python that we want to use the
DataFrame() constructor from the Pandas library.
Note the capital D and F in DataFrame!
Example 1
Count the number of columns:
count_column = df.shape[1]
print(count_column)
Example 2
Count the number of rows:
count_row = df.shape[0]
print(count_row)
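Both examples read from the same attribute: df.shape is a tuple of (rows, columns). A self-contained sketch that rebuilds the data frame from the previous slide and unpacks both counts at once might look like this:

```python
import pandas as pd

# Same data as in the DataFrame example.
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# shape is a (number_of_rows, number_of_columns) tuple.
count_row, count_column = df.shape
print(count_row, count_column)
```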
Data Science Functions
• This chapter shows three commonly used functions
when working with Data Science: max(), min(), and
mean().
The max() function
• The Python max() function is used to find the highest value among its
arguments, or in a sequence such as a list.
• Example:
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
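The min() function is listed among the three functions but has no example of its own in these slides. A minimal sketch, assuming the same pulse values stored in a list (the lowercase variable names are illustrative), shows that min() and max() work the same way:

```python
# The pulse values from the max() example, stored in a list.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Both built-in functions accept a list (or any other iterable).
average_pulse_min = min(average_pulse)
average_pulse_max = max(average_pulse)

print(average_pulse_min, average_pulse_max)
```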
The mean() function
The NumPy mean() function is used to find the average value of an
array.
Example:
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
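The average is simply the sum of the values divided by how many there are. As a quick sanity check, the sketch below computes it by hand and compares the result with np.mean(); the names manual_mean and numpy_mean are illustrative.

```python
import numpy as np

calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# Mean by hand: sum of the values divided by the number of values.
manual_mean = sum(calorie_burnage) / len(calorie_burnage)

# Mean with NumPy.
numpy_mean = np.mean(calorie_burnage)

print(manual_mean, numpy_mean)
```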
Data Science - Data Preparation
• Extract and Read Data With Pandas
import pandas as pd
sample_data = pd.read_csv('music.csv')
print(sample_data)
Data Cleaning
import pandas as pd
sample_data = pd.read_csv("music.csv")
X = sample_data.drop(columns=['genre'])
print(X)
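The contents of music.csv are not shown in the slides, so the sketch below uses a small hypothetical stand-in (the age and gender columns and all values are assumptions) read from an in-memory string. It illustrates that drop() returns a new data frame and leaves the original untouched.

```python
import io
import pandas as pd

# Hypothetical stand-in for music.csv; only the 'genre' column name
# comes from the slide, everything else is made up.
csv_text = """age,gender,genre
20,1,HipHop
25,0,Jazz
31,1,Classical
"""
sample_data = pd.read_csv(io.StringIO(csv_text))

# drop() returns a copy without the 'genre' column.
X = sample_data.drop(columns=["genre"])

print(list(X.columns))
print(list(sample_data.columns))  # the original still has 'genre'
```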
Data Categories
• Data can be split into three main categories:
1. Numerical - Contains numerical values. Can be divided
into two categories:
   1. Discrete: Numbers are counted as "whole". Example: You cannot
   have trained 2.5 sessions; it is either 2 or 3.
   2. Continuous: Numbers can be of infinite precision. For example, you
   can sleep for 7 hours, 30 minutes and 20 seconds, or about 7.506 hours.
2. Categorical - Contains values that cannot be ranked
against each other. Example: A color or a type of training.
3. Ordinal - Contains categorical data that can be ranked
against each other. Example: School grades, where A is
better than B, and so on.
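One possible way to represent these categories in Pandas is sketched below. All values and variable names are hypothetical, and using pd.Categorical with ordered=True for ordinal data is one option among several.

```python
import pandas as pd

# Numerical, discrete: whole training sessions.
sessions = pd.Series([2, 3, 1])

# Numerical, continuous: hours of sleep.
sleep_hours = pd.Series([7.5, 6.25, 8.0])

# Categorical: labels with no ranking.
training_type = pd.Series(["running", "cycling", "running"], dtype="category")

# Ordinal: categories with a ranking, listed from worst to best.
grades = pd.Series(
    pd.Categorical(["B", "A", "C"], categories=["C", "B", "A"], ordered=True)
)

print(grades.max())              # best grade in the series
print((grades > "C").tolist())   # which grades are better than C
```

Comparisons such as grades > "C" only make sense for the ordered (ordinal) series; Pandas refuses them for unordered categories.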
Data Types
We can use the info() function to list the data types
within our data set. info() prints its report itself, so
it does not need to be wrapped in print():
Example: sample_data.info()
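Since music.csv is not available here, the following sketch builds a small hypothetical data frame (column names borrowed from the sports watch example, values made up) and inspects its data types with info() and the dtypes attribute.

```python
import pandas as pd

# Hypothetical sample with an integer, a floating-point and a text column.
sample = pd.DataFrame({
    "Duration": [30, 45, 60],
    "Hours_Sleep": [7.5, 8.0, 6.5],
    "Training": ["run", "bike", "run"],
})

# info() prints column names, non-null counts and data types.
sample.info()

# dtypes gives only the data type of each column.
print(sample.dtypes)
```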
Analyze the Data
When we have cleaned the data set, we can start
analyzing the data.
We can use the describe() method of a Pandas DataFrame to
summarize data:
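A minimal sketch of describe(), assuming a data frame built from the pulse and calorie values used earlier in this lecture (pairing the two arrays and the names health and summary are assumptions for illustration):

```python
import pandas as pd

health = pd.DataFrame({
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = health.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "Average_Pulse"])
```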

Editor's Notes

  1. Data Science is about data gathering, analysis and decision-making. Data Science is about finding patterns in data through analysis and making future predictions.
  2. For route planning: to discover the best routes to ship. To foresee delays for flights, ships, trains, etc. (through predictive analysis). To create promotional offers. To find the best-suited time to deliver goods. To forecast the next year's revenue for a company. To analyze the health benefits of training. To predict who will win elections.
  3. A Data Scientist must find patterns within the data. Before they can find the patterns, they must organize the data in a standard format.
  4. It is common to work with very large data sets in Data Science.
  5. This dataset contains information of a typical training session such as duration, average pulse, calorie burnage etc.
  6. A row is a horizontal representation of data. A column is a vertical representation of data.
  7. There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep). There are 11 rows, meaning that each variable has 10 observations. Why only 10 observations when there are 11 rows? Because the first row is the label row: it holds the names of the variables.
  8. Example explained: Import the Pandas library as pd. Define data with columns and rows in a variable named d. Create a data frame using the function pd.DataFrame(). The data frame contains 3 columns and 5 rows. Print the data frame with the print() function.
  9. Why can we not just count the rows and columns ourselves? If we work with larger data sets with many columns and rows, counting them by hand is confusing and error-prone. If we use the built-in functions in Python correctly, we can be sure that the count is correct.
  10. The data set above consists of 6 variables, each with 10 observations: Duration - How long did the training session last, in minutes? Average_Pulse - What was the average pulse during the training session? This is measured in beats per minute. Max_Pulse - What was the maximum pulse during the training session? Calorie_Burnage - How many calories were burnt during the training session? Hours_Work - How many hours did we work at our job before the training session? Hours_Sleep - How much did we sleep the night before the training session? We use an underscore (_) to separate words in the names because Python identifiers cannot contain spaces.
  11. We write np. in front of mean to let Python know that we want to use the mean function from the NumPy library.
  12. Before analyzing data, a Data Scientist must extract the data and make it clean and valuable. Example explained: Import the Pandas library. Name the data frame sample_data.
  13. By knowing the type of your data, you will be able to know what technique to use when analyzing them.