Ahmad B. Abdullahi, Ahmed Olanrewaju, Bilikisu Aderinto, Akinyomade Owolabi,
Olalekan Olapeju, Kamoldeen Abiona, Oluwabunmi Ogunnowo, Busayo Coker
Version
Ahmad Bello Abdullahi, Ahmed Olanrewaju, Bilikisu Aderinto,
Olalekan Olapeju, Kamoldeen Abiona, Oluwabunmi Ogunnowo, Busayo Coker,
Akinyomade Owolabi
Meteorologist,
Nigerian Meteorological Agency (NiMet),
Abuja
Senior Systems Analyst,
Management information systems unit
University of Ibadan, Ibadan
Head of operation,
Pakino Nigeria Ltd
Principal Consultant,
Cheetahsoft Consulting Limited
Abuja
System Engineer,
Computer Warehouse Group (CWG)
Assistant Superintendent of Corps II
Nigeria Security & Civil Defence Corps
Education Officer I
(Mathematics & Further Mathematics)
Lagos Education District IV
Programmes Officer,
New Nigeria Foundation
to TEAM
Arthur Samuel (1959)
"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."
The Tools
Outline
Project Description & Checklist
Data Loading, Merging and Visualisation
Feature Cleaning, Selection & Transformation
Machine Learning Algorithm Adoption
Model Performance Evaluation
Model Validation, Fine-Tuning & Ensembling
1
Project Description & Checklist
The Description
To use machine learning
techniques to perform
exploratory and predictive
analyses on crime data.
Project Description, Resources & Checklist
The Datasets
Dataset A: Data on crime reported across the country and the respective police stations (2015/2016).
Dataset B: Data on the names of police stations and the population that falls under their jurisdiction.
Dataset C: Data on the location (i.e. geographical coordinates) of the police stations across the country.
Dataset D: Additional data (to be sourced later).
Project Description & Checklist
Checklist
Checklist 1
Is it a supervised, unsupervised or reinforcement machine
learning project?
Supervised Learning
Outcome feature is known.
Task driven.
Fits data.
Its goal is to predict values in continuous (regression) or categorical (classification) format.

Unsupervised Learning
Outcome feature is unknown.
Data driven.
Clusters data.
The computer learns by searching; it aims at finding patterns.
Its goal is to find patterns (clustering) in the data.

Reinforcement Learning
Outcome feature is unknown.
Circumstance driven.
Decides on data.
Its goal is to learn how to decide under a given circumstance.
Supervised Learning (Labelled Data)
Id     Province    Police Station  Population  Burglary (label)
AB123  Gauteng     Dunnottar       10479       141
AB123  North West  Mmabatho        134138      773

Id     Province    Police Station  Population  Frequent Crime (label)
AB123  Gauteng     Dunnottar       10479       Burglary
AB123  North West  Mmabatho        134138      Arson

Unsupervised Learning (Unlabelled Data)
Id     Province    Police Station  Population  Burglary  Crime Type
AB123  Gauteng     Dunnottar       10479       141       Burglary
AB123  North West  Mmabatho        134138      773       Arson
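The difference between the two settings can be sketched in a few lines of base R. This is only an illustration on made-up numbers (the data frame `stations` and its values are hypothetical, loosely modelled on the tables above), not the model used later in the project.

```r
# Hypothetical toy data in the spirit of the slide tables (values are made up)
stations <- data.frame(
  Population = c(10479, 134138, 52830, 20938, 13587, 127623),
  Burglary   = c(141, 773, 410, 95, 60, 690)
)

# Supervised: the outcome (Burglary) is known, so a model is fitted to predict it
fit <- lm(Burglary ~ Population, data = stations)
predicted <- predict(fit, newdata = data.frame(Population = 50000))

# Unsupervised: no outcome column is singled out; k-means only groups similar rows
set.seed(42)  # k-means starts from random centres
clusters <- kmeans(scale(stations), centers = 2)
stations$Cluster <- clusters$cluster
```

In the supervised call the formula names the label explicitly; in the unsupervised call every column is treated alike and the algorithm returns group memberships instead of predictions.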
Project Description & Checklist
Checklist
Checklist 1: Is it a supervised or unsupervised machine learning project?
Checklist 2: Is it a classification or regression task?
Supervised Learning (Labelled Data)
Regression: the values are continuous.
Id     Province    Police Station  Population  Burglary
AB123  Gauteng     Dunnottar       10479       141
AB123  North West  Mmabatho        134138      773
Classification: the values are categorical.
Id     Province    Police Station  Population  Frequent Crime
AB123  Gauteng     Dunnottar       10479       Burglary
AB123  North West  Mmabatho        134138      Arson
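As a rough aid for Checklist 2, the type of the target column already hints at the task. The helper below, `task_type`, is a hypothetical name introduced only for this sketch; it assumes that categories are stored as text or factors rather than as numeric codes.

```r
# Hypothetical helper: suggest the task type from the class of the target column
task_type <- function(target) {
  if (is.numeric(target)) {
    "regression"
  } else if (is.factor(target) || is.character(target) || is.logical(target)) {
    "classification"
  } else {
    "unknown"
  }
}

# The two example rows from the tables above
crime <- data.frame(
  Police_Station = c("Dunnottar", "Mmabatho"),
  Burglary       = c(141, 773),
  Frequent_Crime = c("Burglary", "Arson")
)

task_type(crime$Burglary)        # "regression"
task_type(crime$Frequent_Crime)  # "classification"
```

A category encoded as a number (for example 1 = Burglary, 2 = Arson) would be reported as "regression", so the result should be read as a hint, not a decision.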
Project Description, Resources & Checklist
Checklist
Checklist 1: Is it a supervised, unsupervised or reinforcement machine learning project?
Checklist 2: Is it a classification or regression task?
Checklist 3: Identify the target feature or the features to be clustered.
Checklist 4: Can I get extra data or features to boost my project?
Checklist 5: What are the available solutions to the problem?
Checklist 6: How do I intend to measure the performance of my model?
Checklist 7: How will my solution be deployed and utilised?
2
Data Loading, Merging & Visualisation
Data Form
Alpha-Numeric: $1,000 | 2.0 | 1 | 10-5 | 2014-08-21 | Male, Female | Yes, No
Text: "If you cannot do great things, do small things in a great way." This is a quote by Napoleon Hill.
Image
Audio
Video
Data Loading, Merging & Visualisation
Data Loading Checklist
Data Location: Where is the dataset located? Computer | Server | Web | Cloud.
Data Form: The dataset is in what form? Alpha-Numeric | Text | Image | Audio | Video.
Data Size: How big is the dataset? Is the size in kilobytes, megabytes, gigabytes or terabytes?
Data Flow: Is it real-time data? Does it come as a stream or in batches?
Analysis Platform: Can I analyse it on my computer, or do I need to engage the service of a cloud-based computing provider, e.g. Microsoft Azure, Amazon Web Services (AWS), Google Cloud etc.?
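The "Data Size" question can be answered from within R before anything is loaded. The sketch below uses base R's `file.size()`; the helper name `describe_size` is hypothetical, and the temporary file only stands in for a real dataset so that the example runs anywhere.

```r
# Hypothetical helper: report a file's size in a human-readable unit
describe_size <- function(path) {
  bytes <- file.size(path)          # size in bytes, NA if the file is missing
  if (is.na(bytes)) stop("File not found: ", path)
  units <- c("bytes", "KB", "MB", "GB", "TB")
  i <- 1
  size <- bytes
  while (size >= 1024 && i < length(units)) {
    size <- size / 1024
    i <- i + 1
  }
  paste(round(size, 1), units[i])
}

# Usage on a throwaway file, standing in for a real dataset
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), tmp, row.names = FALSE)
describe_size(tmp)
```

A file of a few megabytes loads comfortably on a laptop; once the size approaches the machine's memory, the "Analysis Platform" question becomes relevant.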
Data Loading, Merging & Visualisation
Data Loading Steps
Step 1: Start RStudio
Start Menu > RStudio
It is assumed that you have already installed RStudio.
The RStudio panes:
This pane is for writing code.
This shows the loaded data.
This is for packages, plots etc.
Data Loading, Merging & Visualisation
Data Loading Steps
Step 2: Install the necessary R packages
install.packages("dplyr")
install.packages("tidyr")    # provides spread(), used for reshaping
install.packages("pastecs")
install.packages("ggplot2")
Step 3: Load the packages
library("dplyr")
library("tidyr")
library("pastecs")
library("ggplot2")
Step 4: Set the working directory
setwd("C:/Project_Analytics/SA_Crime_Analysis")
Step 5: Load the data
Dataset_A <- read.csv("dataset/Dataset_A.csv")
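Re-running `install.packages()` on every session is slow, so one option is to install only what is missing. The following is a sketch under assumptions: `missing_packages` is a hypothetical helper name, and tidyr is added to the list because `spread()` is provided by that package.

```r
# Packages used in this walkthrough (tidyr is assumed because spread() lives in it)
needed <- c("dplyr", "tidyr", "pastecs", "ggplot2")

# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)   # only runs in an interactive session
}
```

The `interactive()` guard keeps the script from downloading packages when it is run unattended; remove it if automatic installation is wanted.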
Data Loading, Merging & Visualisation
Project Data Loading
Dataset A - Crime Reported and Police Station
The dataset is in csv (comma delimited) format.
# Loading the dataset
Dataset_A <- read.csv("Dataset_A.csv")
# Viewing the top 6 records
head(Dataset_A)
# Sorting the records by 'Police_Station'
Dataset_A <- Dataset_A[order(Dataset_A$Police_Station), ]
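Sorting a data frame by one column is usually done with `order()`, which returns the row positions in sorted order. A minimal sketch on a hypothetical three-row miniature of Dataset A (the pairing of stations and counts is illustrative only):

```r
# Hypothetical miniature of Dataset A
crime <- data.frame(
  Police_Station   = c("Mmabatho", "Aberdeen", "Dunnottar"),
  Period_2015_2016 = c(773, 51, 141)
)

# order() returns the row positions that put Police_Station in A-Z order
sorted <- crime[order(crime$Police_Station), ]
sorted$Police_Station   # "Aberdeen" "Dunnottar" "Mmabatho"
```

Indexing the rows with the column itself, without `order()`, looks rows up by name or factor code and does not sort them.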
Data Loading, Merging & Visualisation
Project Data Loading
DatasetA
str(Dataset_A)
Data Loading, Merging & Visualisation
Reshaping the dataset
DatasetA
Long Format
Province      Police_Station  Crime_Category                         Period_2015_2016
Eastern Cape  Aberdeen        All theft not mentioned elsewhere      51
Eastern Cape  Aberdeen        Theft out of or from motor vehicle     7
Eastern Cape  Aberdeen        Theft of motor vehicle and motorcycle  2
Eastern Cape  Aberdeen        Stock-theft                            20
Wide Format
Province      Police_Station  All theft not mentioned elsewhere  Theft out of or from motor vehicle  Theft of motor vehicle and motorcycle  Stock-theft
Eastern Cape  Aberdeen        51                                 7                                   2                                      20
Data Loading, Merging & Visualisation
Project Data Loading
DatasetA
Reshaping (pivoting) the dataset from "long" to "wide" format
# spread() comes from the tidyr package
Dataset_A_Wide <- spread(Dataset_A, Crime_Category, Period_2015_2016)
Viewing the top 5 records
head(Dataset_A_Wide, n=5)
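In current versions of tidyr, `spread()` is superseded by `pivot_wider()`. The sketch below applies `pivot_wider()` to the four long-format Aberdeen rows shown earlier; the object names `long` and `wide` are introduced here for illustration, and the tidyr package is assumed to be installed.

```r
library(tidyr)

# The four long-format rows shown on the slide
long <- data.frame(
  Province         = "Eastern Cape",
  Police_Station   = "Aberdeen",
  Crime_Category   = c("All theft not mentioned elsewhere",
                       "Theft out of or from motor vehicle",
                       "Theft of motor vehicle and motorcycle",
                       "Stock-theft"),
  Period_2015_2016 = c(51, 7, 2, 20)
)

# One row per station, one column per crime category
wide <- pivot_wider(long,
                    names_from  = Crime_Category,
                    values_from = Period_2015_2016)

dim(wide)  # 1 row, 6 columns
```

The two identifier columns (Province, Police_Station) are kept, and each distinct Crime_Category becomes a column of its own.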
Data Loading, Merging & Visualisation
Project Data Loading
DatasetA
Viewing the properties of the reshaped dataset
str(Dataset_A_Wide)
Data Loading, Merging & Visualisation
Project Data Loading
DatasetA
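Check the dataset for duplicates
This is a major check before merging this dataset with the other datasets.
length(duplicated(Dataset_A_Wide$Police_Station))   # number of records examined
[1] 1143
sum(duplicated(Dataset_A_Wide$Police_Station))      # number of duplicated station names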
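`duplicated()` returns one TRUE/FALSE flag per record, so its length is always the number of records; counting or listing the duplicates needs `sum()` or subsetting. A small sketch on a hypothetical vector of station names with one repeat:

```r
# Hypothetical station names with one repeat
stations <- c("Aberdeen", "Addo", "Adelaide", "Addo")

length(duplicated(stations))    # 4: one TRUE/FALSE flag per record
sum(duplicated(stations))       # 1: number of repeated entries
stations[duplicated(stations)]  # "Addo": the repeated value itself
anyDuplicated(stations) > 0     # TRUE if at least one duplicate exists
```

A join key such as Police_Station should give a sum of 0 on the right-hand table of a left join; otherwise the join returns more rows than the left-hand table.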
Data Loading, Merging & Visualisation
Project Data Loading
Dataset B - Police Stations and the Population that they Cover
The dataset is in xlsx (MS Excel) format.
# Install and load the library
install.packages("xlsx")
library("xlsx")
NB: You need to install Java and set JAVA_HOME for it to work. Download Java via the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jdk9-downloads-3848520.html
# Load the dataset (sheetIndex = 1 assumes the data is on the first sheet)
Dataset_B <- read.xlsx("Dataset_B.xlsx", sheetIndex = 1)
# Sort the dataset
Dataset_B <- Dataset_B[order(Dataset_B$Police_Station), ]
# Viewing the top 5 records
head(Dataset_B, n = 5)
  Police_Station population_estimate
1 ABERDEEN                  9866.916
2 ACORNHOEK               127623.360
3 ACTONVILLE               52830.848
4 ADDO                     20938.325
5 ADELAIDE                 13587.573
Data Loading, Merging & Visualisation
Project Data Loading
DatasetB
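Viewing the attributes of the features
str(Dataset_B)
Check the dataset for duplicates
length(duplicated(Dataset_B$Police_Station))   # number of records examined
[1] 1140
sum(duplicated(Dataset_B$Police_Station))      # number of duplicated station names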
Data Loading, Merging & Visualisation
Project Data Loading
Dataset C - Police Stations and their Geo-Coordinates
The dataset is in tsv (tab delimited) format.
# Load the dataset ("\t" is the tab character)
Dataset_C <- read.table("Dataset_C.tsv", header=TRUE, sep="\t")
# Sort the dataset
Dataset_C <- Dataset_C[order(Dataset_C$Police_Station), ]
# Viewing the top 6 records
head(Dataset_C)
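For a tab-delimited file the separator passed to `read.table()` has to be the tab character, written `"\t"`. The sketch below writes a tiny file to a temporary location so that it can run on its own; the station names and coordinates are placeholders, not values from Dataset C.

```r
# Write a tiny tab-delimited file so the example is self-contained
# (station names and coordinates are placeholders, not real values)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Police_Station\tLongitudeY\tLatitudeX",
             "STATION_A\t25.0\t-30.0",
             "STATION_B\t26.5\t-28.5"), tmp)

# "\t" (backslash-t) is the tab character; a plain 't' would split on the letter t
stations <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(stations)
```

`read.delim(tmp)` is a base R shortcut with the tab separator and `header = TRUE` already set.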
Data Loading, Merging & Visualisation
Project Data Loading
DatasetC
Viewing the attributes of the features
Check the datasets for duplicates
Dataset A: Total Records = 1143. Features: Province, Police_Station, +27 features.
Dataset B: Total Records = 1140. Features: Police_Station, population_estimate.
Dataset C: Total Records = 1142. Features: Police_Station, LongitudeY, LatitudeX.
Data Loading, Merging & Visualisation
Datasets Merging
Dataset A (1143 records): Province, Police_Station, Crime_Category, Period_2015_2016
Dataset B (1140 records): Police_Station, population_estimate
Dataset C (1142 records): Police_Station, LongitudeY, LatitudeX
Data Loading, Merging & Visualisation
Datasets Merging
Merging Dataset A & B
Note: Dataset A contains more records than Dataset B, so Dataset A is used as the base (left-hand) table of the join.
paste("Size of Dataset A_Wide =", nrow(Dataset_A_Wide))
paste("Size of Dataset B =", nrow(Dataset_B))
Size of Dataset A_Wide = 1143
Size of Dataset B = 1140
# Left join
Dataset_A_B <- left_join(Dataset_A_Wide, Dataset_B, by="Police_Station")
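The printed samples show station names in title case in Dataset A ("Aberdeen") and in upper case in Dataset B ("ABERDEEN"). If the keys really differ in this way, an exact-match join finds no partner rows, so it may help to normalise the key first. The sketch below uses hypothetical miniature tables (`crime`, `population`, the Burglary counts and the station "Newstation" are invented for illustration) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical miniatures of the two tables; note the different letter case
crime <- data.frame(Police_Station = c("Aberdeen", "Addo", "Newstation"),
                    Burglary       = c(12, 30, 5))
population <- data.frame(Police_Station      = c("ABERDEEN", "ADDO"),
                         population_estimate = c(9866.916, 20938.325))

# Put both keys in the same form before joining
crime$Police_Station      <- toupper(trimws(crime$Police_Station))
population$Police_Station <- toupper(trimws(population$Police_Station))

merged <- left_join(crime, population, by = "Police_Station")

# Stations present in the left table but missing from the right one
unmatched <- anti_join(crime, population, by = "Police_Station")
```

Here `merged` keeps all three stations, with a missing population for the one that has no partner, and `unmatched` lists exactly that station.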
Data Loading, Merging & Visualisation
Datasets Merging
Merging Dataset A_B with Dataset C
paste("Size of Dataset A_B =" , nrow(Dataset_A_B))
paste("Size of Dataset C =" , nrow(Dataset_C))
Size of Dataset A_B = 1143
Size of Dataset C = 1142
#Left Join
Dataset_A_B_C <- left_join(Dataset_A_B, Dataset_C, by="Police_Station")
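A left join should return exactly as many rows as the left-hand table; extra rows usually mean the key is duplicated on the right. The sketch below wraps that check around base R's `merge(..., all.x = TRUE)`, which is swapped in here for `left_join()` so the example needs no packages. The helper name `safe_left_merge` and the coordinates are hypothetical.

```r
# Hypothetical helper: left-join with base merge() and confirm the row count is unchanged
safe_left_merge <- function(left, right, key) {
  out <- merge(left, right, by = key, all.x = TRUE)
  if (nrow(out) != nrow(left)) {
    stop("Row count changed from ", nrow(left), " to ", nrow(out),
         "; check '", key, "' for duplicates in the right-hand table")
  }
  out
}

a_b <- data.frame(Police_Station      = c("ABERDEEN", "ADDO"),
                  population_estimate = c(9866.916, 20938.325))
coords <- data.frame(Police_Station = "ABERDEEN",
                     LongitudeY = 0, LatitudeX = 0)   # placeholder coordinates

a_b_c <- safe_left_merge(a_b, coords, "Police_Station")
colSums(is.na(a_b_c))   # missing values per column after the merge
```

The per-column count of missing values shows how many stations found no partner in the right-hand table.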
Implementing a data science project (R Version) Part1

Implementing a data science project (R Version) Part1

  • 1. Ahmad B. Abdullahi Ahmed Olanrewaju Bilikisu AderintoAkinyomade Owolabi Olalekan OlapejuKamoldeen Abiona Oluwabunmi OgunnowoBusayo Coker Version
  • 2. Ahmad Bello Abdullahi Ahmed Olanrewaju Bilikisu Aderinto Olalekan OlapejuKamoldeen Abiona Oluwabunmi OgunnowoBusayo Coker Akinyomade Owolabi Meteorologist, Nigerian Meteorological Agency (NiMet), Abuja Senior Systems Analyst, Management information systems unit University of Ibadan, Ibadan Head of operation, Pakino Nigeria Ltds Principal Consultant, Cheetahsoft Consulting Limited Abuja System Engineer, Computer Warehouse Group (CWG) Assistant Superintendent of Corps II Nigeria Security & Civil Defence Corps Education Officer I (Mathematicss & Further Mathematics) Lagos Education District IVs Programmes Officer, New Nigeria Foundation to TEAM
  • 3. Arthur Samuel (1959): "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."
  • 6. Outline: Project Description & Checklist; Data Loading, Merging and Visualisation; Feature Cleaning, Selection & Transformation; Machine Learning Algorithm Adoption; Model Performance Evaluation; Model Validation, Fine-Tuning & Ensembling.
  • 7. 1
  • 8. Project Description & Checklist. The Description: To use machine learning techniques to perform exploratory and predictive analyses on crime data.
  • 9. Project Description, Resources & Checklist. The Datasets:
        Dataset A: data on crime reported across the country and the respective police stations (2015/2016).
        Dataset B: data on the names of police stations and the population that falls under their jurisdiction.
        Dataset C: data on the location (i.e. geographical coordinates) of the police stations across the country.
        Dataset D: additional data (to be sourced later).
  • 10. Project Description & Checklist. Checklist 1: Is it a supervised, unsupervised or reinforcement machine learning project?
  • 13. Supervised Learning: outcome feature is known; task driven; fits data; its goal is to predict values in continuous (regression) or categorical (classification) format.
        Unsupervised Learning: outcome feature is unknown; data driven; clusters data; its goal is to find patterns (clustering) in the data.
        Reinforcement Learning: outcome feature is unknown; circumstance driven; decides on data; its goal is to learn how to decide under a given circumstance.
  • 14. Supervised Learning: Labelled Data. The last column of each table is the label.
        Id    | Province   | Police Station | Population | Burglary (label)
        AB123 | Gauteng    | Dunnottar      | 10479      | 141
        AB123 | North West | Mmabatho       | 134138     | 773

        Id    | Province   | Police Station | Population | Frequent Crime (label)
        AB123 | Gauteng    | Dunnottar      | 10479      | Burglary
        AB123 | North West | Mmabatho       | 134138     | Arson
  • 15. Unsupervised Learning: Unlabelled Data.
        Id    | Province   | Police Station | Population | Burglary | Crime Type
        AB123 | Gauteng    | Dunnottar      | 10479      | 141      | Burglary
        AB123 | North West | Mmabatho       | 134138     | 773      | Arson
  • 16. Project Description & Checklist. Checklist 1: Is it a supervised or unsupervised machine learning project? Checklist 2: Is it a classification or regression task?
  • 17. Supervised Learning: Labelled Data.
        Regression (the values are continuous):
        Id    | Province   | Police Station | Population | Burglary
        AB123 | Gauteng    | Dunnottar      | 10479      | 141
        AB123 | North West | Mmabatho       | 134138     | 773

        Classification (the values are categorical):
        Id    | Province   | Police Station | Population | Frequent Crime
        AB123 | Gauteng    | Dunnottar      | 10479      | Burglary
        AB123 | North West | Mmabatho       | 134138     | Arson
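The regression/classification choice can be read off the type of the target column. The following is a minimal sketch using the two example rows shown on the slide; the helper name `task_type` and the column name `Frequent_Crime` are hypothetical, introduced only for illustration.

```r
# Two example rows copied from the slide
stations <- data.frame(
  Province       = c("Gauteng", "North West"),
  Police_Station = c("Dunnottar", "Mmabatho"),
  Population     = c(10479, 134138),
  Burglary       = c(141, 773),                    # continuous target
  Frequent_Crime = factor(c("Burglary", "Arson"))  # categorical target
)

# Hypothetical helper: suggest the supervised task from the target's type
task_type <- function(target) {
  if (is.numeric(target)) {
    "regression"
  } else if (is.factor(target) || is.character(target)) {
    "classification"
  } else {
    "unknown"
  }
}

task_type(stations$Burglary)        # numeric target
task_type(stations$Frequent_Crime)  # factor target
```

A numeric column that merely encodes categories (for example 0/1 flags) would still need to be converted with `factor()` before this check is meaningful.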
  • 18. Project Description, Resources & Checklist. Checklist 1: Is it a supervised, unsupervised or reinforcement machine learning project? Checklist 2: Is it a classification or regression task? Checklist 3: Identify the target feature or the features to be clustered. Checklist 4: Can I get extra data or features to boost my project?
  • 19. Project Description, Resources & Checklist. Checklist 5: What are the available solutions to the problem? Checklist 6: How do I intend to measure the performance of my model? Checklist 7: How will my solution be deployed and utilised?
  • 20. 2
  • 21. Data Loading, Merging & Visualisation. Data Form: Alphanumeric (e.g. $1,000; Male, Female; Yes, No; 2014-08-21; 10-5; 2.0; 1), Text (e.g. "If you cannot do great things, do small things in a great way." This is a quote by Napoleon Hill.), Image, Audio, Video.
  • 22. Data Loading, Merging & Visualisation. Data Loading Checklist:
        Data Location: Where is the dataset located? Computer | Server | Web | Cloud.
        Data Form: What form is the dataset in? Alphanumeric | Text | Image | Audio | Video.
        Data Size: How big is the dataset? Is the size in kilobytes, megabytes, gigabytes or terabytes?
        Data Flow: Is it real-time data? Does it come as a stream or in batches?
        Analysis Platform: Can I analyse it on my computer, or do I need to engage the services of a cloud-based computing provider, e.g. Microsoft Azure, Amazon Web Services (AWS), Google Cloud etc.?
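The "Data Size" question can be answered from R before committing to a platform. The following is a minimal sketch that writes a tiny temporary CSV (two rows taken from the earlier example tables) so that it runs anywhere; with a real file, `path` would simply point at the dataset.

```r
# Write a small temporary CSV so the sketch is self-contained
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Police_Station = c("Dunnottar", "Mmabatho"),
             Burglary = c(141, 773)),
  path, row.names = FALSE
)

# Size on disk, in bytes and in megabytes
size_bytes <- file.size(path)
size_mb <- size_bytes / 1024^2

# Size in memory once the file is loaded
loaded <- read.csv(path)
mem <- format(object.size(loaded), units = "Kb")

cat("On disk:", size_bytes, "bytes, i.e.", round(size_mb, 6), "MB\n")
cat("In memory:", mem, "\n")
```

The in-memory size is usually the one that decides whether a laptop is enough, since R loads the whole data frame into RAM.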
  • 23. Data Loading, Merging & Visualisation. Data Loading Steps. Step 1: Start RStudio (Start Menu, then RStudio). It is assumed that you have already installed RStudio.
  • 27. The RStudio panes: This pane is for writing code. This pane shows the loaded data. This pane is for packages, plots etc.
  • 28. Data Loading, Merging & Visualisation. Data Loading Steps:
        Step 2. Install the necessary R packages: install.packages("dplyr"); install.packages("pastecs"); install.packages("ggplot2")
        Step 3. Load the packages: library("dplyr"); library("pastecs"); library("ggplot2")
        Step 4. Set the working directory (R paths take forward slashes or doubled backslashes): setwd("C:/Project_Analytics/SA_Crime_Analysis")
        Step 5. Load the data: Dataset_A <- read.csv("dataset/Dataset_A.csv")
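Re-running `install.packages()` on every session is slow and needs a network connection. The following is a minimal sketch, not part of the original slides, that checks which of the three packages are missing and leaves the actual install commented out.

```r
# Packages used on the slides
pkgs <- c("dplyr", "pastecs", "ggplot2")

# TRUE for each package that can be loaded on this machine
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Keep only the packages that are not installed yet
missing_pkgs <- pkgs[!available]

# Install what is missing (uncomment to run; needs an internet connection)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

missing_pkgs
```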
  • 29. Data Loading, Merging & Visualisation. Project Data Loading. Dataset A - Crime Reported and Police Station. The dataset is in csv (comma delimited) format.
        # Loading the dataset
        Dataset_A <- read.csv("Dataset_A.csv")
        # Viewing the top 6 records
        head(Dataset_A)
        # Sorting the records using 'Police_Station'
        Dataset_A[order(Dataset_A$Police_Station), ]
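Sorting a data frame by a column relies on `order()`, which returns the row positions in sorted order; indexing with the column values themselves does not sort. The following is a minimal sketch on three station/province pairs taken from the slides; the name `toy_A` is hypothetical.

```r
# Three rows in unsorted order (station/province pairs from the slides)
toy_A <- data.frame(
  Police_Station = c("Mmabatho", "Aberdeen", "Dunnottar"),
  Province       = c("North West", "Eastern Cape", "Gauteng"),
  stringsAsFactors = FALSE
)

# order() gives the row permutation that sorts Police_Station ascending
row_order <- order(toy_A$Police_Station)

# Apply the permutation to the rows; keep all columns
sorted_A <- toy_A[row_order, ]

sorted_A$Police_Station
```

The result has to be assigned (as with `sorted_A` here) for the sorted order to persist; printing the expression alone leaves the original data frame unchanged.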
  • 30. Data Loading, Merging & Visualisation. Project Data Loading. Dataset A: viewing the properties of the dataset: str(Dataset_A)
  • 31. Data Loading, Merging & Visualisation. Reshaping the dataset (Dataset A).
        Long Format:
        Province     | Police_Station | Crime_Category                        | Period_2015_2016
        Eastern Cape | Aberdeen       | All theft not mentioned elsewhere     | 51
        Eastern Cape | Aberdeen       | Theft out of or from motor vehicle    | 7
        Eastern Cape | Aberdeen       | Theft of motor vehicle and motorcycle | 2
        Eastern Cape | Aberdeen       | Stock-theft                           | 20

        Wide Format:
        Province     | Police_Station | All theft not mentioned elsewhere | Theft out of or from motor vehicle | Theft of motor vehicle and motorcycle | Stock-theft
        Eastern Cape | Aberdeen       | 51                                | 7                                  | 2                                     | 20
  • 32. Data Loading, Merging & Visualisation. Project Data Loading. Dataset A: reshaping (pivoting) the dataset from "long" to "wide" format.
        # spread() comes from the tidyr package, which must be installed and loaded first
        library("tidyr")
        Dataset_A_Wide <- spread(Dataset_A, Crime_Category, Period_2015_2016)
        # Viewing the top 5 records
        head(Dataset_A_Wide, n=5)
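The long-to-wide pivot can be tried on the four Aberdeen rows from the earlier slide. The following is a minimal sketch that swaps in base R's `reshape()` as a stand-in for `spread()`, so that it runs without extra packages; the object names `long` and `wide` are hypothetical.

```r
# The four long-format rows shown on the slide
long <- data.frame(
  Province         = rep("Eastern Cape", 4),
  Police_Station   = rep("Aberdeen", 4),
  Crime_Category   = c("All theft not mentioned elsewhere",
                       "Theft out of or from motor vehicle",
                       "Theft of motor vehicle and motorcycle",
                       "Stock-theft"),
  Period_2015_2016 = c(51, 7, 2, 20),
  stringsAsFactors = FALSE
)

# One output row per (Province, Police_Station); one column per crime category
wide <- reshape(
  long,
  idvar     = c("Province", "Police_Station"),
  timevar   = "Crime_Category",
  v.names   = "Period_2015_2016",
  direction = "wide"
)

# reshape() prefixes the new columns with "Period_2015_2016."; strip the prefix
names(wide) <- sub("^Period_2015_2016\\.", "", names(wide))

wide
```

In current tidyr releases `spread()` is superseded by `pivot_wider()`, which performs the same reshaping.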
  • 33. Data Loading, Merging & Visualisation. Project Data Loading. Dataset A: viewing the properties of the reshaped dataset: str(Dataset_A_Wide)
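  • 34. Data Loading, Merging & Visualisation. Project Data Loading. Dataset A: check the dataset for duplicates. This is a major checklist item before merging this dataset with the other datasets.
        length(duplicated(Dataset_A_Wide$Police_Station))
        [1] 1143
        Note: length(duplicated(x)) returns the number of records (here 1143), not the number of duplicates. sum(duplicated(x)) counts the duplicated entries.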
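The difference between counting records and counting duplicates is easy to see on a small vector. The following is a minimal sketch on made-up input: four station names with one deliberate repeat.

```r
# Made-up example: "Aberdeen" appears twice
stations <- c("Aberdeen", "Acornhoek", "Aberdeen", "Addo")

# duplicated() flags every repeat of an earlier value
flags <- duplicated(stations)

n_records    <- length(flags)            # number of elements, duplicates or not
n_duplicates <- sum(flags)               # number of repeated entries
first_dup    <- anyDuplicated(stations)  # position of the first repeat, 0 if none
repeated     <- stations[flags]          # the repeated values themselves

n_records
n_duplicates
first_dup
repeated
```

A join key is safe to merge on when `n_duplicates` is 0 (equivalently, when `anyDuplicated()` returns 0).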
  • 35. Data Loading, Merging & Visualisation. Project Data Loading. Dataset B - Police Stations and the Population that they Cover. The dataset is in xlsx (MS Excel) format.
        install.packages("xlsx")
        # Load the library
        library("xlsx")
        # Load the dataset (sheetIndex = 1 assumes the data is on the first sheet)
        Dataset_B <- read.xlsx("Dataset_B.xlsx", sheetIndex = 1)
        # Sort the dataset
        Dataset_B[order(Dataset_B$Police_Station), ]
        # Viewing the top 5 records
        head(Dataset_B, n = 5)
          Police_Station population_estimate
        1 ABERDEEN                  9866.916
        2 ACORNHOEK               127623.360
        3 ACTONVILLE               52830.848
        4 ADDO                     20938.325
        5 ADELAIDE                 13587.573
        NB: You need to install Java and set JAVA_HOME for it to work. Download Java via the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk9-downloads-3848520.html
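  • 36. Data Loading, Merging & Visualisation. Project Data Loading. Dataset B: viewing the attributes of the features and checking the dataset for duplicates.
        str(Dataset_B)
        length(duplicated(Dataset_B$Police_Station))
        [1] 1140
        Note: this value is the number of records in Dataset B. sum(duplicated(Dataset_B$Police_Station)) gives the number of duplicated station names.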
  • 37. Data Loading, Merging & Visualisation. Project Data Loading. Dataset C - Police Stations and their Geo-Coordinates. The dataset is in tsv (tab delimited) format.
        # Load the dataset (the tab separator is written "\t")
        Dataset_C <- read.table("Dataset_C.tsv", header=TRUE, sep="\t")
        # Sort the dataset
        Dataset_C[order(Dataset_C$Police_Station), ]
        # Viewing the top 6 records
        head(Dataset_C)
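Reading a tab-delimited file depends on passing the escaped tab character `"\t"` as the separator. The following is a minimal sketch that writes a tiny temporary file with the three Dataset C column names; the station names and coordinate values are placeholders, not real data.

```r
# Write a small tab-delimited file so the sketch is self-contained
path <- tempfile(fileext = ".tsv")
writeLines(
  c("Police_Station\tLongitudeY\tLatitudeX",
    "Station A\t1.5\t-2.5",   # placeholder values
    "Station B\t3.0\t-4.0"),  # placeholder values
  path
)

# sep = "\t" splits on tabs only, so "Station A" stays in one field
coords <- read.table(path, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)

str(coords)
```

`read.delim(path)` is a shorthand with the same tab separator and `header = TRUE` as defaults.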
  • 38. Data Loading, Merging & Visualisation Project Data Loading DatasetC Viewing the attributes of the features Check the datasets for duplicates
  • 39. Summary of the datasets:
        Dataset A: Total Records = 1143. Features: Province, Police_Station, +27 features.
        Dataset B: Total Records = 1140. Features: Police_Station, population_estimate.
        Dataset C: Total Records = 1142. Features: Police_Station, LongitudeY, LatitudeX.
  • 40. Data Loading, Merging & Visualisation. Datasets Merging:
        Dataset A (1143 records): Province, Police_Station, Crime_Category, Period_2015_2016.
        Dataset B (1140 records): Police_Station, population_estimate.
        Dataset C (1142 records): Police_Station, LongitudeY, LatitudeX.
  • 41. Data Loading, Merging & Visualisation. Datasets Merging. Merging Dataset A & B. Note: Dataset A contains more records than Dataset B. Hence, Dataset A is the universal dataset, i.e. the left table whose records are all kept by the join.
        paste("Size of Dataset A_Wide =", nrow(Dataset_A_Wide))
        paste("Size of Dataset B =", nrow(Dataset_B))
        paste("Size of Dataset C =", nrow(Dataset_C))
        Size of Dataset A_Wide = 1143
        Size of Dataset B = 1140
        # Left Join
        Dataset_A_B <- left_join(Dataset_A_Wide, Dataset_B, by="Police_Station")
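The slide previews show station names in title case for Dataset A ("Aberdeen") and in upper case for Dataset B ("ABERDEEN"). If the real files differ in this way, a join on `Police_Station` would find no matches. The following is a minimal sketch, assuming that mismatch, which normalises the key before joining. It uses base R's `merge(..., all.x = TRUE)` as a stand-in for `left_join()` so that it runs without extra packages; `toy_A`, `toy_B` and `normalise_key` are hypothetical names.

```r
# Toy rows built from values shown on the slides
toy_A <- data.frame(
  Province       = c("Eastern Cape", "Gauteng"),
  Police_Station = c("Aberdeen", "Dunnottar"),
  stringsAsFactors = FALSE
)
toy_B <- data.frame(
  Police_Station      = c("ABERDEEN", "ADDO"),
  population_estimate = c(9866.916, 20938.325),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace and upper-case the join key
normalise_key <- function(x) toupper(trimws(x))
toy_A$Police_Station <- normalise_key(toy_A$Police_Station)
toy_B$Police_Station <- normalise_key(toy_B$Police_Station)

# Left join: keep every row of A, attach B where the key matches
toy_A_B <- merge(toy_A, toy_B, by = "Police_Station", all.x = TRUE)

# Stations in A with no population record in B
unmatched <- toy_A_B$Police_Station[is.na(toy_A_B$population_estimate)]
unmatched
```

Unlike `left_join()`, `merge()` sorts the result by the key and moves the key column to the front; the matched values are the same.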
  • 42. Data Loading, Merging & Visualisation. Datasets Merging. Merging Dataset A_B with Dataset C.
        paste("Size of Dataset A_B =", nrow(Dataset_A_B))
        paste("Size of Dataset C =", nrow(Dataset_C))
        Size of Dataset A_B = 1143
        Size of Dataset C = 1142
        # Left Join
        Dataset_A_B_C <- left_join(Dataset_A_B, Dataset_C, by="Police_Station")
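After a left join it is worth confirming that the row count is unchanged and listing the keys that found no partner. The following is a minimal sketch on toy data, using base R's `merge()` in place of `left_join()`; the helper `check_left_join` is hypothetical, the coordinates are placeholders, and the repeated "ADDO" row is deliberate.

```r
# Hypothetical helper: left join plus two sanity checks
check_left_join <- function(left, right, by) {
  joined <- merge(left, right, by = by, all.x = TRUE)
  list(
    joined         = joined,
    rows_preserved = nrow(joined) == nrow(left),          # FALSE if right has duplicate keys
    unmatched_keys = setdiff(left[[by]], right[[by]])     # keys of left absent from right
  )
}

toy_A_B <- data.frame(
  Police_Station      = c("ABERDEEN", "ADDO", "ADELAIDE"),
  population_estimate = c(9866.916, 20938.325, 13587.573),
  stringsAsFactors = FALSE
)
toy_C <- data.frame(
  Police_Station = c("ABERDEEN", "ADDO", "ADDO"),  # "ADDO" repeated on purpose
  LongitudeY     = c(1, 2, 3),                     # placeholder coordinates
  LatitudeX      = c(-1, -2, -3),
  stringsAsFactors = FALSE
)

res <- check_left_join(toy_A_B, toy_C, "Police_Station")
res$rows_preserved
res$unmatched_keys
```

Here the duplicated key on the right side adds a row to the result, which is why duplicates are checked before merging.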
  • 43. Please subscribe to my YouTube channel for the other versions, and like the video on LinkedIn and YouTube.