Implementing a data science project (R Version) Part1


Ahmad Bello Abdullahi, Ahmed Olanrewaju, Bilikisu Aderinto,
Olalekan Olapeju, Kamoldeen Abiona, Oluwabunmi Ogunnowo, Busayo Coker,
Akinyomade Owolabi
Meteorologist, Nigerian Meteorological Agency (NiMet), Abuja
Senior Systems Analyst, Management Information Systems Unit, University of Ibadan, Ibadan
Head of Operations, Pakino Nigeria Ltd
Principal Consultant, Cheetahsoft Consulting Limited, Abuja
System Engineer, Computer Warehouse Group (CWG)
Assistant Superintendent of Corps II, Nigeria Security & Civil Defence Corps
Education Officer I (Mathematics & Further Mathematics), Lagos Education District IV
Programmes Officer, New Nigeria Foundation
to TEAM

Arthur Samuel (1959)
Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Outline
Project Description & Checklist
Data Loading, Merging and Visualisation
Feature Cleaning, Selection & Transformation
Machine Learning Algorithm Adoption
Model Performance Evaluation
Model Validation, Fine-Tuning & Ensembling

The Description
To use machine learning techniques to perform exploratory and predictive analyses on crime data.

Project Description, Resources & Checklist
The Datasets
Dataset A
Data on crime reported across the country and the respective police stations (2015/2016).
Dataset B
Data on the names of police stations and the population that falls under their jurisdiction.
Dataset C
Data on the location (i.e. geographical coordinates) of the police stations across the country.
Dataset D
Additional data (to be sourced later).

Checklist
Checklist 1 Is it a supervised, unsupervised or reinforcement machine learning project?

Supervised Learning
Outcome feature is known.
Task driven.
Fits data.
Its goal is to predict values in continuous (regression) or categorical (classification) format.

Unsupervised Learning
Outcome feature is unknown.
Data driven.
Clusters data.
The computer learns by searching; it aims at finding patterns.
Its goal is to find patterns (clustering) in the data.

Reinforcement Learning
Outcome feature is unknown.
Circumstance driven.
Decides on data.
Its goal is to learn how to decide under a given circumstance.
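
The difference between the first two learning types can be sketched in base R. This is only an illustration on a made-up data frame (the numbers are hypothetical, not taken from the project datasets): lm() is given a known outcome to predict, while kmeans() is given no outcome and simply groups the rows.

```r
# Hypothetical toy data: population and burglary counts for six stations
toy <- data.frame(
  Population = c(10000, 20000, 30000, 110000, 120000, 130000),
  Burglary   = c(100, 210, 290, 1120, 1190, 1310)
)

# Supervised: the outcome (Burglary) is known, so a model is fitted to predict it
fit <- lm(Burglary ~ Population, data = toy)
predicted <- predict(fit, newdata = data.frame(Population = 50000))

# Unsupervised: no outcome is supplied; kmeans() searches for groups in the data
set.seed(1)  # kmeans() starts from randomly chosen centres
groups <- kmeans(scale(toy), centers = 2)$cluster

print(predicted)
print(groups)
```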

Supervised Learning (Labelled Data)

Id     Province    Police Station  Population  Burglary (Label)
AB123  Gauteng     Dunnottar       10479       141
AB123  North West  Mmabatho        134138      773

Id     Province    Police Station  Population  Frequent Crime (Label)
AB123  Gauteng     Dunnottar       10479       Burglary
AB123  North West  Mmabatho        134138      Arson

Unsupervised Learning (Unlabelled Data)

Id     Province    Police Station  Population  Burglary  Crime Type
AB123  Gauteng     Dunnottar       10479       141       Burglary
AB123  North West  Mmabatho        134138      773       Arson

Checklist
Checklist 1 Is it a supervised or unsupervised machine learning project?
Checklist 2 Is it a classification or regression task?

Supervised Learning (Labelled Data)

Regression: the values are continuous
Id     Province    Police Station  Population  Burglary
AB123  Gauteng     Dunnottar       10479       141
AB123  North West  Mmabatho        134138      773

Classification: the values are categorical
Id     Province    Police Station  Population  Frequent Crime
AB123  Gauteng     Dunnottar       10479       Burglary
AB123  North West  Mmabatho        134138      Arson
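
As a rough rule of thumb, the class of the target feature hints at the task type. The helper below (task_type is a hypothetical name, not part of any package) is a minimal sketch of that idea using the two example rows above.

```r
# Decide the task type from the class of the target feature
task_type <- function(target) {
  if (is.numeric(target)) "regression" else "classification"
}

# Targets taken from the two example tables
burglary       <- c(141, 773)                     # counts: continuous values
frequent_crime <- factor(c("Burglary", "Arson"))  # categories

task_type(burglary)        # "regression"
task_type(frequent_crime)  # "classification"
```

A numeric column that only stores category codes would be misread by this check, so it is a starting point rather than a rule.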

Checklist
Checklist 1 Is it a supervised, unsupervised or reinforcement machine learning project?
Checklist 2 Is it a classification or regression task?
Checklist 3 Identify the target feature or features to be clustered
Checklist 4 Can I get extra data or features to boost my project?

Checklist
Checklist 5 What are the available solutions to the problem?
Checklist 6 How do I intend to measure the performance of my model?
Checklist 7 How will my solution be deployed and utilised?
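
For Checklist 6, two common measures can be written in a line each of base R. This is a sketch only; the slides do not say which measure the project uses, and the function names and example values are illustrative.

```r
# Regression: root mean squared error (lower is better)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Classification: share of correct predictions (higher is better)
accuracy <- function(actual, predicted) mean(actual == predicted)

# Hypothetical predictions compared against known values
rmse(c(141, 773), c(150, 760))
accuracy(c("Burglary", "Arson", "Arson"),
         c("Burglary", "Arson", "Burglary"))
```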

Data Loading, Merging & Visualisation
Data Form
Alpha-Numeric: 1, 2.0, 10-5, 2014-08-21, $1,000, Yes, No, Male, Female
Text: If you cannot do great things, do small things in a great way. (This is a quote by Napoleon Hill.)
Image
Audio
Video

Data Loading Checklist
Data Location
Where is the dataset located? Computer | Server | Web | Cloud.
Data Form
The dataset is in what form? Alpha-Numeric | Text | Image | Audio | Video.
Data Size
How big is the dataset? Is the size in kilobytes, megabytes, gigabytes or terabytes?
Data Flow
Is it real-time data? Does it come as a stream or in batches?
Analysis Platform
Can I analyse it on my computer, or do I need to engage the services of a cloud-based computing provider, e.g. Microsoft Azure, Amazon Web Services (AWS), Google Cloud, etc.?
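
The Data Size question can be answered from R before loading anything. The sketch below uses base R's file.size() on a temporary file; size_label is a hypothetical helper, and the temporary file stands in for the real dataset path.

```r
# Turn a size in bytes into a readable label (1 KB = 1024 bytes here)
size_label <- function(bytes) {
  units <- c("bytes", "KB", "MB", "GB", "TB")
  i <- 1
  while (bytes >= 1024 && i < length(units)) {
    bytes <- bytes / 1024
    i <- i + 1
  }
  paste(round(bytes, 1), units[i])
}

# Demonstration on a temporary file; replace tmp with the path to your dataset
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), tmp, row.names = FALSE)
size_label(file.size(tmp))

size_label(5 * 1024^2)  # "5 MB"
```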

Data Loading Steps
Step 1 Start RStudio
Start Menu > RStudio
It is assumed that you have already installed RStudio.
This pane is for writing code.
This shows the loaded data.
This is for packages, plots, etc.

Data Loading Steps
Step 2 Install the necessary R packages
install.packages("dplyr")
install.packages("pastecs")
install.packages("ggplot2")
Step 3 Load the packages
library("dplyr")
library("pastecs")
library("ggplot2")
Step 4 Set the working directory
setwd("C:/Project_Analytics/SA_Crime_Analysis")
Step 5 Load the data
Dataset_A <- read.csv("dataset/Dataset_A.csv")
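
Re-running install.packages() on every session is slow. One possible pattern, sketched here with a hypothetical helper named missing_packages, is to install only what is not yet present.

```r
# Return the packages from pkgs that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("dplyr", "pastecs", "ggplot2")
to_install <- missing_packages(needed)

# Install only what is missing; guarded so it runs in interactive sessions only
# (needs an internet connection)
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```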

Project Data Loading
Dataset A - Crime Reported and Police Station
The dataset is in csv (comma delimited) format.
# Loading the dataset
Dataset_A <- read.csv("Dataset_A.csv")
# Viewing the top 6 records
head(Dataset_A)
# Sorting the records using 'Police_Station'
Dataset_A <- Dataset_A[order(Dataset_A$Police_Station), ]
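
Sorting a data frame in base R is done with order(), which returns the row positions that put a column in ascending order. A small sketch on a made-up data frame (the pairing of stations and counts is illustrative only):

```r
toy <- data.frame(
  Police_Station   = c("Mmabatho", "Aberdeen", "Dunnottar"),
  Period_2015_2016 = c(773, 51, 141),
  stringsAsFactors = FALSE
)

# order() gives the row positions in sorted order; use them to index the rows
sorted <- toy[order(toy$Police_Station), ]
sorted$Police_Station  # "Aberdeen" "Dunnottar" "Mmabatho"
```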

# Viewing the structure of the dataset
str(Dataset_A)

Reshaping the dataset

Long Format
Province      Police_Station  Crime_Category                         Period_2015_2016
Eastern Cape  Aberdeen        All theft not mentioned elsewhere      51
Eastern Cape  Aberdeen        Theft out of or from motor vehicle     7
Eastern Cape  Aberdeen        Theft of motor vehicle and motorcycle  2
Eastern Cape  Aberdeen        Stock-theft                            20

Wide Format
Province      Police_Station  All theft not mentioned elsewhere  Theft out of or from motor vehicle  Theft of motor vehicle and motorcycle  Stock-theft
Eastern Cape  Aberdeen        51                                 7                                   2                                      20

Reshaping (Pivoting) the dataset from "long" to "wide" format
library("tidyr")  # spread() comes from the tidyr package
Dataset_A_Wide <- spread(Dataset_A, Crime_Category, Period_2015_2016)
head(Dataset_A_Wide, n = 5)
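
In recent versions of tidyr, pivot_wider() is the newer interface for the same long-to-wide reshaping. The sketch below rebuilds the four Aberdeen rows shown earlier and pivots them; it assumes tidyr 1.0 or later is installed.

```r
library(tidyr)

# The four long-format rows from the example above
long <- data.frame(
  Province         = "Eastern Cape",
  Police_Station   = "Aberdeen",
  Crime_Category   = c("All theft not mentioned elsewhere",
                       "Theft out of or from motor vehicle",
                       "Theft of motor vehicle and motorcycle",
                       "Stock-theft"),
  Period_2015_2016 = c(51, 7, 2, 20),
  stringsAsFactors = FALSE
)

# One column per crime category, one row per police station
wide <- pivot_wider(long,
                    names_from  = Crime_Category,
                    values_from = Period_2015_2016)
dim(wide)  # 1 row, 6 columns
```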

Viewing the properties of the reshaped dataset
str(Dataset_A_Wide)

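Check the dataset for duplicates
This is a major checklist item before merging this dataset with the other datasets.
# Number of records checked (one TRUE/FALSE value per record)
length(duplicated(Dataset_A_Wide$Police_Station))
[1] 1143
# Number of repeated police station names (0 means there are no duplicates)
sum(duplicated(Dataset_A_Wide$Police_Station))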
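
Note that duplicated() returns one logical value per record, so its length is simply the record count. Counting the TRUE values is what reveals repeats. A small sketch on a made-up vector:

```r
stations <- c("ABERDEEN", "ADDO", "ADDO", "ADELAIDE")

length(duplicated(stations))    # 4: one value per record
sum(duplicated(stations))       # 1: number of repeated entries
stations[duplicated(stations)]  # "ADDO": the repeated value itself
```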

Dataset B - Police Stations and the Population that they Cover
The dataset is in xlsx (MS Excel) format.
# Install and load the library
install.packages("xlsx")
library("xlsx")
# Load the dataset (sheetIndex = 1 assumes the data is on the first sheet)
Dataset_B <- read.xlsx("Dataset_B.xlsx", sheetIndex = 1)
# Sort the dataset
Dataset_B <- Dataset_B[order(Dataset_B$Police_Station), ]
head(Dataset_B, n = 5)
  Police_Station population_estimate
1 ABERDEEN                  9866.916
2 ACORNHOEK               127623.360
3 ACTONVILLE               52830.848
4 ADDO                     20938.325
5 ADELAIDE                 13587.573
NB: You need to install Java and set JAVA_HOME for it to work. Download Java via the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jdk9-downloads-3848520.html
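
Before loading the xlsx package, it may help to confirm that R can see JAVA_HOME. A minimal base R check (a sketch; it only reads the environment variable and does not verify the Java installation itself):

```r
# Read the environment variable; an empty string means it is not set
java_home <- Sys.getenv("JAVA_HOME")

if (nzchar(java_home)) {
  message("JAVA_HOME is set to: ", java_home)
} else {
  message("JAVA_HOME is not set; the xlsx package may fail to load.")
}
```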

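Viewing the attributes of the features
str(Dataset_B)
# Number of records checked (one TRUE/FALSE value per record)
length(duplicated(Dataset_B$Police_Station))
[1] 1140
# Number of repeated police station names (0 means there are no duplicates)
sum(duplicated(Dataset_B$Police_Station))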

Dataset C - Police Stations and their Geo-Coordinates
The dataset is in tsv (tab delimited) format.
# Load the dataset ("\t" is the tab character)
Dataset_C <- read.table("Dataset_C.tsv", header = TRUE, sep = "\t")
# Sort the dataset
Dataset_C <- Dataset_C[order(Dataset_C$Police_Station), ]
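
The separator must be the tab character "\t"; a plain 't' would split fields on the letter t. The sketch below reads a small in-memory tab-delimited sample through the text argument of read.table(). The station names and coordinates are made up for illustration.

```r
# Hypothetical tab-delimited content (not real coordinates)
tsv_text <- paste(
  "Police_Station\tLongitudeY\tLatitudeX",
  "STATION_A\t25.5\t-30.1",
  "STATION_B\t28.2\t-26.0",
  sep = "\n"
)

# text = lets read.table() parse a string instead of a file
toy_c <- read.table(text = tsv_text, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
str(toy_c)
```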

Viewing the attributes of the features
str(Dataset_C)

Dataset A
Feature: Province, Police_Station, +27 features
Dataset B
Feature: Police_Station, population_estimate
Dataset C
Feature: Police_Station, LongitudeY, LatitudeX
Total Records = 1142

Datasets Merging
Dataset A (1143 records): Province, Police_Station, Crime_Category, Period_2015_2016
Dataset B (1140 records): Police_Station, population_estimate
Dataset C (1142 records): Police_Station, LongitudeY, LatitudeX
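
Because the record counts differ, some stations in one dataset have no partner in another. Base R's setdiff() lists them. The vectors below are made up to show the idea; on the real data they would be the Police_Station columns.

```r
# Hypothetical key columns from two datasets
keys_a <- c("ABERDEEN", "ACORNHOEK", "ADDO", "ADELAIDE")
keys_b <- c("ABERDEEN", "ADDO")

# Stations present in the first set but absent from the second
setdiff(keys_a, keys_b)  # "ACORNHOEK" "ADELAIDE"
```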

Datasets Merging
Merging Dataset A & B
Note: Dataset A contains more records than Dataset B. Hence, Dataset A is the universal dataset.
paste("Size of Dataset A_Wide =", nrow(Dataset_A_Wide))
paste("Size of Dataset B =", nrow(Dataset_B))
Size of Dataset A_Wide = 1143
Size of Dataset B = 1140
# Left join: keeps every record of Dataset A
Dataset_A_B <- left_join(Dataset_A_Wide, Dataset_B, by = "Police_Station")
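
The sample rows shown earlier spell the station name as "Aberdeen" in Dataset A and "ABERDEEN" in Dataset B. If the full data differ in case like this, a join on the raw key would match nothing. The sketch below shows one way to normalise the key first; clean_key is a hypothetical helper, and the Stock_theft value for Addo is made up.

```r
library(dplyr)

crime <- data.frame(Police_Station = c("Aberdeen", "Addo "),
                    Stock_theft    = c(20, 5),
                    stringsAsFactors = FALSE)
population <- data.frame(Police_Station      = c("ABERDEEN", "ADDO"),
                         population_estimate = c(9866.916, 20938.325),
                         stringsAsFactors = FALSE)

# Joined as-is, no key matches, so population_estimate is NA in every row
raw_join <- left_join(crime, population, by = "Police_Station")

# Upper-case and trim the key on both sides before joining
clean_key <- function(x) toupper(trimws(x))
crime$Police_Station      <- clean_key(crime$Police_Station)
population$Police_Station <- clean_key(population$Police_Station)
clean_join <- left_join(crime, population, by = "Police_Station")

sum(is.na(raw_join$population_estimate))    # 2
sum(is.na(clean_join$population_estimate))  # 0
```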

Datasets Merging
Merging Dataset A_B with Dataset C
paste("Size of Dataset A_B =", nrow(Dataset_A_B))
paste("Size of Dataset C =", nrow(Dataset_C))
Size of Dataset A_B = 1143
Size of Dataset C = 1142
# Left join: keeps every record of Dataset A_B
Dataset_A_B_C <- left_join(Dataset_A_B, Dataset_C, by = "Police_Station")
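
A left join keeps every row of the left table, but it can also add rows when the right table repeats a key, which is why the duplicate check matters before merging. A small sketch with made-up coordinates:

```r
library(dplyr)

left  <- data.frame(Police_Station = c("ABERDEEN", "ADDO"),
                    stringsAsFactors = FALSE)
# "ADDO" appears twice on the right-hand side (hypothetical values)
right <- data.frame(Police_Station = c("ABERDEEN", "ADDO", "ADDO"),
                    LatitudeX      = c(-32.5, -33.5, -33.6),
                    stringsAsFactors = FALSE)

joined <- left_join(left, right, by = "Police_Station")

nrow(left)    # 2
nrow(joined)  # 3: the repeated key produced an extra row
```

Comparing nrow() before and after the join is a quick way to catch this.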

Please subscribe to my YouTube channel for the other versions, and like the video on LinkedIn and YouTube.
