Data Warehousing and Business Intelligence
Project
on
Criminalytics:
Entities affecting the Rate of Crime in Republic of Ireland
Shrikant Uday Samarth
x18129137
MSc/PGDip Data Analytics – 2019/20
Submitted to: Sean Heeney
National College of Ireland
Project Submission Sheet – 2017/2018
School of Computing
Student Name: Shrikant Uday Samarth
Student ID: x18129137
Programme: MSc Data Analytics
Year: 2019/20
Module: Data Warehousing and Business Intelligence
Lecturer: Dr. Simon Caton
Submission Due
Date:
12/04/2019
Project Title: Criminalytics: Entities affecting the Rate of Crime in Republic
of Ireland
I hereby certify that the information contained in this (my submission) is information
pertaining to my own individual work that I conducted for this project. All information
other than my own contribution is fully and appropriately referenced and listed in the
relevant bibliography section. I assert that I have not referred to any work(s) other than
those listed. I also include my TurnItIn report with this submission.
ALL materials used must be referenced in the bibliography section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
other author’s written or electronic work is an act of plagiarism and may result in disciplinary
action. Students may be required to undergo a viva (oral examination) if there
is suspicion about the validity of their submitted work.
Signature:
Date: April 12, 2019
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. You must ensure that you retain a HARD COPY of ALL projects, both for
your own reference and in case a project is lost or mislaid. It is not sufficient to keep
a copy on computer. Please do not bind projects or place in covers unless specifically
requested.
3. Assignments that are submitted to the Programme Coordinator office must be placed
into the assignment box located outside the office.
Office Use Only
Signature:
Date:
Penalty Applied (if
applicable):
Table 1: Mark sheet – do not edit
Criteria Mark Awarded Comment(s)
Objectives of 5
Related Work of 10
Data of 25
ETL of 20
Application of 30
Video of 10
Presentation of 10
Total of 100
Project Check List
This section captures the core requirements of the project, represented as a checklist
for convenience.
Used LATEX template
Three Business Requirements listed in introduction
At least one structured data source
At least one unstructured data source
At least three sources of data
Described all sources of data
All sources of data are less than one year old, i.e. released after 17/09/2017
Inserted and discussed star schema
Completed logical data map
Discussed the high level ETL strategy
Provided 3 BI queries
Detailed the sources of data used in each query
Discussed the implications of results in each query
Reviewed at least 5-10 appropriate papers on topic of your DWBI project
Criminalytics: Different entities affecting the Rate of
Crime in Republic of Ireland
Shrikant Uday Samarth
x18129137
April 12, 2019
Abstract
The aim of this paper is to understand the spatial patterns of crime in
Ireland and to develop a better understanding of how factors such as
migration, population and unemployment contribute to increases or decreases
in the crime rate. According to the research literature, there have been
complaints of various types of violence and crime Bacon & O’Donoghue (1975).
To address the concern of public safety, this Data Warehousing and Business
Intelligence model has been developed from data drawn from several sources.
The analysis uses a wide range of sources, including Statista, the Central
Statistics Office and Numbeo, among others, which helped in understanding how
these factors relate to the crime rate in the Republic of Ireland. The findings
can be used by An Garda Síochána, the Irish police service, to understand
these patterns and to anticipate increases in crime rates in a specific county
by studying these factors.
1 Introduction
Every country defines laws to maintain order. A law comprises a set of rules that
lets both those who enforce it and those who follow it trust that there is harmony
and well-being for all. When such laws are broken, the resulting harm can be
described as a crime. The Oxford Dictionary of Sociology defines crime as an offence
which goes beyond the personal and into the public sphere, breaking prohibitory rules
or laws, to which legitimate punishments or sanctions are attached, and which requires
the intervention of a public authority Scott & Marshall (2009). A crime can therefore
be seen as the outcome of disobeying the set of protocols called law. Many scholars
have offered other definitions of crime, for example: "An intentional act or omission
in violation of criminal law, committed without defense or justification and sanctioned
by law as felony or misdemeanor" Paul (2016).
Crime exists even in the least corrupt countries, but the aim should be a crime-free
nation. The aim of this project is to understand the patterns of crime in Ireland. The
motivation is to explore how data analytics can help reduce crime rates and inform
new reforms to curb crime in Ireland.
(Req-1) Does the unemployment in the country have any effect on increase or decrease in
rate of crime and population of Ireland?
(Req-2) Does the increase in population of counties cause any effect on the number of crime
rates registered in the Garda Stations of the respective counties?
(Req-3) Does the immigration in the country have any effect on increase or decrease in rate
of crime in Ireland?
2 Data Sources
The source of the data plays a crucial role in understanding crime patterns; both the
quality and the quantity of the data contribute to a more concrete analysis. The project
uses 2 unstructured and 5 structured datasets from different sources to compare
different parameters of crime and to provide a conclusive, well-founded analysis of
the theme.
Source | Type | Brief Summary
Numbeo.com | Unstructured | Gives different parameters of crime committed in each county of Ireland.
Wikipedia.org | Unstructured | Gives the county-wise population of Ireland.
databank.worldbank.org | Structured | Gives the unemployment percentage of males and females by basic and advanced education.
Statista.com | Structured | Provides the year-wise population of Ireland (2007-2017).
cso.ie | Structured | Gives quarterly data on various crime offences from 2007-18, used to calculate year-wise total crime offences.
Kaggle.com | Structured | Provides criminal offences committed and recorded at 563 different Garda stations, year-wise.
dbei.gov.ie | Structured | Gives county-wise work permits issued by the companies for the year 2018.
Table 2: Summary of sources of data used in the project
2.1 Source 1: Numbeo - Crime in Ireland county-wise:
This is the most important unstructured dataset in the project, as it gives different
parameters of crime committed in each county of Ireland. It covers all levels of crime,
expressed as percentages, and also indicates the intensity of crime through categories
such as Low, Moderate and High. Using the XPath of the web page, all levels of crime
and the crime index were extracted in RStudio. In this way, data for 15 counties were
extracted from 15 Numbeo web pages.
https://www.numbeo.com/crime/in/Galway
https://www.numbeo.com/crime/in/Cork
https://www.numbeo.com/crime/in/Sligo-Ireland
https://www.numbeo.com/crime/in/Kilkenny
https://www.numbeo.com/crime/in/Limerick
https://www.numbeo.com/crime/in/Mayo
https://www.numbeo.com/crime/in/Waterford
https://www.numbeo.com/crime/in/Longford
https://www.numbeo.com/crime/in/Offaly
https://www.numbeo.com/crime/in/Dublin
https://www.numbeo.com/crime/in/Wexford-Ireland
https://www.numbeo.com/crime/in/Carlow-Ireland
https://www.numbeo.com/crime/in/Kerry
https://www.numbeo.com/crime/in/Kerry
https://www.numbeo.com/crime/in/Leitrim
Figure 1: Source 1
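The XPath-based extraction described for Source 1 can be illustrated with a minimal sketch. The report's scraping was done in R; the Python below is only an analogue using the standard library. The page fragment, the `crime` table class, the function name `extract_crime_rows` and all values are hypothetical placeholders, since the actual Numbeo markup is not reproduced in this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one county page; the real Numbeo
# markup is not shown in the report and will differ.
SAMPLE_PAGE = """
<html><body>
  <table class="crime">
    <tr><td>Level of crime</td><td>45.83</td><td>Moderate</td></tr>
    <tr><td>Safety during daylight</td><td>78.12</td><td>High</td></tr>
  </table>
</body></html>
"""

def extract_crime_rows(page, county):
    """Return one dict per table row selected by an XPath expression."""
    root = ET.fromstring(page)
    rows = []
    # ElementTree supports a subset of XPath, enough for this selection.
    for tr in root.findall(".//table[@class='crime']/tr"):
        label, value, level = [td.text.strip() for td in tr.findall("td")]
        rows.append({"county": county, "parameter": label,
                     "percent": float(value), "level": level})
    return rows

rows = extract_crime_rows(SAMPLE_PAGE, "Galway")
```

Running the same function once per county page and concatenating the resulting lists mirrors the step of appending all 15 pages into one data frame.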
2.2 Source 2: Wikipedia -List of Irish Counties by population:
This dataset consists of a list of the counties of Ireland ordered by population. The
data are taken from the latest census of the Republic of Ireland. The source contains
4 columns of information (population, density, traditional province and change since
the previous census) on 32 counties. The data were extracted using the htmltab and
data.table packages in RStudio.
https://en.wikipedia.org/wiki/List_of_Irish_counties_by_population
Figure 2: Source 2
2.3 Source 3: Databank - Unemployment: percentage with basic and advanced education
This dataset consists of data on unemployed males and females in Ireland who have
attained a basic or an advanced education. The data are used to understand how
unemployment may contribute to increases or decreases in crime rates in Ireland. The
unemployment figures for both males and females are expressed as percentages of the
total labor force of Ireland. These parameters were selected from the website linked
below, and the data were downloaded in .xlsx format. The source contains 5 columns of
information for the years 2007-2018. The World Bank website does not state when the
data were created or updated. However, the source contains data for 2018, which falls
within the project's one-year time-frame requirement.
https://databank.worldbank.org/data/source/world-development-indicators
2.4 Source 4: Statista - Population Growth in Ireland:
This is an important dataset from Statista which shows the population growth in
Ireland over the past 10 years. It is important for understanding the pattern of
population growth and its effect on the rise or fall of criminal activity in the state.
The dataset was released in November 2018 and gives population growth as a percentage
compared to the previous year. For uniformity with the other datasets, information for
2018 was added from the source mentioned in the details section on the link.
Moreover, the percentage data were converted into population in millions, taking the
2006 population in millions Irish population analysis (2019) as the starting point,
with the calculations done in Excel. The data were then cleaned in RStudio.
https://www.statista.com/statistics/376895/population-growth-in-ireland/
Figure 3: Source 4
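The conversion from yearly growth percentages to population in millions can be written as P_t = P_(t-1) * (1 + g_t / 100), starting from the 2006 base. The sketch below assumes this compounding form; the report only states that the calculation was done in Excel. The function name and the figures are illustrative placeholders, not the Statista values.

```python
def population_from_growth(base_population, growth_percent_by_year):
    """Compound yearly growth percentages onto a base-year population."""
    population = base_population
    result = {}
    for year in sorted(growth_percent_by_year):
        # P_t = P_(t-1) * (1 + g_t / 100)
        population *= 1 + growth_percent_by_year[year] / 100
        result[year] = round(population, 2)
    return result

# Placeholder figures for illustration only, not the Statista values.
example = population_from_growth(4.00, {2007: 2.0, 2008: 1.0})
```

With a base of 4.00 million, 2% growth gives 4.08 million for the first year, and a further 1% gives about 4.12 million for the second.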
2.5 Source 5: Central Statistics Office - Crime Recorded in Ireland:
This dataset, used together with the two preceding datasets, contains the crimes
recorded in Ireland over the last decade. The CSO has released these data publicly for
citizens' awareness, and they are well suited to understanding how recorded crime
varies from year to year and how it depends on the unemployment rate and population
growth. From the CSO link below, key table CJQ01 was used, which gives 75 types of
recorded crime offences by quarter. The data were published on 22/03/2019 11:00:00 on
the CSO website. The quarterly data were then converted to yearly data in RStudio to
maintain uniformity.
https://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/Define.asp?MainTable=CJQ01&TabStrip=Select&PLanguage=0&FF=1
Figure 4: Source 5
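The quarter-to-year conversion amounts to summing the four quarterly counts of each year. A minimal sketch follows, written in Python rather than the R used in the project; the quarter label format (`2007Q1`), the function name and the counts are assumptions for illustration, not CSO figures.

```python
from collections import defaultdict

def yearly_totals(quarterly_counts):
    """Sum quarterly offence counts (keys like '2007Q1') into yearly totals."""
    totals = defaultdict(int)
    for quarter_label, count in quarterly_counts.items():
        year = int(quarter_label[:4])  # '2007Q1' -> 2007
        totals[year] += count
    return dict(totals)

# Placeholder counts for illustration only, not CSO figures.
sample = {"2007Q1": 10, "2007Q2": 12, "2007Q3": 11, "2007Q4": 9, "2008Q1": 8}
totals = yearly_totals(sample)
```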
2.6 Source 6: Kaggle - Crimes at Ireland Garda Stations 2007-2017
This file consists of the original variables on criminal offences committed and
recorded at 563 different Garda stations, year by year. The dataset helps us understand
what types of crime were committed, together with the ID of the station where they
were reported, over the decade from 2007 to 2017. The data were created on 26/3/2019
and contain 4 columns of information for each county, with recorded offences per year.
https://www.kaggle.com/johnpwatson/crimesatirelandgardastations20072017
Figure 5: Source 6
2.7 Source 7: Department of Business, Enterprise and Innovation - Employment Permit Statistics 2018
This statistical dataset was collected from the Irish Government and covers county-wise
work permits issued by the companies for the year 2018. It is part of an open
publication by the Republic of Ireland for public access. The dataset helps us
understand immigration into each county of Ireland, giving a detailed picture of the
permits granted and not granted to individuals. The data were released on 16/01/2019
and contain 5 columns of information for each county.
https://dbei.gov.ie/en/Publications/Employment-Permit-Statistics-2018.html
Figure 6: Source 7
3 Related Work
According to research on crime records in Ireland, an increase in population has a
significant relationship with an increase in the number of crimes. In their paper The
Economics of Crime in Republic of Ireland: An Exploratory Paper, Peter Bacon and Martin
O'Donoghue explored the possibility of applying models developed elsewhere to an
analysis of rising crime rates in Ireland. Their analysis also indicates that rising
unemployment is associated with an increase in crimes against property with violence
and with a decrease in crimes against property without violence. This motivates the
question of whether unemployment is a cause of rising crime rates in the Republic of
Ireland Bacon & O’Donoghue (1975).
Taking this as a hypothesis, other parameters that may also contribute to rising crime
rates in Ireland were examined. One of the factors considered here is the rate of
unemployment in Ireland over a ten-year period. In The Irish Labour Market and the
Great Recession, Alan Barrett and Seamus McGuinness found that the recession of
2008-09 brought a drastic drop in employment, and the crime dataset showed that crime
rates were rising during the recession period. The data suggest that, to earn a
livelihood during the recession, people resorted to criminal activity more in this
period than in any other year Barrett & McGuiness (2012). Moreover, in The Impact of
Ireland's Recession on the Labour Market Outcomes of its Immigrants, Alan Barrett and
Elish Kelly report that employment permits plummeted during this period, which
motivated me to investigate this aspect so that we can understand the patterns in
criminals' mentality before they commit crimes Barrett & Kelly (2012).
To check these findings for the last decade, the Databank source was used for
unemployment, the crime dataset was taken from the CSO, and the Statista source was
used to check for any major change in population. Barrett and Kelly also discuss the
decline in the immigrant population, which motivated me to examine the county-level
population of Ireland, taken from Wikipedia. For work permits, the county-wise
employment permit dataset was taken from the DBEI government website and compared with
the county crime index dataset, which is based on people's perceptions and was web
scraped from the Numbeo website. Furthermore, a CSO review of the quality of crime
statistics explains how statistics of recorded crime play a vital role in informing
society of the level and types of offence CSO Review of the Quality of Crime
Statistics 2016 (2016); this motivated me to examine the Garda station records
dataset, which was sourced from the Kaggle website.
4 Data Model
Data warehousing is a technique for collecting and managing data from varied sources
to provide meaningful business insights. It is a blend of technologies and components
that allows the strategic use of data: the electronic storage of a large amount of
information by a business, designed for query and analysis rather than transaction
processing. It is a process of transforming data into information and making it
available to users in a timely manner What Is Data Warehousing? Types, Definition
Example (n.d.).
There are two commonly discussed approaches to designing a data warehouse:
1. Inmon 2. Kimball
In Bill Inmon's enterprise data warehouse approach (the top-down design), a normalized
data model is designed first; the dimensional data marts, which contain the data
required for specific business processes or specific departments, are then created
from the data warehouse. In Ralph Kimball's dimensional design approach (the bottom-up
design), the data marts facilitating reports and analysis are created first and are
then combined to create a broad data warehouse George (2019).
In this project, Kimball's approach was used to design the dimensional model for the
data warehouse, because it occupies less space, is easier to manage and is faster than
Inmon's approach. Ralph Kimball also consistently supported the inclusion of end users
throughout the process Chhabra & Pahwa (2014). This approach suits my research
project, as the queries are intended to help make society a better place to live. If
the fact table needed to change in future, Inmon's approach would be preferable, but
in this case no such changes are anticipated, so Kimball's methodology is the better
alternative.
To accomplish this, I joined the datasets on year (2007-2018), which is common to the
Databank, CSO and Statista datasets, whereas the unstructured datasets from Wikipedia
and Numbeo are joined with the DBEI and Kaggle datasets, which are county-based, for
the year 2018. From these datasets I derived 2 dimensions, DimYear and DimCounty,
which are discussed below:
DimYear: The year dimension was created because year is a common parameter across the
datasets. CSO contains crime offences by year, Databank contains year-wise unemployment
percentages for males and females by basic and advanced education, and Statista
contains year-wise population data. Hence, to relate the unemployment percentage to
crime offences, year is the common parameter, and Year is the primary parameter in this
data warehouse project. Using SSIS, YearId was generated and assigned as the primary
key of the year dimension.
DimCounty: The county dimension was created because county is common to Numbeo and
Wikipedia, the unstructured datasets, and to Kaggle and DBEI, the structured datasets.
County is the common parameter on which the crime rate from Numbeo and the employment
permits are compared. Hence, County is also a primary parameter in this data
warehouse. Using SSIS, CountyID was generated and assigned as the primary key of the
county dimension.
The figure below illustrates the star schema for this project, which is further used
for the Tableau visualization:
Figure 7: Star Schema
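The two dimensions and their generated keys can be sketched as follows. The project built these tables in SQL Server through SSIS; this sketch substitutes an in-memory SQLite database purely so that it is self-contained. The names DimYear, DimCounty, YearId, CountyID and FactTable come from the report, while the two fact columns shown and the three sample counties are an assumed subset for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimYear   (YearId   INTEGER PRIMARY KEY AUTOINCREMENT, Year   INTEGER UNIQUE);
CREATE TABLE DimCounty (CountyID INTEGER PRIMARY KEY AUTOINCREMENT, County TEXT    UNIQUE);
-- Assumed subset of measures; the real fact table holds more columns.
CREATE TABLE FactTable (
    YearId        INTEGER REFERENCES DimYear(YearId),
    CountyID      INTEGER REFERENCES DimCounty(CountyID),
    TotalOffences INTEGER,
    TotalPermit   INTEGER
);
""")

# Surrogate keys are generated by the database, one per distinct value.
conn.executemany("INSERT INTO DimYear (Year) VALUES (?)",
                 [(y,) for y in range(2007, 2019)])
conn.executemany("INSERT INTO DimCounty (County) VALUES (?)",
                 [("Dublin",), ("Cork",), ("Galway",)])

year_count = conn.execute("SELECT COUNT(*) FROM DimYear").fetchone()[0]
year_id_2018 = conn.execute(
    "SELECT YearId FROM DimYear WHERE Year = 2018").fetchone()[0]
```

Keeping the key generation inside the dimension tables means the fact table only ever stores small integer references, which is the usual motivation for surrogate keys in a star schema.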
5 Logical Data Map
The dimensions and the facts of all the datasets are explained below.
Table 3: Transformations, sources and destinations for all components in the Logical Data Map
Source | Column | Destination | Column | Type | Transformation
1 | Year | DimYear | Year | Dimension | Primary key; 2018 year added to the table, as the data is given in the website
1 | County | DimCounty | County | Dimension | Primary key; county name renamed for consistency
1 | Level of crime | FactTable | Level of crime | Fact | Rounded to 2 decimal places
1 | Safety during daylight | FactTable | Safety during daylight | Fact | Rounded to 2 decimal places
1 | Total Crime Record | FactTable | Total Crime Record | Fact | Rounded to 2 decimal places
2 | Year | DimYear | Year | Dimension | Primary key; 2018 year added to the table, as the data is given in the website
2 | Administrative County | DimCounty | County | Dimension | Primary key; county names renamed for consistency with the other datasets
2 | Population | FactTable | Population | Fact | Commas removed from the values
3 | Year | DimYear | Year | Dimension | Primary key; format matched with the other datasets (it was in 2007 [YR2007] format)
3 | Labor force, total | FactTable | Total Labor Force in Million | Fact | Rounded to 3 decimal places
3 | Unemployment with basic education, male (% of male labor force with basic education) | FactTable | Unemployment Basic Education Male Percentage | Fact | Rounded to 2 decimal places
3 | Unemployment with basic education, female | FactTable | Unemployment Basic Education Female Percentage | Fact | Rounded to 2 decimal places
3 | Unemployment with Advance education, female | FactTable | Unemployment Advance Education Female Percentage | Fact | Rounded to 2 decimal places
3 | Unemployment with Advance education, male | FactTable | Unemployment Advance Education Male Percentage | Fact | Rounded to 2 decimal places
4 | Year | DimYear | Year | Dimension | Primary key; 2018 year added from the source mentioned in the description
4 | Population growth compared to previous year | FactTable | Percentage Rise | Fact | Rounded to 2 decimal places
4 | Population Million | FactTable | Population in Million | Fact | Rounded to 2 decimal places
5 | Year (Quarter-wise) | DimYear | Year | Dimension | Quarter-wise year converted into year format, then transposed to match the other datasets
5 | Total Offences | FactTable | Total Offences | Fact | Number-of-offences columns were summed into total offences
5 | Revenue | FactTable | Revenue | Fact | Rounded to nearest million
6 | Year | DimYear | Year | Dimension | 2018 year was taken from the table; positioning of column
6 | Station | DimCounty | County | Dimension | Taken as it is from the source
6 | Number_of_Crime_Record | FactTable | Total Crime Record | Fact | Commas removed from the values
7 | Year | DimYear | Year | Dimension | Taken as it is from the source
7 | County/Country | DimCounty | County | Dimension | Extra spaces were removed from the column
7 | New | FactTable | New | Fact | Taken as it is from the source
7 | Renewal | FactTable | Renewal | Fact | Taken as it is from the source
7 | Total | FactTable | Total Permit | Fact | Extra comma removed from the value
7 | Refused | FactTable | Refused | Fact | Taken as it is from the source
7 | Withdrawn | FactTable | Withdrawn | Fact | Taken as it is from the source
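Several of the transformations listed in Table 3 are simple value normalizations. The sketch below shows three of them (matching the `2007 [YR2007]` year format, removing commas from numeric values, and removing extra spaces) as small Python helpers. The project performed these steps in R; the function names and the sample inputs are hypothetical.

```python
import re

def parse_year_label(label):
    """'2007 [YR2007]' -> 2007 (year label format noted for source 3)."""
    match = re.match(r"\s*(\d{4})", label)
    if match is None:
        raise ValueError(f"no four-digit year in {label!r}")
    return int(match.group(1))

def parse_count(text):
    """'1,234,567' -> 1234567 (remove thousands separators)."""
    return int(text.replace(",", "").strip())

def clean_county(name):
    """Collapse stray whitespace so county names join consistently."""
    return " ".join(name.split())
```

For example, `parse_year_label("2007 [YR2007]")` yields the integer 2007, which can then be matched against the Year column of the other datasets.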
6 ETL Process
ETL stands for Extract, Transform and Load, and is considered the foundation of a
database or data warehouse, helping to reduce errors and minimize data loss. A
high-level perspective of the system can be visualized through conceptual modeling of
the ETL process, which has several advantages, such as system error identification,
cost minimization, and risk and scope assessment Biswas et al. (2019). For the ETL
process, the data have to be cleaned and transformed according to the requirements,
taking care to remove all redundancy in the information. The ETL procedure used in
this project is as follows:
Figure 8: ETL Process
6.1 Extraction:
The structured datasets were downloaded from the CSO, Statista, Databank, DBEI and
Kaggle websites in Excel format and cleaned in R, whereas the unstructured datasets
were web scraped in R from the Wikipedia and Numbeo websites.
The first structured dataset was the CSO data on crime in Ireland. It had 48 quarterly
columns covering 2007-2018, which were converted to years by summing the quarters of
each year. The second dataset, from Statista, gave Ireland's population growth from
2007-2017. The third structured source, Databank, covered unemployment percentages by
gender. The extracted Excel file contains 6 columns: Year (2007-2018), labor force,
unemployment percentage with basic education (male and female) and unemployment
percentage with advanced education (male and female). The fourth structured dataset
was downloaded in .xlsx format from the DBEI website and covers county-wise employment
permits by companies for 2018. It contains 27 rows of counties and 7 columns: year,
county, new permits, renewal permits, total permits, and refused and withdrawn permits
from Irish embassy. The fifth dataset, from Kaggle, covers Garda station crime records
for each county. The Excel file initially had three sheets; the third sheet, which
gives the year-wise county station records, was used. The table contains 5 columns:
station id, station, division, year and number of crime records.
For the unstructured data, Numbeo was the first source. Web scraping was carried out
on 15 web pages, as each page gives the crime data for one county. Data for the
remaining counties could not be scraped because no information was available for them.
Several R packages and functions were used, with the data selected via the XPath of
the page. All the pages were then appended into a single data frame, unwanted columns
and rows were deleted, and county names were renamed to be consistent with the other
datasets. For the second unstructured dataset, population data were extracted from
Wikipedia using htmltab. The dataset contains 6 columns: Rank, Administrative County,
Population, Density, Traditional province and change since the previous census.
Unwanted columns were then removed and column names were renamed for consistency
across the datasets.
6.2 Transformation:
After extraction and before loading into the data warehouse, the data must be made fit
for the business requirements. Data transformation may include activities such as
cleaning, joining and generating calculated data from existing values. This part of
the ETL procedure is the most critical and time-consuming, as the data must be as
clean as possible to deliver an accurate business solution.
All the structured datasets were in .xlsx format. First, I identified the fields
required for my BI queries and those that needed to be modified, and removed extra
columns and spaces. The CSO dataset was given in quarters, so it was converted to
years in R for consistency. The data also had 76 rows of offence counts, which were
summed to give the total offences per year. The table was then transposed with the t()
function so that years form the rows, consistent with the other datasets, and the
quarters and offence types that were not required were dropped. For Statista,
information for 2018 was added from the source mentioned in the details section of the
web page, for uniformity with the other datasets. Moreover, the percentage data were
converted into population in millions, taking the 2006 population in millions Irish
population analysis (2019) as the starting point, with the calculations done in Excel.
Data cleaning, such as removing extra spaces and columns, was then done in RStudio. In
the Databank dataset, cleaning and formatting were done in RStudio to maintain
consistency. In the DBEI dataset, columns were renamed, extra columns and spaces were
removed, and some null values were removed. The Kaggle dataset contained many null
values; to obtain the 2018 data needed for the BI query, unwanted rows and columns
were removed using the is.na(x) function, with a not-equal-to condition used to select
the rows to remove. Crime records for 2018 were extracted by removing the rows for
other years in RStudio, and extra spaces and columns were then removed for
consistency across the datasets.
A challenging task was scraping the Numbeo website: each page represents one county
and data were available for 15 counties, so 15 pages had to be scraped. This was made
easier by the R packages rvest, magrittr, RSelenium, httr, dplyr and data.table. A
function was used to extract the data for the 15 counties, selecting the data via the
XPath of the page. All the pages were then appended into a data frame, unwanted
columns and rows were deleted, and county names were renamed for consistency with the
other datasets. For Wikipedia, the population data were extracted using htmltab and
data.table. The dataset contains 6 columns; the unwanted columns Rank, Density,
Traditional province and change since the previous census were dropped, and the
remaining columns were renamed for consistency across the datasets.
All the above changes were made in RStudio, and the code used is referenced in the
Appendix. Indexing was done in SQL Server Management Studio (SSMS) for the project
database, named Ireland Crime. For each table, I chose an attribute with a unique
value in every row and assigned it as the primary key of that table. By default, SSMS
assigns varchar(50) as the data type of each attribute; for year, which is a numeric
value, I changed the data type to int (integer).
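The row-level cleaning described for the Kaggle data (dropping rows with missing values, keeping a single year, and storing year as an integer) can be sketched as below. This is a Python analogue of the R steps, not the project's code; the function name, column names and sample rows are placeholders.

```python
def keep_complete_rows_for_year(rows, year):
    """Drop rows with missing values, then keep only the requested year."""
    cleaned = []
    for row in rows:
        if any(value is None or value == "" for value in row.values()):
            continue  # plays the role of the is.na() check
        if int(row["Year"]) != year:
            continue  # plays the role of the not-equal-to condition
        # Store year as an integer rather than text (varchar -> int).
        cleaned.append({**row, "Year": int(row["Year"])})
    return cleaned

# Placeholder rows for illustration only.
rows = [
    {"Station": "A", "Year": "2018", "Crimes": "12"},
    {"Station": "B", "Year": "2017", "Crimes": "7"},
    {"Station": "C", "Year": "2018", "Crimes": None},
]
result = keep_complete_rows_for_year(rows, 2018)
```

Only station A survives here: station B belongs to another year and station C has a missing value.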
6.3 Loading:
The last step in the ETL procedure is loading the transformed data into the end target.
Depending on the requirements of the organization, this process varies widely. Some data
warehouses overwrite existing information with cumulative information; updates from
extracted data are frequently applied on a daily, weekly, or monthly basis (Extract,
transform, load 2019).
After creating the Ireland Crime database, all the information has to be loaded into the
staging area in SSMS through SSIS. A data flow task was used to complete this process.
To load the data, a flat file source component read the transformed csv file and an OLEDB
destination component created the table and loaded the data into SSMS. In the flat file
source, we need to give the name of the table to be created in SSMS as well as the
transformed csv output file. After this, we need to set the text qualifier to inverted commas
(double quotes), which enclose field values so that commas inside a value are not read as
separators. In the preview section, the file can be seen in tabular format. The Advanced
options in SSIS allow the data types to be changed, which is important in order to load
the correct data into SSMS. Then, in the connection manager of the OLEDB destination, we
can write a SQL query to create a new table in SSMS, or use the table definition that SSIS
suggests based on the flat file source. We can modify this query from the New tab in the
OLEDB destination component, and any errors are shown in the error section. In all, I had
7 flat files, so 7 data flow tasks were created, one per file.
After the files are transferred to SSMS, the dimension tables have to be created. An
Execute SQL task is used in this step. A SQL script is written in the task itself to create
two dimension tables using the data from the raw tables created by the data flow tasks.
The dimension tables provide the context for the fact table for all the measurements
presented in the data warehouse. Although dimension tables are usually much smaller than
fact tables, they are the heart of the data warehouse because they provide entry points to
the data (Kimball & Ross 2011). Next, another SQL task was created to populate the fact
table. A SQL script was written to insert the data from the raw tables using inner joins on
the dimension tables. This was quite a challenging task: populating the fact table with the
proper values and joining the tables correctly required careful thought. After considering
the various types of joins, I found that an inner join was the most suitable for my data.
After the fact table was populated, the cube was deployed and the star schema was obtained.
I connected the cube with SSIS for automation. After successful deployment, the desired
star schema appeared, and I checked that all the values in the cube were correct using the
Explore option. After confirming all the values in the cube, the data is ready to be
visualized in a visualization tool, i.e. Tableau for this project.
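The fact-table step described above can be illustrated with a minimal sketch. It is written in R (the language of the scripts in the Appendix) rather than the SQL used in the project, and every table name, column name and value in it (raw_crime, dim_county, dim_year, County_Key, Year_Key) is an assumption made for illustration, not the project's actual schema. merge() with its default settings behaves like an inner join, so staging rows without a matching dimension row would be dropped.

```r
# Hypothetical staging ("raw") table: one row per county and year.
# Values are illustrative only, not taken from the project data.
raw_crime <- data.frame(
  County = c("Dublin", "Cork", "Galway", "Dublin"),
  Year = c(2017, 2017, 2018, 2018),
  Total_Crime_Record = c(120, 45, 30, 110),
  stringsAsFactors = FALSE
)

# Two dimension tables: distinct attribute values plus a surrogate key
dim_county <- data.frame(County = sort(unique(raw_crime$County)),
                         stringsAsFactors = FALSE)
dim_county$County_Key <- seq_len(nrow(dim_county))

dim_year <- data.frame(Year = sort(unique(raw_crime$Year)))
dim_year$Year_Key <- seq_len(nrow(dim_year))

# Inner joins: merge() keeps only rows that match in both tables (all = FALSE)
fact <- merge(raw_crime, dim_county, by = "County")
fact <- merge(fact, dim_year, by = "Year")

# The fact table keeps the surrogate keys and the measure only
fact_crime <- fact[, c("County_Key", "Year_Key", "Total_Crime_Record")]
print(fact_crime)
```

Comparing nrow(fact_crime) with nrow(raw_crime) after the joins is a quick way to see whether the inner join silently discarded any staging rows.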
7 Application
For better understanding, the Business Intelligence queries, as per the business
requirements discussed in Section 1, are visualized with the help of Tableau by mapping the 3
entities separately in a graph to understand the pattern in the data.
7.1 BI Query 1: Does the unemployment in the country have
any effect on increase or decrease in rate of crime and pop-
ulation of Ireland?
To answer this query, it must first be understood how unemployment has affected
Ireland in recent years. For that, the data used should run from 2007 to the present.
The factors taken into consideration are as follows:
- Unemployment in Ireland among working adults (men and women) in the last decade
(2007 - 2017)
- Total population across the last decade (2007-2017)
- Total number of offenses that occurred in the last decade (2007-2017)
1. Unemployment in Ireland among working adults: Here the unemployment data has been
extracted from The World Bank, which gives the details of unemployment among citizens
who have received basic education and among those who have received advanced education,
relative to the labor force available in Ireland. This helps in identifying the non-working
population from the year 2007 till 2018 and in understanding any variation in unemployment
across those years.
2. Total population across the last decade: The cumulative population has been extracted
from Wikipedia for the years 2007 till 2017 to understand the growth of the population
of Ireland. From this, it is understood by how much the population has increased or
decreased, so that it becomes comparable to the total number of crimes and the
unemployment occurring in Ireland.
3. Total number of offenses occurred in the decade: This dataset comes from the Central
Statistics Office, where the data is available as a structured dataset containing all the
offenses recorded in the past decade (2007 till 2017). This dataset helps to understand
the number of crimes occurring with respect to the population and unemployment in
Ireland.
The data was extracted, presented and visualized with the help of Tableau. Below is
the visualization received from execution of the first BI query.
Figure 9: 1st BI Query
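The comparison behind the first BI query can also be sketched outside Tableau. The following R snippet is only an illustration under assumptions: the three small data frames and all their values are invented placeholders, not the World Bank, Wikipedia or CSO figures, and the column names are hypothetical. It joins the three yearly series on Year and computes a Pearson correlation between unemployment and offences.

```r
# Illustrative placeholder values only; NOT the project's data
unemployment <- data.frame(Year = 2007:2011,
                           Unemployment_Pct = c(5, 7, 12, 14, 15))
population <- data.frame(Year = 2007:2011,
                         Population_in_Millions = c(4.4, 4.5, 4.5, 4.6, 4.6))
offences <- data.frame(Year = 2007:2011,
                       Total_Offences = c(100, 110, 125, 130, 128))

# Join the three yearly series on Year (inner join, one row per year)
yearly <- Reduce(function(x, y) merge(x, y, by = "Year"),
                 list(unemployment, population, offences))

# Offences relative to population, so that population growth is accounted for
yearly$Offences_per_Million <- yearly$Total_Offences / yearly$Population_in_Millions

# Pearson correlation between unemployment and offences
r <- cor(yearly$Unemployment_Pct, yearly$Total_Offences)
print(yearly)
print(r)
```

A correlation coefficient close to 1 would support the visual impression that the two series move together, although with so few yearly observations it should be read as indicative rather than conclusive.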
7.2 BI Query 2: Does the increase in population of counties
cause any effect on the number of crime rates registered in
the Garda Stations of the respective counties?
To understand this, the datasets that have been used are as follows,
- Garda station crime record (2018)
- Population of Ireland based on counties (2018)
1. Garda station crime records: The dataset has been obtained from Kaggle and
consists of all the Garda stations of Ireland, county wise, that have criminal cases
registered for the year 2018. This data is useful for understanding the crimes registered
across the counties, which are then compared with the population of the counties in
Ireland.
2. Population of Ireland based on counties: This dataset has been scraped from
Wikipedia, which lists the population of Ireland county wise for the year 2018. This
helps to understand the pattern of population density across Ireland.
The data was extracted, presented and visualized with the help of Tableau. Below is
the visualization received from execution of the second BI query.
Figure 10: 2nd BI Query
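To make the county comparison in the second BI query concrete, the sketch below shows one way the two datasets could be combined in R. It is a sketch under assumptions: the counties, populations and crime counts are invented placeholder values, and the column names (County, Population, Total_Crime_Record) are only assumed to resemble the cleaned output files. Station-level records are summed per county, joined to the county population, and expressed as crimes per 1,000 residents.

```r
# Placeholder values for illustration only; not the scraped data
county_population <- data.frame(
  County = c("Dublin", "Cork", "Donegal"),
  Population = c(1300000, 540000, 160000),
  stringsAsFactors = FALSE
)
station_crime <- data.frame(
  County = c("Dublin", "Dublin", "Cork", "Donegal"),
  Total_Crime_Record = c(500, 300, 200, 90),
  stringsAsFactors = FALSE
)

# Sum the Garda station records up to county level
county_crime <- aggregate(Total_Crime_Record ~ County,
                          data = station_crime, FUN = sum)

# Join crimes to population and compute crimes per 1,000 residents
county <- merge(county_population, county_crime, by = "County")
county$Crimes_per_1000 <- 1000 * county$Total_Crime_Record / county$Population

# Highest rate first
county <- county[order(-county$Crimes_per_1000), ]
print(county)
```

Looking at a per-capita rate alongside the raw counts helps separate "more people, therefore more recorded crimes" from a genuinely higher crime rate in a county.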
7.3 BI Query 3: Does the immigration in the country have any
effect on increase or decrease in rate of crime in Ireland?
Immigration helps a country to develop in various fields like technology, science and
infrastructure. But it is a debatable topic for the residents of a country, as some
think that immigration takes jobs away from the local people. This can be examined
with the help of the facts and data available. To verify the validity of this point,
three datasets have been used, namely:
- Population of Ireland based on county.
- Crimes in Ireland based on county.
- Employee permits provided to the immigrants in every county.
1. Population of Ireland based on county: Earlier, the population was considered
year by year for the whole country. Here the population of the counties has been used
as a comparative factor for the crimes occurring in Ireland. This dataset has been
scraped from Wikipedia, which lists the population of Ireland county wise for the year
2018. This helps to understand the pattern of population density across Ireland.
2. Crimes in Ireland based on county: The dataset for crimes in Ireland has been
scraped from Numbeo.com, which covers a wide range of crimes committed in Ireland for
the year 2018. This data helps to break the crimes for a specific year down to the
county level and to understand the variation in crime between counties.
3. Employee permits provided to the immigrants on county level for 2018: The dataset
gives the details of the permits given to immigrants in Ireland. This dataset helps to
understand how many permits were granted by the Irish government for every county,
thus giving an overall idea about immigration.
The data was extracted, presented and visualized with the help of Tableau. Below is
the visualization received from execution of the third BI query.
These BI queries have been detailed further in the Discussion section.
7.4 Discussion
From the 1st BI query, it can be understood that from the year 2007 to 2010 there was
a steep increase in the rate of unemployment in Ireland, accompanied by an increase in
crime rates. This can be linked with the global recession that hit the whole world
(Great Recession n.d.). From 2007 to 2010, the recession affected Ireland as well and
caused widespread joblessness, and the data show a steep increase in crime rates over
the same period. For the years 2010-2014, there is no notable change in unemployment
or in crimes committed, but from the year 2015 till 2018 there was a large decrease in
the unemployment rate, together with a decrease in the rate of crimes committed. It can
therefore be concluded that the rate of crime rises and falls with unemployment.
The population has been on a steady increase over the period from 2007 to 2018,
but there does not seem to be any relation between the overall population of the country
and the crimes committed. This is examined further in the 2nd BI query, where the
population of the counties has been compared with the crimes registered in the Garda
stations of the different counties in Ireland. From the datasets, it has been found that
as the population of a county increases, the number of crimes registered in that county
increases as well. Thus, there is a relation between the population of a county and the
crimes committed or registered in that county.
The 3rd BI query deals with the effect of immigration on the rate of crimes in Ireland.
When the datasets were mapped in Tableau, an interesting observation was that Dublin,
being the capital of Ireland, has the highest number of immigration filings, but the crimes
committed in the county are fewer. Compared with Donegal, which has the highest number
of criminal cases registered even though the number of immigration cases filed is low, it
is understood that there is no significant relation between immigration in Ireland and
the crimes reported in Ireland. On the contrary, it can be inferred that, with the
increased level of security in Dublin, the crime rates are lower than in Donegal, which is
a remote county and is less secure than Dublin and developed counties like Cork, Galway
and Limerick.
8 Conclusion and Future Aspects
At the start of the project, the assumption was that the crime rate in Ireland,
though low, still causes a small amount of unrest, which keeps Ireland from being
one of the most uncorrupted countries in the world. Crime being one of the factors, the
sub-factors that affect crime have been examined and verified against the data.
Factors such as unemployment, immigration and population rise were considered as
possible drivers of criminal activity in Ireland. Unemployment and population influenced
the crime rate, but immigration was not found to contribute to any rise in the crime rate.
This shows that even if Ireland has its own problems to deal with, like border issues with
Northern Ireland or business continuity with Britain after Brexit, it is still a peace-loving
country and encourages a positive attitude towards immigration and development.
This project is helpful for future work as it deals with criminal analytics. Every
country is affected by crime and Ireland is no different. Recently, UK police
collaborated with Accenture to perform criminal analytics to understand the patterns of
crimes that happen in the UK. This helped the UK considerably, as the system could give
rough information about the culture of criminals and identify how and when attacks
would occur in the disturbed areas. This project is termed "The Enterprise approach to
Law Enforcement" (Accenture Police Center of Excellence n.d.). The same concept can
be implemented in Ireland and can pave the way towards a safer and crime-free country.
References
Accenture Police Center of Excellence (n.d.).
URL: https://www.accenture.com/gb-en/insight-enterprise-approach-to-law-enforcement
Bacon, P. & O’Donoghue, M. (1975), ‘The economics of crime in the Republic of Ireland:
An exploratory paper’, Economic and Social Review 7(1), 19.
Barrett, A. & Kelly, E. (2012), ‘The impact of Ireland’s recession on the labour market
outcomes of its immigrants’, European Journal of Population/Revue européenne de
Démographie 28(1), 91–111.
Barrett, A. & McGuiness, S. (2012), ‘The Irish labour market and the Great Recession’,
CESifo DICE Report 10(2), 27–33.
Biswas, N., Chattapadhyay, S., Mahapatra, G., Chatterjee, S. & Mondal, K. C. (2019),
‘A new approach for conceptual extraction-transformation-loading process modeling’,
International Journal of Ambient Computing and Intelligence (IJACI) 10(1), 30–45.
Chhabra, R. & Pahwa, P. (2014), ‘Data mart designing and integration approaches’,
International Journal of Computer Science and Mobile Computing 3(4), 74–79.
CSO Review of the Quality of Crime Statistics 2016 (2016).
URL: http://www.cso.ie/en/media/csoie/releasespublications/documents/crimejustice/2016/reviewo
Extract, transform, load (2019).
URL: https://en.wikipedia.org/wiki/Extract,_transform,_load#Transform
George, S. (2019), ‘Inmon or Kimball: Which approach is suitable for your data
warehouse? 2019’.
Great Recession (n.d.).
URL: https://en.wikipedia.org/wiki/Great_Recession
Irish population analysis (2019).
URL: https://en.wikipedia.org/wiki/Irish_population_analysis
Kimball, R. & Ross, M. (2011), The data warehouse toolkit: the complete guide to
dimensional modeling, John Wiley & Sons.
Paul, T. (2016).
URL: https://www.cliffsnotes.com/study-guides/criminal-justice/crime/definitions-of-crime
Scott, J. & Marshall, G. (2009), A dictionary of sociology, OUP Oxford.
What Is Data Warehousing? Types, Definition Example (n.d.).
URL: https://www.guru99.com/data-warehousing.html1
Appendix
R code used for Cleaning and Extraction:
1. Numbeo Web scraping Code
#Install Packages
install.packages('rvest', repos = "https://cran.rstudio.com")
install.packages('magrittr', repos = "https://cran.rstudio.com")
install.packages('RSelenium', repos = "https://cran.rstudio.com")
install.packages('httr', repos = "https://cran.rstudio.com")
install.packages('dplyr', repos = "https://cran.rstudio.com")
#dplyr is the next iteration of plyr, focussed on tools for working with data
install.packages('data.table', repos = "https://cran.rstudio.com")
#Load packages
library('rvest')
library('magrittr')
library('RSelenium')
library('httr')
library('dplyr')
library('data.table')
#Define Web Scraping Function and Scraping Code for Multiple Pages
get_countdata <- function(keyword)
{
  url <- paste('https://www.numbeo.com/crime/in/', keyword, sep="")
  #Reading the HTML code from the website
  webpage <- read_html(url)
  #Getting name of the County
  County_crime <- html_nodes(webpage, '.columnWithName')
  #Converting the name data to text
  Countyname_data <- html_text(County_crime)
  #Let's have a look at the names
  head(Countyname_data)
  #Getting the crime values
  Crimenumber_data <- html_nodes(webpage, '.indexValueTd')
  #Converting the value data to text
  Crimenumber_data <- html_text(Crimenumber_data)
  #Let's have a look at the values
  head(Crimenumber_data)
  #combining all lists to form a data frame
  Crimecountywise_df <- data.frame(Name = Countyname_data, NumberofRatings =
  #adds new variables and preserves existing ones
  Crimecountywise_df <- Crimecountywise_df %>% mutate(County_Name = keyword)
}
#Define Keyword for Web scraping
url <- get_countdata('Galway')
url1 <- get_countdata("Cork")
url2 <- get_countdata("Sligo-Ireland")
url3 <- get_countdata("Kilkenny")
url4 <- get_countdata("Limerick")
url5 <- get_countdata("Maynooth-Ireland")
url6 <- get_countdata("Waterford")
url7 <- get_countdata("Athlone")
url8 <- get_countdata('Mullingar-Ireland')
url9 <- get_countdata("Dublin")
url10 <- get_countdata("Wexford-Ireland")
url11 <- get_countdata("Carlow-Ireland")
#NOTE: url12 repeats the keyword of url10 but is labelled 'Kerry' in the rename step
url12 <- get_countdata("Wexford-Ireland")
url13 <- get_countdata("Donegal-Ireland")
url14 <- get_countdata("Ennis-Ireland")
#Transpose all data
url <-transpose(url , ignore.empty = FALSE)
url1<-transpose(url1, ignore.empty = FALSE)
url2<-transpose(url2, ignore.empty = FALSE)
url3<-transpose(url3, ignore.empty = FALSE)
url4<-transpose(url4, ignore.empty = FALSE)
url5<-transpose(url5, ignore.empty = FALSE)
url6<-transpose(url6, ignore.empty = FALSE)
url7<-transpose(url7, ignore.empty = FALSE)
url8<-transpose(url8, ignore.empty = FALSE)
url9<-transpose(url9, ignore.empty = FALSE)
url10<-transpose(url10, ignore.empty = FALSE)
url11<-transpose(url11, ignore.empty = FALSE)
url12<-transpose(url12, ignore.empty = FALSE)
url13<-transpose(url13, ignore.empty = FALSE)
url14<-transpose(url14, ignore.empty = FALSE)
#Append all files
v<-rbind(url ,url1,url2,url3,url4,url5,url6,url7,
url8,url9,url10,url11,url12,url13,url14)
#Replicate first column
n=1
v1 = cbind(v, replicate(n,v$V1))
#Positioning of Replicated column
v2 <-v1[, c(1,16,2:15)]
#Drop Extra Rows
v2 <-v2[-c(4,7,10,13,16,19,22,25,28,31,34,37,40,
43),]
#Rename Row names
v2[1,1] <- 'County'
v2[2,1] <- 'Galway'
v2[4,1] <- 'Cork'
v2[6,1] <- 'Sligo'
v2[8,1] <- 'Kilkenny'
v2[10,1] <- 'Limerick'
v2[12,1] <- 'Mayo'
v2[14,1] <- 'Waterford'
v2[16,1] <- 'Longford'
v2[18,1] <- 'Offaly'
v2[20,1] <- 'Dublin'
v2[22,1] <- 'Wexford'
v2[24,1] <- 'Carlow'
v2[26,1] <- 'Kerry'
v2[28,1] <- 'Donegal'
v2[30,1] <- 'Leitrim'
#Remove extra Rows
v2<-v2[-c(3,5,7,9,11,13,15,17,19,21,23,25,27,29,
31),]
#Convert whole data into character
v2[] <- lapply(v2, as.character)
#Head First Row
colnames(v2) <- v2[1, ]
#Drop First Row
v2 <- v2[-1 ,]
str(v2)
#Convert data into numeric using lapply (function over a list)
v2[] <- lapply(v2, as.numeric)
#Write Formated File
write.csv(v2, file = "C:/ Users/MOLAP/Desktop/R Project/Updated/Numbio_Data_Sc
row.names=FALSE)
#Removed extra row while Reading File
Numbio_Final <- read.csv("C:/Users/MOLAP/Desktop/R Project/Updated/Numbio_Data_Scrapping_OP.csv", skip = 1)
#summary(Numbio_Final)
#Sum of all Crimes to find the Total number of Crime
Numbio_Final$Total <- as.numeric(Numbio_Final$Level.of.crime) +
  as.numeric(Numbio_Final$Crime.increasing.in.the.past.3.years) +
  as.numeric(Numbio_Final$Worries.home.broken.and.things.stolen) +
  as.numeric(Numbio_Final$Worries.being.mugged.or.robbed) +
  as.numeric(Numbio_Final$Worries.car.stolen) +
  as.numeric(Numbio_Final$Worries.things.from.car.stolen) +
  as.numeric(Numbio_Final$Worries.attacked) +
  as.numeric(Numbio_Final$Worries.being.insulted) +
  as.numeric(Numbio_Final$Worries.being.subject.to.a.physical.attack.because.of.your.skin.colour..ethnic.origin.or.religion) +
  as.numeric(Numbio_Final$Problem.people.using.or.dealing.drugs) +
  as.numeric(Numbio_Final$Problem.property.crimes.such.as.vandalism.and.theft) +
  as.numeric(Numbio_Final$Problem.violent.crimes.such.as.assault.and.armed.robbery) +
  as.numeric(Numbio_Final$Problem.corruption.and.bribery) +
  as.numeric(Numbio_Final$Safety.walking.alone.during.daylight) +
  as.numeric(Numbio_Final$Safety.walking.alone.during.night)
colnames(Numbio_Final)
#Rename Column name(Removed extra spaces)
setnames(Numbio_Final ,old = c("County","Level.of.crime","Crime.increasing.in.
colnames(Numbio_Final)
#Convert data to numeric
Numbio_Final$County <- as.character(Numbio_Final$County)
Numbio_Final$Level_of_crime <- as.numeric(Numbio_Final$Level_of_crime)
Numbio_Final$past_3_years <- as.numeric(Numbio_Final$past_3_years)
Numbio_Final$things_stolen <- as.numeric(Numbio_Final$things_stolen)
Numbio_Final$mugged_or_robbed <- as.numeric(Numbio_Final$mugged_or_robbed)
Numbio_Final$car_stolen <- as.numeric(Numbio_Final$car_stolen)
Numbio_Final$things_from_car_stolen <- as.numeric(Numbio_Final$things_from_car_stolen)
Numbio_Final$attacked <- as.numeric(Numbio_Final$attacked)
Numbio_Final$being_insulted <- as.numeric(Numbio_Final$being_insulted)
Numbio_Final$physical_attack <- as.numeric(Numbio_Final$physical_attack)
Numbio_Final$dealing_drugs <- as.numeric(Numbio_Final$dealing_drugs)
Numbio_Final$vandalism_and_theft <- as.numeric(Numbio_Final$vandalism_and_theft)
Numbio_Final$assault_and_armed_robbery <- as.numeric(Numbio_Final$assault_and_armed_robbery)
Numbio_Final$corruption_and_bribery <- as.numeric(Numbio_Final$corruption_and_bribery)
Numbio_Final$Safety_during_daylight <- as.numeric(Numbio_Final$Safety_during_daylight)
Numbio_Final$Safety_during_night <- as.numeric(Numbio_Final$Safety_during_night)
Numbio_Final$Total_Crime_Record <- as.numeric(Numbio_Final$Total_Crime_Record)
str(Numbio_Final)
#Record is for 2018, so adding Year Column
Numbio_Final$Year <- "2018"
#Positioning of column
Numbio_Final <- Numbio_Final[, c(18,1:17)]
#Write Final o/p File
write.csv(Numbio_Final, file = "C:/Users/MOLAP/Desktop/R Project/Updated/Numbio_Data_Scrapping_OP.csv", row.names=FALSE)
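The column-by-column as.numeric() conversions and the long sum in the Numbeo script above could also be expressed more compactly. The following is only a sketch under assumptions: the data frame scraped and its values are hypothetical stand-ins for Numbio_Final, and only two measure columns are shown. It converts every column except County in one lapply() call and builds the total with rowSums().

```r
# Hypothetical stand-in for the scraped table: every column arrives as text.
# Values are illustrative only.
scraped <- data.frame(
  County = c("Galway", "Cork"),
  Level_of_crime = c("35.2", "41.0"),
  car_stolen = c("20.5", NA),
  stringsAsFactors = FALSE
)

# Every column except the county label is treated as a measure
measure_cols <- setdiff(names(scraped), "County")

# Convert all measure columns to numeric in one step
scraped[measure_cols] <- lapply(scraped[measure_cols], as.numeric)

# Row-wise total over the measures; missing values are ignored
scraped$Total_Crime_Record <- rowSums(scraped[measure_cols], na.rm = TRUE)

str(scraped)
```

Selecting the columns by name rather than by position keeps the conversion correct even if the column order of the scraped table changes.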
2. Wikipedia Web page scraping Code
# Data Scraping for Project from Wikipedia
#Install Package
install.packages("htmltab", repos = "https://cran.rstudio.com")
install.packages('data.table', repos = "https://cran.rstudio.com")
#Load Library
library("htmltab")
library('data.table')
#Read URL
url <- "https://en.wikipedia.org/wiki/List_of_Irish_counties_by_population"
wiki <- htmltab(doc=url, which=1)
#Drop Column
wiki <-wiki[-c(1)]
#Drop Extra Rows
wiki <-wiki[-c(10,36,2,4,8,37,6,13,31,37),]
#Rename Column
wiki <-setnames(wiki , old=c("Administrative county"), new=c("County"))
#Add Extra Column
wiki$Year <-"2018"
#Rename Column Names
setnames(wiki ,old=c("County","Population",
"Density (/ k m )", "Traditional province","Change since previous census")
Province","Change_since_previous_census"))
#Drop column
wiki <-wiki[-c(3,4,5)]
#Positioning of column
wiki <- wiki[, c(3,1:2)]
#Remove Extra characters (Comma)
wiki$Population <-gsub(",", "",wiki$Population)
#Write the Final o/p data to a csv file
write.csv(wiki , file = "Wiki_Population_OP.csv",row.names=FALSE)
3. The World Bank - Unemployment Percentage basic and Advance Education
#Install packages
install.packages('openxlsx', repos = "https://cran.rstudio.com")
install.packages('data.table', repos = "https://cran.rstudio.com")
#Load library
library('openxlsx')
library('data.table')
setwd("C:/Users/MOLAP/Desktop/R Project/Updated")
#Read File
dataUnemployment <-read.xlsx("C:/ Users/MOLAP/Desktop/R Project/Updated/Unempl
#Remove Row
dataUnemployment <- dataUnemployment [-c(1:3),]
#Column Rename
dataUnemployment1 <- setnames(dataUnemployment, old=c("Series.Name","Labor.for
"Unemployment.with.basic.education ,
.male .(%. of.male.labor.force.with.basic.education)","Unemployment.with.basic.
(%.of.female.labor.force.with.basic.education)",
"Unemployment.with.advanced.education ,. female.
(%.of.female.labor.force.with.advanced.education)","Unemployment.with.advance
(%.of.male.labor.force.with.advanced.education)"), new=c("Year","Total_Labor_
"Unemployment_Basic_Education_Female_Percentage","Unemployment_Advance_Educat
,"Unemployment_Advance_Education_Male_Percentage"))
#numeric
dataUnemployment1$Total_Labor_Force_in_Million <- as.numeric(dataUnemployment1$Total_Labor_Force_in_Million)
dataUnemployment1$Unemployment_Basic_Education_Male_Percentage <- as.numeric(dataUnemployment1$Unemployment_Basic_Education_Male_Percentage)
dataUnemployment1$Unemployment_Basic_Education_Female_Percentage <- as.numeric(dataUnemployment1$Unemployment_Basic_Education_Female_Percentage)
dataUnemployment1$Unemployment_Advance_Education_Female_Percentage <- as.numeric(dataUnemployment1$Unemployment_Advance_Education_Female_Percentage)
dataUnemployment1$Unemployment_Advance_Education_Male_Percentage <- as.numeric(dataUnemployment1$Unemployment_Advance_Education_Male_Percentage)
#round off to 2 decimal places
dataUnemployment1$Total_Labor_Force_in_Million <- round(dataUnemployment1$Total_Labor_Force_in_Million, 2)
dataUnemployment1$Unemployment_Basic_Education_Male_Percentage <- round(dataUnemployment1$Unemployment_Basic_Education_Male_Percentage, 2)
dataUnemployment1$Unemployment_Basic_Education_Female_Percentage <- round(dataUnemployment1$Unemployment_Basic_Education_Female_Percentage, 2)
dataUnemployment1$Unemployment_Advance_Education_Female_Percentage <- round(dataUnemployment1$Unemployment_Advance_Education_Female_Percentage, 2)
dataUnemployment1$Unemployment_Advance_Education_Male_Percentage <- round(dataUnemployment1$Unemployment_Advance_Education_Male_Percentage, 2)
#Column Fields Rename
dataUnemployment1[1,1] <- "2007"
dataUnemployment1[2,1] <- "2008"
dataUnemployment1[3,1] <- "2009"
dataUnemployment1[4,1] <- "2010"
dataUnemployment1[5,1] <- "2011"
dataUnemployment1[6,1] <- "2012"
dataUnemployment1[7,1] <- "2013"
dataUnemployment1[8,1] <- "2014"
dataUnemployment1[9,1] <- "2015"
dataUnemployment1[10,1] <- "2016"
dataUnemployment1[11,1] <- "2017"
dataUnemployment1[12,1] <- "2018"
#Write the Final o/p data to a csv file
write.csv(dataUnemployment1, file = "Unemployment_Gender_Statistics_OP.csv", row.names = FALSE)
4. Statista - Population Growth Ireland
#Statista Data in R
#Install Packages
install.packages("openxlsx", repos = "https://cran.rstudio.com")
install.packages("data.table", repos = "https://cran.rstudio.com")
#Load library
library('openxlsx')
library('data.table')
#Read Raw File
dataStatista <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/population-growth-in-ireland-2018.xlsx", sheet = 2, startRow = 1, colNames = T, skipEmptyRows = TRUE)
#To remove blank rows
dataStatista <- dataStatista[-c(1:2),]
#Drop a column
dataStatista <- dataStatista[,-3]
#Convert Data to numeric
dataStatista$X4 <- as.numeric(dataStatista$X4)
dataStatista$X2 <- as.numeric(dataStatista$X2)
#Round off to 3 decimal places
dataStatista$X4 <- round(dataStatista$X4, 3)
dataStatista$X2 <- round(dataStatista$X2, 3)
#Column Rename
setnames(dataStatista, old=c("Population.growth.in.Ireland.2017",
"X2","X4"), new=c("Year", "Percentage_Rise","Population_in_MIllions"))
dataStatista[1,1] <- "2007"
dataStatista[2,1] <- "2008"
dataStatista[3,1] <- "2009"
dataStatista[4,1] <- "2010"
dataStatista[5,1] <- "2011"
dataStatista[6,1] <- "2012"
dataStatista[7,1] <- "2013"
dataStatista[8,1] <- "2014"
dataStatista[9,1] <- "2015"
dataStatista[10,1] <- "2016"
dataStatista[11,1] <- "2017"
dataStatista[12,1] <- "2018"
#Convert data to numeric
dataStatista$Percentage_Rise <- as.numeric(dataStatista$Percentage_Rise)
dataStatista$Population_in_MIllions <- as.numeric(dataStatista$Population_in_MIllions)
str(dataStatista)
#write the final o/p data to a csv file
write.csv(dataStatista, file = "Population_Growth_Statista_OP.csv", row.names=FALSE)
5. Central Statistics Office Ireland - Crime Recorded in Ireland
#Install packages
install.packages('openxlsx', repos = "https://cran.rstudio.com")
install.packages('data.table', repos = "https://cran.rstudio.com")
#Load library
library('openxlsx')
library('data.table')
setwd("C:/Users/MOLAP/Desktop/R Project/Updated")
#Read Raw File
datacrimeyearwise <-read.xlsx("C:/ Users/MOLAP/Desktop/R Project/Updated/Crime
#Rename First Column Name
datacrimeyearwise [1,1]<- "Year"
#Converting Quarters into Year (addition of quarters to make year)
datacrimeyearwise $Y2007 <- as.numeric( datacrimeyearwise $X2)+ as.numeric( datac
( datacrimeyearwise $X4)+as.numeric
( datacrimeyearwise $X5)
datacrimeyearwise $Y2008 <- as.numeric( datacrimeyearwise $X6) + as.numeric( data
( datacrimeyearwise $X8)+as.numeric
( datacrimeyearwise $X9)
datacrimeyearwise $Y2009 <- as.numeric( datacrimeyearwise $X10) + as.numeric( da
( datacrimeyearwise $X12)+as.numeric
( datacrimeyearwise $X13)
datacrimeyearwise $Y2010 <- as.numeric( datacrimeyearwise $X14) + as.numeric( da
( datacrimeyearwise $X16)+as.numeric
( datacrimeyearwise $X17)
datacrimeyearwise $Y2011 <- as.numeric( datacrimeyearwise $X18) + as.numeric( da
( datacrimeyearwise $X20)+as.numeric
( datacrimeyearwise $X21)
datacrimeyearwise $Y2012 <- as.numeric( datacrimeyearwise $X22) + as.numeric( da
( datacrimeyearwise $X24)+as.numeric
( datacrimeyearwise $X25)
datacrimeyearwise $Y2013 <- as.numeric( datacrimeyearwise $X26) + as.numeric( da
( datacrimeyearwise $X28)+as.numeric
( datacrimeyearwise $X29)
datacrimeyearwise $Y2014 <- as.numeric( datacrimeyearwise $X30) + as.numeric( da
( datacrimeyearwise $X32)+as.numeric
( datacrimeyearwise $X33)
datacrimeyearwise $Y2015 <- as.numeric( datacrimeyearwise $X34) + as.numeric( da
( datacrimeyearwise $X36)+as.numeric
( datacrimeyearwise $X37)
datacrimeyearwise $Y2016 <- as.numeric( datacrimeyearwise $X38) + as.numeric( da
( datacrimeyearwise $X40)+as.numeric
( datacrimeyearwise $X41)
datacrimeyearwise $Y2017 <- as.numeric( datacrimeyearwise $X42) + as.numeric( da
( datacrimeyearwise $X44)+as.numeric
( datacrimeyearwise $X45)
datacrimeyearwise $Y2018 <- as.numeric( datacrimeyearwise $X46) + as.numeric( da
( datacrimeyearwise $X48)+as.numeric
( datacrimeyearwise $X49)
#Rename Row names after converting quarter into years
datacrimeyearwise [1,50]<- "2007"
datacrimeyearwise [1,51]<- "2008"
datacrimeyearwise [1,52]<- "2009"
datacrimeyearwise [1,53]<- "2010"
datacrimeyearwise [1,54]<- "2011"
datacrimeyearwise [1,55]<- "2012"
datacrimeyearwise [1,56]<- "2013"
datacrimeyearwise [1,57]<- "2014"
datacrimeyearwise [1,58]<- "2015"
datacrimeyearwise [1,59]<- "2016"
datacrimeyearwise [1,60]<- "2017"
datacrimeyearwise [1,61]<- "2018"
#Transpose Data
final_df <-as.data.frame(t( datacrimeyearwise ))
str(final_df)
#colnames(final_df) = final_df[1, ]
#Convert whole data into character
final_df[] <- lapply(final_df , as.character)
#Head First Row
colnames(final_df) <- final_df[1, ]
#Drop First Row
final_df <- final_df[-1 ,]
str(final_df)
#Convert data into numeric using lapply (function over a list)
final_df[] <- lapply(final_df , as.numeric)
#Remove Extra Row
final_df <-final_df[-c(1:48),]
#Sum of all offences to make Total Offences
crimeoutput <-cbind(final_df , Total_Offences = rowSums(final_df))
#Remove Extra Column
crimeoutput <-crimeoutput[-c(2:76)]
#To check column Names
colnames(crimeoutput)
#To check Structure Names
str(crimeoutput)
#Writing Final o/p File
write.csv(crimeoutput , file = "Crime_Rate_Yearwise_OP.csv",row.names=FALSE)
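The quarter-to-year conversion in the CSO script above adds four quarterly columns per year by hand. A more general way to express the same idea is sketched below. It is an illustration under assumptions: the table quarterly, its column labels of the form 2007Q1 and all values are hypothetical, whereas the real file is read with positional names such as X2 to X49. Quarter columns are grouped by the year in their label and summed row by row.

```r
# Hypothetical quarterly table: one row per offence type, one column per quarter.
# Values are illustrative only.
quarterly <- data.frame(
  Offence = c("Theft", "Burglary"),
  `2007Q1` = c(10, 4), `2007Q2` = c(12, 5), `2007Q3` = c(11, 6), `2007Q4` = c(9, 3),
  `2008Q1` = c(13, 4), `2008Q2` = c(14, 5), `2008Q3` = c(12, 7), `2008Q4` = c(10, 6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Identify the quarter columns and the year each one belongs to
quarter_cols <- grep("^[0-9]{4}Q[1-4]$", names(quarterly), value = TRUE)
year_of <- substr(quarter_cols, 1, 4)

# For each year, add up its quarter columns row by row
yearly <- sapply(split(quarter_cols, year_of),
                 function(cols) rowSums(quarterly[cols]))
yearly <- data.frame(Offence = quarterly$Offence, yearly,
                     check.names = FALSE, stringsAsFactors = FALSE)

# Total offences per year across all offence types
total_offences <- colSums(yearly[-1])
print(yearly)
print(total_offences)
```

Grouping by label avoids hard-coding column positions, so adding another year of quarterly data would not require any change to the code.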
6. Kaggle - Crimes at Ireland Garda Stations 2007-2017
#Install package
install.packages("openxlsx", repos = "https://cran.rstudio.com")
install.packages("data.table", repos = "https://cran.rstudio.com")
#Load library
library('openxlsx')
library('data.table')
crimerecord <-read.xlsx("C:/ Users/MOLAP/Desktop/R Project/Updated/Crime_Year_
crimerecord <-crimerecord[-c(2:11,14:23,26:35,38:47,50:59,62:71,74:83,86:95,98
#Copy cell contents from one cell to Another
crimerecord [] <- lapply(crimerecord , function(x)
{
i1 <- which(is.na(x))
replace(x, i1, x[i1-1])
})
#Delete with certain condition to get single year Data
crimerecord <- crimerecord[!(crimerecord$Year == "2007"),]
#To make Data Consistent Rename Row Names
crimerecord[4,2] <- "Dublin"
crimerecord <- setnames(crimerecord, old=c('id','Station','Divisions','Year',
'Number.of.Crime.Record'), new=c('Station_ID',
'County','Station_Division','Year',
'Total_Crime_Record'))
#Positioning of column
crimerecord <- crimerecord[, c(1,4,2,3,5)]
#Write Final o/p File
write.csv(crimerecord , file = "Crime_Record_OP.csv",row.names=FALSE)
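One detail of the Garda station script above is worth illustrating. The step that copies cell contents replaces each missing value with the value directly above it, so in a run of several consecutive missing cells only the first would be filled by that line. The sketch below shows a fill-down that carries the last non-missing value forward across any number of gaps. The helper name fill_down and the sample vector are hypothetical and not part of the original script.

```r
# Carry the last non-missing value forward (hypothetical helper).
fill_down <- function(x) {
  filled <- !is.na(x)
  # Position of the most recent non-missing value at each index (0 if none yet)
  last_seen <- cummax(seq_along(x) * filled)
  out <- x
  has_source <- last_seen > 0
  out[has_source] <- x[last_seen[has_source]]
  out
}

# Example: a column where the county label appears only on the first row of each block
county_col <- c("Dublin", NA, NA, "Cork", NA)
print(fill_down(county_col))

# Applied to every column of a data frame, in the same way as the original lapply():
# crimerecord[] <- lapply(crimerecord, fill_down)
```

Leading missing values are left as they are, since there is no earlier value to copy from.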
7. Department of Business, Enterprise and Innovation - 4. Employment Permit Statistics
#Install package
install.packages("openxlsx", repos = "https://cran.rstudio.com")
install.packages("data.table", repos = "https://cran.rstudio.com")
#Load library
library('openxlsx')
library('data.table')
#Read Raw File
permitData <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/Employee Permit Countywise.xlsx", sheet = 1, startRow = 1,
colNames = T, skipEmptyRows = TRUE)
#Rename for Column name
permitData[1,2] = "Total_Work_Permit"
#Drop First Column
permitData <-permitData[-1,-1]
#Add Extra Column
permitData$Year <-"2018"
#Positioning of column
permitData <- permitData[, c(7,1:6)]
#Rename Column name
permitData <-setnames(permitData , old=c("Year","County/Country","Total"), new=
#Convert data to numeric typecasting
permitData$Year <-as.numeric(permitData$Year)
permitData$New <-as.numeric(permitData$New)
permitData$Renewal <-as.numeric(permitData$Renewal)
permitData$Total_Permit <-as.numeric(permitData$Total_Permit)
permitData$Refused <-as.numeric(permitData$Refused)
permitData$Withdrawn <-as.numeric(permitData$Withdrawn)
#Write Final o/p File
write.csv(permitData , file ="Employee_Permitdata_OP.csv",row.names=FALSE)

DWBI - Criminalytics: Entities affecting the Rate of Crime in Republic of Ireland

  • 1. Data Warehousing and Business Intelligence Project on Criminalytics: Entities affecting the Rate of Crime in Republic of Ireland Shrikant Uday Samarth x18129137 MSc/PGDip Data Analytics – 2019/20 Submitted to: Sean Heeney
  • 2. National College of Ireland Project Submission Sheet – 2017/2018 School of Computing Student Name: Shrikant Uday Samarth Student ID: x18129137 Programme: MSc Data Analytics Year: 2019/20 Module: Data Warehousing and Business Intelligence Lecturer: Dr. Simon Caton Submission Due Date: 12/04/2019 Project Title: Criminalytics: Entities affecting the Rate of Crime in Republic of Ireland I hereby certify that the information contained in this (my submission) is information pertaining to my own individual work that I conducted for this project. All information other than my own contribution is fully and appropriately referenced and listed in the relevant bibliography section. I assert that I have not referred to any work(s) other than those listed. I also include my TurnItIn report with this submission. ALL materials used must be referenced in the bibliography section. Students are encouraged to use the Harvard Referencing Standard supplied by the Library. To use other author’s written or electronic work is an act of plagiarism and may result in disci- plinary action. Students may be required to undergo a viva (oral examination) if there is suspicion about the validity of their submitted work. Signature: Date: April 12, 2019 PLEASE READ THE FOLLOWING INSTRUCTIONS: 1. Please attach a completed copy of this sheet to each project (including multiple copies). 2. You must ensure that you retain a HARD COPY of ALL projects, both for your own reference and in case a project is lost or mislaid. It is not sufficient to keep a copy on computer. Please do not bind projects or place in covers unless specifically requested. 3. Assignments that are submitted to the Programme Coordinator office must be placed into the assignment box located outside the office. Office Use Only Signature: Date: Penalty Applied (if applicable):
  • 3. Table 1: Mark sheet – do not edit Criteria Mark Awarded Comment(s) Objectives of 5 Related Work of 10 Data of 25 ETL of 20 Application of 30 Video of 10 Presentation of 10 Total of 100
  • 4. Project Check List This section capture the core requirements that the project entails represented as a check list for convenience. Used LATEX template Three Business Requirements listed in introduction At least one structured data source At least one unstructured data source At least three sources of data Described all sources of data All sources of data are less than one year old, i.e. released after 17/09/2017 Inserted and discussed star schema Completed logical data map Discussed the high level ETL strategy Provided 3 BI queries Detailed the sources of data used in each query Discussed the implications of results in each query Reviewed at least 5-10 appropriate papers on topic of your DWBI project
  • 5. Criminalytics: Different entities affecting the Rate of Crime in Republic of Ireland Shrikant Uday Samarth x18129137 April 12, 2019 Abstract The aim of this paper is to understand the spatial patterns of crimes held in Ireland and to develop better understanding of the role of different factors like migration, population, unemployment etc. being the main inputs to the rate of increase or decrease in crime. In recent years according to the research papers there were complaints of various types of violence and crime Bacon & O’Donoghue (1975). To address the concern of public safety this Data Warehouse Business Intelligence model has been developed based on the data from various sources. This analysis includes data sources from wide range of varieties like Statista, Central Statistics Office, numbio and many others that helped in understanding the mentioned factors being involved in criminal rate of Republic of Ireland. These findings can be issued and used by the Garda Siochana i.e. the Irish Police department to understand the patterns and predict the increase in crime rates in a specific county through studying factors. 1 Introduction A law is defined by every country to maintain the decorum of the country. It includes few set of rules that help the implementer and the follower believe that there is harmony and well being with every living being. But when such laws are broken, there is a discomfort. That discomfort can be defined as a crime. When we speak about crime, the definition that Oxford Dictionary of Sociology specifies is an offence which goes beyond the personal and into the public sphere, breaking prohibitory rules or laws, to which legitimate punishments or sanctions are attached, and which requires the intervention of a public authority Scott & Marshall (2009). So, a crime can be termed as an outcome of disobedience to the set of protocols called as law. There are many notable scientists that have given different definition on crime. 
”An intentional act or omission in violation of criminal law, committed, without defense or justification and sanctioned by law as felony or misdemeanor Paul (2016).” There exist crimes in least corrupted countries, but the aim should be a crime free nation. The aim of this project is to understand the patterns of crimes in Ireland. The motivation towards it is that how data analytics can help in reduction in crimes rates and bring about new reforms to curb the crimes in Ireland. 1
  • 6. (Req-1) Does the unemployment in the country have any effect on increase or decrease in rate of crime and population of Ireland? (Req-2) Does the increase in population of counties cause any effect on the number of crime rates registered in the Garda Stations of the respective counties? (Req-3) Does the immigration in the country have any effect on increase or decrease in rate of crime in Ireland? 2 Data Sources The source for the data places a very crucial role in understanding the pattern in crime and with the quality of data also comes the quantity which helps in more concrete analysis of data. The project implementation required the use of 2 Unstructured and 5 Structured Data Sets from different Data Sources to compare different parameters of crime and provide a conclusive and firm analytics output to the theme. Source Type Brief Summary Numbeo.com Unstructured It gives different parameters of crime com- mitted in each county of Ireland. Wikipedia.com Unstructured It gives county-wise population of Ireland. Databank.com Structured It gives unemployment percentage of Male and Female by basic and Advance Educa- tion. Statista.com Structured It provides year-wise (from year 2007-2007) population of Ireland. CSO.com Structured This source gives vaious crime offences from 2007-18 quarterly data which is used calcu- late year-wise total crime offences. Kaggle.com Structured This source provides criminal offenses com- mitted and recorded in 563 different Garda county Stations year wise. dbei.gov.ie Structured This source gives county-wise work permit issued by the companies for the year 2018. Table 2: Summary of sources of data used in the project 2.1 Source 1: Numbeo - Crime in Ireland county-wise: This dataset is the most important unstructured data with regards to all the datasets used as this is the dataset that gives us different parameters of crime committed in every county of Ireland. 
This dataset has all levels of crime committed and those are marked in the form of percentage. The dataset tells us the percentage and it also signifies the intensity of crime rate in form of parameters like Low, Moderate and High values. Using the Xpath of the web page all levels of crimes with its crime index has been extracted using the R studio. In this way 15 counties data have been extracted from 15 Numbeo web pages.
  • 7. https://www.numbeo.com/crime/in/Galway https://www.numbeo.com/crime/in/Cork https://www.numbeo.com/crime/in/Sligo-Ireland https://www.numbeo.com/crime/in/Kilkenny https://www.numbeo.com/crime/in/Limerick https://www.numbeo.com/crime/in/Mayo https://www.numbeo.com/crime/in/Waterford https://www.numbeo.com/crime/in/Longford https://www.numbeo.com/crime/in/Offaly https://www.numbeo.com/crime/in/Dublin https://www.numbeo.com/crime/in/Wexford-Ireland https://www.numbeo.com/crime/in/Carlow-Ireland https://www.numbeo.com/crime/in/Kerry https://www.numbeo.com/crime/in/Kerry https://www.numbeo.com/crime/in/Leitrim Figure 1: Source 1 2.2 Source 2: Wikipedia -List of Irish Counties by population: This dataset consists of list of counties of Ireland which are ordered by population. The data has been taken from the latest census of Republic of Ireland. This source contains 4 columns of information (i.e population, density, traditional province and change in previous year census) on 32 counties. The data is extracted using htmltab and data.table packages through R studio. https://en.wikipedia.org/wiki/List_of_Irish_counties_by_population
  • 8. Figure 2: Source 2 2.3 Source 3: Databank - Unemployment: percentage with ba- sic and advanced education This dataset consists of data of unemployed males and females of Ireland who have gained a basic and advanced education. This data is used to understand how unemployment can be a cause of increase or decrease in crime rates in Ireland. The unemployment for both male and female is in the form of percentage against the Total Labor Force of Ireland. The above mentioned parameters were selected from the website link given below. The data was downloaded in the .xlsx format. This source contains 5 columns of information for 2007-2018 year. Worldbank website never provides data created or updated information. As the data source contains 2018 year data; which is within project requirement range of one-year time frame. https://databank.worldbank.org/data/source/world-development-indicators 2.4 Source 4: Statista - Population Growth in Ireland: This is an important dataset from Statista which displays the population growth in Ireland past 10 years. The reason this data is very important is to understand the pattern of population growth and its effect on rise or fall of criminal activities in the state. The dataset was released on November 2018 which gives population growth percentage compared to previous years. To make the uniformity with all the dataset, year 2018 information was added from the source mentioned in the details section on the link. Moreover, percentage data has been converted into population in millions which was achieved by adding 2006 population which was in millions Irish population analysis (2019) by doing the calculations in the excel file. Then, data cleaning is done through R studio. https://www.statista.com/statistics/376895/population-growth-in-ireland/
  • 9. Figure 3: Source 4
2.5 Source 5: Central Statistics Office - Crime Recorded in Ireland
The third dataset, used alongside the two datasets above, contains the crimes recorded in Ireland over the last decade. The CSO has released this data publicly to inform citizens; it is well suited to understanding how recorded crime varies from year to year and how it depends on the unemployment rate and population growth. From the CSO link below, table CJQ01 was used, which gives quarterly counts for 75 types of recorded crime offences. The data was published by the CSO on 22/03/2019 at 11:00:00. The quarterly data was then aggregated to yearly totals in RStudio for uniformity with the other datasets.
https://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/Define.asp?MainTable=CJQ01&TabStrip=Select&PLanguage=0&FF=1
Figure 4: Source 5
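The quarter-to-year conversion described for the CSO data can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the table layout (one row per offence type, one column per quarter), the column labels such as 2007Q1 and all counts are hypothetical, and the project's own code in the Appendix may differ.

```r
# Hypothetical wide table: one row per offence type, one column per quarter.
quarterly <- data.frame(
  Offence  = c("Theft", "Burglary"),
  `2007Q1` = c(10, 5), `2007Q2` = c(12, 6),
  `2007Q3` = c(11, 4), `2007Q4` = c(9, 7),
  `2008Q1` = c(13, 8), `2008Q2` = c(14, 6),
  `2008Q3` = c(12, 5), `2008Q4` = c(10, 9),
  check.names = FALSE
)

quarter_cols <- setdiff(names(quarterly), "Offence")
years <- substr(quarter_cols, 1, 4)   # "2007Q1" -> "2007"

# Sum over offence types within each quarter, then sum quarters within a year.
quarter_totals <- colSums(quarterly[, quarter_cols])
yearly <- tapply(quarter_totals, years, sum)

total_offences <- data.frame(
  Year = as.integer(names(yearly)),
  TotalOffences = as.numeric(yearly)
)
print(total_offences)
```

The result has one row per year, which is the shape needed to join with the other year-based sources.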
  • 10. 2.6 Source 6: Kaggle - Crimes at Ireland Garda Stations 2007-2017
This file contains the original variables on criminal offences committed and recorded, year by year, at 563 different Garda stations. The dataset helps us understand what types of crime were committed, together with the ID of the station where they were reported, over the decade from 2007 to 2017. The data was created on 26/3/2019 and contains 4 columns of information for each county, with recorded offences per year.
https://www.kaggle.com/johnpwatson/crimesatirelandgardastations20072017
Figure 5: Source 6
2.7 Source 7: Department of Business Enterprise and Innovation - Employment Permit Statistics 2018
This is statistical data published by the Irish Government on county-wise work permits for companies for the year 2018. It is part of an open publication made available to the public by the Republic of Ireland. The dataset helps us understand immigration into each county of Ireland, giving a detailed picture of the permits granted or not granted to individuals. The data was released on 16/01/2019 and contains 5 columns of information for each county.
https://dbei.gov.ie/en/Publications/Employment-Permit-Statistics-2018.html
Figure 6: Source 7
  • 11. 3 Related Work
According to research on crime records in Ireland, population growth has a significant relationship with the increase in the number of crimes. In The Economics of Crime in the Republic of Ireland: An Exploratory Paper, Peter Bacon and Martin O'Donoghue explored the possibility of applying models developed elsewhere to an analysis of rising crime rates in Ireland. Their analysis indicates that rising unemployment is associated with an increase in crimes against property with violence and with a decrease in crimes against property without violence. This motivated the question of whether unemployment is a cause of the increase in crime rates in the Republic of Ireland Bacon & O’Donoghue (1975). Taking this as a hypothesis, other parameters that may also contribute to rising crime rates in Ireland were examined alongside it.
One of the factors considered here is the unemployment rate in Ireland over a period of 10 years. In The Irish Labour Market and the Great Recession, Alan Barrett and Seamus McGuinness found that the 2008-09 recession caused a drastic drop in employment, and the crime dataset shows that crime rates were rising during the recession period. The data suggests that, to earn a livelihood during the recession, people resorted to criminal activity in this period more than in any other year Barrett & McGuiness (2012). Moreover, in The Impact of Ireland's Recession on the Labour Market Outcomes of its Immigrants, Alan Barrett and Elish Kelly report that employment permits plummeted during this period, which motivated me to investigate this aspect in order to understand the patterns in the mentality of criminals before they commit crimes Barrett & Kelly (2012).
So, to check the findings for the last decade, the Databank source was used for unemployment, the crime dataset was taken from the CSO, and the Statista source was used to check for any major change in population. Barrett and Kelly also discussed the decline in the immigrant population, which motivated me to examine the population of each Irish county, taken from Wikipedia. For work permits, the county-wise employment permit dataset was taken from the DBEI government website and compared with the county crime index dataset, which is based on people's perceptions and was web-scraped from the Numbeo website. Furthermore, a CSO report, Review of the Quality of Crime Statistics, explains how statistics of recorded crime play a vital role in informing society of the level and types of offence CSO Review of the Quality of Crime Statistics 2016 (2016); this motivated me to examine the Garda station records dataset, which was sourced from Kaggle.
  • 12. 4 Data Model
Data warehousing is a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components that allows the strategic use of data: the electronic storage of a large amount of information by a business, designed for query and analysis rather than transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference What Is Data Warehousing? Types, Definition Example (n.d.).
There are 2 approaches that can be followed when building a data warehouse:
1. Inmon
2. Kimball
When it comes to designing a data warehouse for a business, the two most commonly discussed methods are the approaches introduced by Bill Inmon and Ralph Kimball. In Bill Inmon's enterprise data warehouse approach (the top-down design), a normalized data model is designed first; the dimensional data marts, which contain the data required for specific business processes or departments, are then created from the data warehouse. In Ralph Kimball's dimensional design approach (the bottom-up design), the data marts facilitating reports and analysis are created first; these are then combined to create a broad data warehouse George (2019).
In this project, Kimball's approach was used to design the dimensional model for the data warehouse, because it occupies less space, is easier to manage and is faster than Inmon's approach. Ralph Kimball always supported the inclusion of end users in the process throughout his work Chhabra & Pahwa (2014). This approach suits my research project, as the queries are intended to help make society a better place to live. If the fact table needed to change in future, Inmon's approach would be the better choice.
But in our situation, no such changes are expected. Therefore, Kimball's methodology is the better alternative for this project. To accomplish the goals of the project, I joined the datasets on year (2007-2018), which is common to the Databank, CSO and Statista datasets, whereas the unstructured datasets from Wikipedia and Numbeo were joined with the DBEI and Kaggle datasets, which are county-based, for the year 2018. From these datasets I derived 2 dimensions, DimYear and DimCounty, which are discussed below.
DimYear: The year dimension was created because the year parameter is common to all datasets. The CSO data contains crime offences by year, Databank contains year-wise unemployment percentages for males and females with basic and advanced education, and Statista contains year-wise population data. Year is therefore the common parameter for relating the unemployment percentage to crime offences, and it is the primary parameter in this data warehouse project. Using SSIS, YearId was generated and assigned as the primary key for the year dimension.
DimCounty: The county dimension was created because county is common to Numbeo and Wikipedia, the unstructured datasets, and to Kaggle and DBEI, the structured datasets. County is the common parameter on which the crime rate from Numbeo and the employment permits are compared. County is therefore also a primary key parameter in this data warehouse. Using SSIS, CountyID was generated and assigned as a
  • 13. primary key for the county dimension. The figure below illustrates the star schema for this project, which is further used in the Tableau visualization:
Figure 7: Star Schema
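The project builds its dimensions and fact table in SSIS and SQL Server; the base R sketch below only illustrates the idea behind the star schema, namely a dimension with a generated surrogate key and a fact table populated through inner joins. The staging tables, column names and values are hypothetical placeholders.

```r
# Hypothetical staging tables (placeholder values, not project data).
raw_crime <- data.frame(Year = c(2007, 2008, 2009),
                        TotalOffences = c(100, 120, 110))
raw_unemp <- data.frame(Year = c(2008, 2009, 2010),
                        UnemploymentPct = c(6.5, 12.0, 14.0))

# DimYear: the distinct years with a generated surrogate key.
all_years <- sort(unique(c(raw_crime$Year, raw_unemp$Year)))
dim_year <- data.frame(YearID = seq_along(all_years), Year = all_years)

# Fact table: merge() performs an inner join by default, so only years
# present in every source survive.
fact <- merge(dim_year, raw_crime, by = "Year")
fact <- merge(fact, raw_unemp, by = "Year")
fact <- fact[, c("YearID", "TotalOffences", "UnemploymentPct")]

print(dim_year)
print(fact)
```

A county dimension would follow the same pattern with CountyID as the surrogate key. The fact table stores only keys and measures, so descriptive attributes live once in the dimension tables.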
  • 14. 5 Logical Data Map
The dimensions and facts of all the datasets are explained below.
Table 3: Sources, destinations and transformations for all components in the Logical Data Map
Source | Column | Destination | Column | Type | Transformation
1 | Year | DimYear | Year | Dimension | Primary key; year 2018 added to the table, as given on the website
1 | County | DimCounty | County | Dimension | Primary key; county names renamed for consistency
1 | Level of crime | FactTable | Level of crime | Fact | Rounded to 2 decimals
1 | Safety during daylight | FactTable | Safety during daylight | Fact | Rounded to 2 decimals
1 | Total Crime Record | FactTable | Total Crime Record | Fact | Rounded to 2 decimals
2 | Year | DimYear | Year | Dimension | Primary key; year 2018 added to the table, as given on the website
2 | Administrative County | DimCounty | County | Dimension | Primary key; county names renamed for consistency with the other datasets
2 | Population | FactTable | Population | Fact | Commas removed from the values
3 | Year | DimYear | Year | Dimension | Primary key; format matched with the other datasets (it was in 2007 [YR2007] format)
3 | Labor force, total | FactTable | Total Labor Force in Million | Fact | Rounded to 3 decimals
  • 15.
3 | Unemployment with basic education, male (% of male labor force with basic education) | FactTable | Unemployment Basic Education Male Percentage | Fact | Rounded to 2 decimals
3 | Unemployment with basic education, female | FactTable | Unemployment Basic Education Female Percentage | Fact | Rounded to 2 decimals
3 | Unemployment with Advance education, female | FactTable | Unemployment Advance Education Female Percentage | Fact | Rounded to 2 decimals
3 | Unemployment with Advance education, male | FactTable | Unemployment Advance Education Male Percentage | Fact | Rounded to 2 decimals
4 | Year | DimYear | Year | Dimension | Primary key; year 2018 added from the source mentioned in the description
4 | Population growth compared to previous year | FactTable | Percentage Rise | Fact | Rounded to 2 decimals
  • 16.
4 | Population Million | FactTable | Population in Million | Fact | Rounded to 2 decimals
5 | Year (quarter-wise) | DimYear | Year | Dimension | Quarter-wise year converted into year format, then transposed to match the other datasets
5 | Total Offences | FactTable | Total Offences | Fact | Offence count columns summed into total offences
5 | Revenue | FactTable | Revenue | Fact | Rounded to nearest million
6 | Year | DimYear | Year | Dimension | Year 2018 taken from the table; column repositioned
6 | Station | DimCounty | County | Dimension | Taken as is from the source
6 | Number_of_Crime_Record | FactTable | Total Crime Record | Fact | Commas removed from the values
7 | Year | DimYear | Year | Dimension | Taken as is from the source
7 | County/Country | DimCounty | County | Dimension | Extra spaces removed from the column
7 | New | FactTable | New | Fact | Taken as is from the source
7 | Renewal | FactTable | Renewal | Fact | Taken as is from the source
7 | Total | FactTable | Total Permit | Fact | Extra commas removed from the values
7 | Refused | FactTable | Refused | Fact | Taken as is from the source
7 | Withdrawn | FactTable | Withdrawn | Fact | Taken as is from the source
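Several transformations in the map (year format matching, comma removal, trimming extra spaces, rounding) are simple string and numeric clean-ups. The base R sketch below shows one way each could be done; the input values and variable names are hypothetical and only mimic the formats described in the table.

```r
# Hypothetical raw values mimicking the formats described in the map.
year_labels <- c("2007 [YR2007]", "2008 [YR2008]")
population  <- c("1,234,567", "89,012")
county      <- c("  Dublin ", "Cork  ")
rate        <- c(7.456, 12.344)

# "2007 [YR2007]" -> 2007: strip the bracketed suffix, then convert.
year_clean <- as.integer(sub(" \\[YR[0-9]{4}\\]$", "", year_labels))

# Remove thousands separators so the values can be stored as numbers.
population_clean <- as.numeric(gsub(",", "", population))

# Trim leading and trailing spaces so county names match across sources.
county_clean <- trimws(county)

# Round measures to 2 decimals.
rate_clean <- round(rate, 2)
```

Doing these clean-ups before loading keeps the join keys (year and county) identical across all seven sources.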
  • 17. 6 ETL Process
ETL stands for Extract, Transform and Load and is considered the foundation of a database or data warehouse, helping to reduce errors and minimize data loss. A high-level perspective of the system can be visualized by conceptual modelling of the ETL process, which offers advantages such as system error identification, cost minimization, and risk and scope assessment Biswas et al. (2019). In the ETL process, the data has to be cleaned and transformed according to the requirements, with the aim of removing all redundancy. The ETL procedure used in this project is as follows:
Figure 8: ETL Process
6.1 Extraction:
The structured datasets were downloaded from the CSO, Statista, Databank, DBEI and Kaggle websites in Excel format and cleaned in R, whereas the unstructured datasets were web-scraped with R from the Wikipedia and Numbeo websites.
The first structured dataset covers recorded crime in Ireland. It had a total of 48 quarterly columns for 2007-2018, which were converted to years by summing the quarters of each year. The second dataset, from Statista, gives Ireland's population growth for 2007-2017. The third structured source, Databank, provides unemployment percentages by gender. The extracted Excel file contains 6 columns: year (2007-2018), labor force, unemployment percentage with basic education (male and female) and unemployment percentage with advanced education (male and female). The fourth structured dataset was extracted in .xlsx format from the DBEI website. This dataset was for county-wise employee
  • 18. permits by company for the year 2018. The dataset contains 27 rows of counties and 7 columns: year, county, new permits, renewed permits, total permits, and refused and withdrawn permits from the Irish embassy. The fifth dataset, taken from Kaggle, covers Garda station crime records for each county. The Excel file initially had three sheets; we used the third sheet, which gives the year-wise county station records. The table contains 5 columns: station ID, station, division, year and number of crime records.
For the unstructured data, Numbeo was the first source. We scraped 15 web pages, as each page gives the crime-level data for one county. Data for the remaining counties could not be scraped because no information was available for them. Several R packages and a custom function were used to extract the data for the 15 counties. The data was selected from the website using XPath. All the pages were then appended into one data frame, and unwanted extra columns and rows were deleted. For consistency, county names were renamed to match the other datasets.
For the second unstructured dataset, we extracted population data from Wikipedia. The dataset contains 6 columns: rank, administrative county, population, density, traditional province and change since the previous census. The data was extracted from Wikipedia using the htmltab package. Unwanted columns were then removed and the remaining columns were renamed for consistency across the datasets.
6.2 Transformation:
After extraction and before loading into the data warehouse, the data should be made fit for the business requirements. Data transformation may include activities such as cleaning, joining and generating calculated data from existing values.
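The page-by-page extraction described in Section 6.1, where each county page is scraped and the results are appended into a single data frame before county names are harmonised, can be sketched as below. To keep the example runnable without network access, scrape_county is a hypothetical stand-in that returns a placeholder row; it is not the project's scraper, and the CrimeIndex column is an assumed name.

```r
# Hypothetical stand-in for the real scraper: returns one row per county.
# A real version would download and parse the county page instead.
scrape_county <- function(keyword) {
  data.frame(County = keyword,
             CrimeIndex = NA_real_,   # placeholder, no real value scraped
             stringsAsFactors = FALSE)
}

# URL keywords as they appear in the Numbeo links of Source 1.
keywords <- c("Galway", "Sligo-Ireland", "Carlow-Ireland")

# Scrape each page, then append all pages into one data frame.
pages <- lapply(keywords, scrape_county)
crime_index <- do.call(rbind, pages)

# Harmonise county names: drop the "-Ireland" suffix used in some URLs.
crime_index$County <- sub("-Ireland$", "", crime_index$County)

print(crime_index)
```

Collecting the pages in a list and binding them once avoids growing the data frame inside a loop.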
This part of the ETL process is the most critical and time-consuming, as the data must be as clean as possible to deliver an accurate business solution. All the structured datasets were in .xlsx format. First, I identified the fields required from these datasets and those that needed to be modified for my BI queries; extra columns and spaces were then removed. The CSO dataset was given in quarters, so it was converted to years in R for consistency. The data also had 76 rows of offence counts, which were summed to find the total offences per year. The table was then transposed with the t() function so that years run down the rows, consistent with the other datasets. All quarters and offence types that were not required were then dropped from the table. For Statista, the 2018 figure was added from the source named in the details section of the web page, for uniformity with the other datasets. The growth percentages were then converted into population in millions, using the 2006 population (in millions) from Irish population analysis (2019) as the starting value; these calculations were done in the Excel file. Data cleaning, such as removing extra spaces and columns, was then done in RStudio. In the third dataset, from Databank, cleaning and formatting were done in RStudio for consistency. In the DBEI dataset, columns were renamed, extra columns and spaces were removed, and some null values were removed. The Kaggle dataset contained many null values; to obtain the desired year (2018) for the BI query, unwanted rows and columns were removed. The is.na(x) function was used for this, and the not-equal operator (!=) was used in the condition for removing unwanted rows. From the table we have extracted crime
  • 19. records for the year 2018 by removing the rows for unwanted years in RStudio. Extra spaces and columns were then removed from the table for consistency across the datasets.
The challenging task was to scrape data from the Numbeo website: each page represents one county and data was available for 15 counties, so 15 pages had to be scraped. This was made easier by R packages such as rvest, magrittr, RSelenium, httr, dplyr and data.table. A function was used to extract the data for the 15 counties, and the data was selected from the website using XPath. All the pages were then appended into one data frame, and unwanted extra columns and rows were deleted. For consistency, county names were renamed to match the other datasets.
For the Wikipedia scraping, population data was extracted using htmltab and data.table. The dataset contains 6 columns; the unwanted columns (rank, density, traditional province and change since the previous census) were dropped, and the remaining columns were renamed for consistency across the datasets. All the above changes were made in RStudio, and the code used to produce the results is given in the Appendix.
Indexing was done in SQL Server Management Studio (SSMS) for the database, which was named Ireland Crime for this project. For each table, I chose an attribute with a unique value in every row and assigned it as the primary key of that table. By default, SSMS assigns varchar(50) as the data type of each attribute. For year, which is a numeric value, I changed the data type to int (integer).
6.3 Loading:
The last step in the ETL process is loading the transformed data into the end target.
Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly or monthly basis Extract, transform, load (2019). After creating the Ireland Crime database, all the data has to be loaded into SSMS through SSIS in a staging area. A data flow task was used for this. To load the data, a flat file source component reads the data from a CSV file, and an OLE DB destination component creates the table and loads the data into SSMS. In the flat file source, we give the name of the table to be created in SSMS and the transformed CSV output file. We then set the text qualifier to inverted commas. The preview section shows the file in tabular form. The Advanced tab in SSIS provides options to change the data types, which is important for loading the data correctly into SSMS. In the connection manager of the OLE DB destination, we can write a SQL query to create the new table in SSMS, or SSIS suggests a suitable query based on the flat file source. This query can be modified from the new tab in the OLE DB destination component, and any errors are shown in the error section. In all, I had 7 flat files, so 7 data flow tasks were used.
After the files are transferred to SSMS, the dimension tables have to be created. An Execute SQL task is used in this step. A SQL script written in the task itself creates the two dimension tables from the raw tables created by the data flow tasks. The dimension tables provide the context for the fact tables and for all the measurements presented in the data warehouse. Although dimension tables are usually
  • 20. much smaller than fact tables, they are the heart of the data warehouse because they provide entry points to the data Kimball & Ross (2011). Next, another SQL task was created to populate the fact table. A SQL script was written to insert the data from the raw tables using inner joins on the dimension tables. This was quite challenging: populating the fact table with the proper values and joining the tables correctly required careful thought. After considering the various types of join, the inner join proved the most suitable for my data. After the fact table was populated, the cube was deployed and the star schema was obtained. I connected this cube with SSIS for automation. After successful deployment, the desired star schema appeared, and the values in the cube were checked using the explore option. With all the values in the cube confirmed, the data was ready to be visualized in a visualization tool, Tableau for this project.
7 Application
For a better understanding, the Business Intelligence queries, as per the business requirements discussed in Section 1, are visualized with the help of Tableau by mapping the 3 entities separately in a graph to reveal the pattern in the data.
7.1 BI Query 1: Does unemployment in the country have any effect on the increase or decrease in the crime rate and population of Ireland?
To answer this query, it must be understood how unemployment has affected Ireland in recent years. For that, the data should run from 2007 to the present. The factors taken into consideration are as follows:
- Unemployment in Ireland among working adults (men and women) in the last decade (2007-2017)
- Total population across the last decade (2007-2017)
- Total number of offences in the last decade (2007-2017)
1.
Unemployment in Ireland among working adults: The unemployment data was extracted from The World Bank, which gives complete details of unemployment among citizens who have received basic education and among those who have received advanced education, across the whole labor force of Ireland. This helps in identifying the non-working population from 2007 to 2018 and in understanding any variation in unemployment across the years 2007 to 2018.
2. Total population across the last decade: The cumulative population was extracted from Wikipedia for the years 2007 to 2017 to understand the growth of the population of Ireland. This shows by how much the population has increased or decreased, so that it can be compared with the total number of crimes and the unemployment in Ireland.
3. Total number of offences in the decade: This dataset comes from the Central Statistics Office and was used as a structured dataset containing all the offences recorded in the past decade (2007 to 2017). It helps to understand the number of crimes in relation to the population and unemployment in Ireland. The data was extracted, presented and
  • 21. visualized with the help of Tableau. Below is the visualization obtained from the first BI query.
Figure 9: 1st BI Query
7.2 BI Query 2: Does an increase in the population of a county affect the number of crimes registered in the Garda stations of that county?
To understand this, the following datasets were used:
- Garda station crime records (2018)
- Population of Ireland by county (2018)
1. Garda station crime records: The dataset was taken from Kaggle and covers all the Garda stations of Ireland, county by county, that have criminal cases registered for the year 2018. This data is useful for understanding the crimes registered across the counties, which are then compared with the population of the counties of Ireland.
2. Population of Ireland by county: This dataset was scraped from Wikipedia, which has been updated with the county-wise population of Ireland for the year 2018. It helps to understand the pattern of population density across Ireland.
The data was extracted, presented and visualized with the help of Tableau. Below is the visualization obtained from the second BI query.
  • 22. Figure 10: 2nd BI Query
7.3 BI Query 3: Does immigration into the country have any effect on the increase or decrease in the crime rate in Ireland?
Immigration helps a country to develop in fields such as technology, science and infrastructure. It is nevertheless a matter of debate among residents, as some people think that immigration takes jobs away from the local population. This can be tested against the available facts and data. To verify the validity of this point, three datasets were used:
- Population of Ireland by county
- Crimes in Ireland by county
- Employment permits provided to immigrants in every county
1. Population of Ireland by county: Earlier, the population was considered year by year for the whole country. Here, the population of each county is used as a comparative factor for the crimes occurring in Ireland. This dataset was scraped from Wikipedia, which has been updated with the county-wise population of Ireland for the year 2018. It helps to understand the pattern of population density across Ireland.
2. Crimes in Ireland by county: The dataset for crimes in Ireland was scraped from Numbeo.com, which covers a wide range of crimes committed in Ireland in the year 2018. This data helps to break crime down for a specific year and to understand its variation at county level.
3. Employment permits provided to immigrants at county level for 2018: The dataset gives the details of the permits given to immigrants in Ireland. This dataset will help to understand how many permits were granted by the Irish government for every
  • 23. county, thus giving an overall picture of immigration. The data was extracted, presented and visualized with the help of Tableau. Below is the visualization obtained from the third BI query.
Figure 11: 3rd BI Query
These BI queries are detailed further in the Discussion section.
7.4 Discussion
From the 1st BI query, it can be seen that from 2007 to 2010 there was a steep increase in the unemployment rate in Ireland, accompanied by an increase in crime rates. This can be linked to the global recession Great Recession (n.d.), which also affected Ireland during 2007-2010 and caused widespread job losses. For 2010-2014 there is no marked change in unemployment or crimes committed, but from 2015 to 2018 the unemployment rate fell sharply and the number of crimes committed fell with it, so it can be concluded that the crime rate and unemployment move in the same direction. The population increased steadily from 2007 to 2018, but there does not seem to be any relation between the overall population of the country and crimes committed.
This is explored further in the 2nd BI query, where the population of each county is compared with the crimes registered at the Garda stations in that county. The datasets show that as county population increases, the number of crimes registered in the respective county also increases. Thus, there is a relation between the population of a county and the crimes committed or registered there.
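The relationship between unemployment and crime discussed for the 1st BI query was read from the Tableau chart. As a complementary check, the strength of such a relationship could be quantified with a Pearson correlation coefficient. The sketch below is only an illustration: the yearly values are made-up placeholders, not the project's data, and the column names are assumptions.

```r
# Illustrative yearly values only; not the figures behind the Tableau chart.
yearly <- data.frame(
  Year            = 2007:2012,
  UnemploymentPct = c(5, 7, 12, 14, 15, 15),
  TotalOffences   = c(100, 110, 130, 140, 150, 145)
)

# cor() computes the Pearson correlation coefficient by default.
r <- cor(yearly$UnemploymentPct, yearly$TotalOffences)
print(r)
```

A coefficient close to 1 would be consistent with the two series rising and falling together, while a value near 0 would suggest no linear relationship.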
  • 24. The 3rd BI query deals with the effect of immigration on the crime rate in Ireland. When the datasets were mapped in Tableau, an interesting observation was that Dublin, the capital of Ireland, has the highest number of immigration filings, yet the crimes committed in the county are fewer. Compare this with Donegal, which has the highest number of criminal cases registered even though few immigration cases are filed there: this indicates that there is no significant relation between immigration and the crimes reported in Ireland. On the contrary, it can be inferred that, owing to the greater security in Dublin, crime rates there are lower than in Donegal, which is a remote county and is less secure than Dublin and developed counties such as Cork, Galway and Limerick.
8 Conclusion and Future Aspects
At the start of the project, the premise was that although crime rates in Ireland are low, there is still a small amount of unrest that keeps Ireland from being one of the least corrupt countries in the world. With crime as one of the factors, the sub-factors that affect crime were examined and verified against the data. Unemployment, immigration and population growth were examined as possible drivers of criminal activity in Ireland. Unemployment and population influenced the crime rate, but immigration was not found to contribute to its rise. This shows that even though Ireland has its own problems to deal with, such as border issues with Northern Ireland or business continuity with Britain after Brexit, it is still a peace-loving country that encourages a positive attitude towards immigration and development.
This project is helpful for future work, as it deals with criminal analytics. Every country is affected by crime and Ireland is no different. Recently, UK police collaborated with Accenture on criminal analytics to understand the patterns of crime in the UK.
This helped the UK considerably, as the system could give rough information about the culture of criminals and identify how and when attacks would occur in disturbed areas. This project is termed The Enterprise Approach to Law Enforcement Accenture Police Center of Excellence (n.d.). The same concept can be implemented in Ireland and can pave the way towards a safer, crime-free country.
References
Accenture Police Center of Excellence (n.d.). URL: https://www.accenture.com/gb-en/insight-enterprise-approach-to-law-enforcement
Bacon, P. & O’Donoghue, M. (1975), ‘The economics of crime in the Republic of Ireland: An exploratory paper’, Economic and Social Review 7(1), 19.
Barrett, A. & Kelly, E. (2012), ‘The impact of Ireland’s recession on the labour market outcomes of its immigrants’, European Journal of Population/Revue européenne de Démographie 28(1), 91–111.
Barrett, A. & McGuiness, S. (2012), ‘The Irish labour market and the great recession’, CESifo DICE Report 10(2), 27–33.

Biswas, N., Chattapadhyay, S., Mahapatra, G., Chatterjee, S. & Mondal, K. C. (2019), ‘A new approach for conceptual extraction-transformation-loading process modeling’, International Journal of Ambient Computing and Intelligence (IJACI) 10(1), 30–45.

Chhabra, R. & Pahwa, P. (2014), ‘Data mart designing and integration approaches’, International Journal of Computer Science and Mobile Computing 3(4), 74–79.

CSO Review of the Quality of Crime Statistics 2016 (2016).
URL: http://www.cso.ie/en/media/csoie/releasespublications/documents/crimejustice/2016/reviewo

Extract, transform, load (2019).
URL: https://en.wikipedia.org/wiki/Extract,_transform,_load#Transform

George, S. (2019), ‘Inmon or Kimball: Which approach is suitable for your data warehouse? 2019’.

Great Recession (n.d.).
URL: https://en.wikipedia.org/wiki/Great_Recession

Irish population analysis (2019).
URL: https://en.wikipedia.org/wiki/Irish_population_analysis

Kimball, R. & Ross, M. (2011), The data warehouse toolkit: the complete guide to dimensional modeling, John Wiley & Sons.

Paul, T. (2016).
URL: https://www.cliffsnotes.com/study-guides/criminal-justice/crime/definitions-of-crime

Scott, J. & Marshall, G. (2009), A dictionary of sociology, OUP Oxford.

What Is Data Warehousing? Types, Definition Example (n.d.).
URL: https://www.guru99.com/data-warehousing.html

Appendix

R code used

Cleaning and Extraction:

1. Numbeo Web Scraping Code

# Install packages
install.packages('rvest', repos = "https://cran.rstudio.com")
install.packages('magrittr', repos = "https://cran.rstudio.com")
install.packages('RSelenium', repos = "https://cran.rstudio.com")
install.packages('httr', repos = "https://cran.rstudio.com")
install.packages('dplyr', repos = "https://cran.rstudio.com")
# dplyr is the next iteration of plyr, focussed on tools for working with data
install.packages('data.table', repos = "https://cran.rstudio.com")

# Load packages
library('rvest')
library('magrittr')
library('RSelenium')
library('httr')
library('dplyr')
library('data.table')

# Define the web scraping function, reused for multiple pages
get_countdata <- function(keyword) {
  url <- paste('https://www.numbeo.com/crime/in/', keyword, sep = "")
  # Read the HTML code from the website
  webpage <- read_html(url)
  # Get the name column (nodes with class .columnWithName)
  County_crime <- html_nodes(webpage, '.columnWithName')
  # Convert the name nodes to text
  Countyname_data <- html_text(County_crime)
  # Have a look at the first entries
  head(Countyname_data)
  # Get the index values (nodes with class .indexValueTd)
  Crimenumber_data <- html_nodes(webpage, '.indexValueTd')
  # Convert the index value nodes to text
  Crimenumber_data <- html_text(Crimenumber_data)
  # Have a look at the first entries
  head(Crimenumber_data)
  # Combine all lists to form a data frame
  Crimecountywise_df <- data.frame(Name = Countyname_data, NumberofRatings =
  # Add new variables and preserve existing ones
  Crimecountywise_df <- Crimecountywise_df %>% mutate(County_Name = keyword)
}

# Define keywords for web scraping
url <- get_countdata('Galway')
url1 <- get_countdata("Cork")
url2 <- get_countdata("Sligo-Ireland")
url3 <- get_countdata("Kilkenny")
url4 <- get_countdata("Limerick")
url5 <- get_countdata("Maynooth-Ireland")
url6 <- get_countdata("Waterford")
url7 <- get_countdata("Athlone")
url8 <- get_countdata('Mullingar-Ireland')
url9 <- get_countdata("Dublin")
url10 <- get_countdata("Wexford-Ireland")
url11 <- get_countdata("Carlow-Ireland")
# Note: same keyword as url10 in the source listing
url12 <- get_countdata("Wexford-Ireland")
url13 <- get_countdata("Donegal-Ireland")
url14 <- get_countdata("Ennis-Ireland")

# Transpose all data
url <- transpose(url, ignore.empty = FALSE)
url1 <- transpose(url1, ignore.empty = FALSE)
url2 <- transpose(url2, ignore.empty = FALSE)
url3 <- transpose(url3, ignore.empty = FALSE)
url4 <- transpose(url4, ignore.empty = FALSE)
url5 <- transpose(url5, ignore.empty = FALSE)
url6 <- transpose(url6, ignore.empty = FALSE)
url7 <- transpose(url7, ignore.empty = FALSE)
url8 <- transpose(url8, ignore.empty = FALSE)
url9 <- transpose(url9, ignore.empty = FALSE)
url10 <- transpose(url10, ignore.empty = FALSE)
url11 <- transpose(url11, ignore.empty = FALSE)
url12 <- transpose(url12, ignore.empty = FALSE)
url13 <- transpose(url13, ignore.empty = FALSE)
url14 <- transpose(url14, ignore.empty = FALSE)

# Append all data frames
v <- rbind(url, url1, url2, url3, url4, url5, url6, url7,
           url8, url9, url10, url11, url12, url13, url14)

# Replicate first column
n = 1
v1 = cbind(v, replicate(n, v$V1))
# Position the replicated column
v2 <- v1[, c(1,16,2:15)]

# Drop extra rows
v2 <- v2[-c(4,7,10,13,16,19,22,25,28,31,34,37,40,43),]

# Rename row names
v2[1,1] <- 'County'
v2[2,1] <- 'Galway'
v2[4,1] <- 'Cork'
v2[6,1] <- 'Sligo'
v2[8,1] <- 'Kilkenny'
v2[10,1] <- 'Limerick'
v2[12,1] <- 'Mayo'
v2[14,1] <- 'Waterford'
v2[16,1] <- 'Longford'
v2[18,1] <- 'Offaly'
v2[20,1] <- 'Dublin'
v2[22,1] <- 'Wexford'
v2[24,1] <- 'Carlow'
v2[26,1] <- 'Kerry'
v2[28,1] <- 'Donegal'
v2[30,1] <- 'Leitrim'

# Remove extra rows
v2 <- v2[-c(3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),]

# Convert whole data into character
v2[] <- lapply(v2, as.character)

# Use the first row as the header
colnames(v2) <- v2[1, ]

# Drop first row
v2 <- v2[-1, ]
str(v2)

# Convert data into numeric using lapply (function over a list)
v2[] <- lapply(v2, as.numeric)

# Write formatted file
write.csv(v2, file = "C:/Users/MOLAP/Desktop/R Project/Updated/Numbio_Data_Sc
row.names=FALSE)

# Remove the extra row while reading the file
Numbio_Final <- read.csv("C:/Users/MOLAP/Desktop/R Project/Updated/Numbio_Data_Scrapping_OP.csv", skip = 1)
#summary(Numbio_Final)

# Sum of all crime indicators to find the total number of crimes
Numbio_Final$Total <- as.numeric(Numbio_Final$Level.of.crime) +
  as.numeric(Numbio_Final$Crime.increasing.in.the.past.3.years) +
  as.numeric(Numbio_Final$Worries.home.broken.and.things.stolen) +
  as.numeric(Numbio_Final$Worries.being.mugged.or.robbed) +
  as.numeric(Numbio_Final$Worries.car.stolen) +
  as.numeric(Numbio_Final$Worries.things.from.car.stolen) +
  as.numeric(Numbio_Final$Worries.attacked) +
  as.numeric(Numbio_Final$Worries.being.insulted) +
  as.numeric(Numbio_Final$Worries.being.subject.to.a.physical.attack.because.of.your.skin.colour..ethnic.origin.or.religion) +
  as.numeric(Numbio_Final$Problem.people.using.or.dealing.drugs) +
  as.numeric(Numbio_Final$Problem.property.crimes.such.as.vandalism.and.theft) +
  as.numeric(Numbio_Final$Problem.violent.crimes.such.as.assault.and.armed.robbery) +
  as.numeric(Numbio_Final$Problem.corruption.and.bribery) +
  as.numeric(Numbio_Final$Safety.walking.alone.during.daylight) +
  as.numeric(Numbio_Final$Safety.walking.alone.during.night)
colnames(Numbio_Final)

# Rename column names (removed extra spaces)
setnames(Numbio_Final, old = c("County","Level.of.crime","Crime.increasing.in.
colnames(Numbio_Final)

# Convert data to numeric (County stays character)
Numbio_Final$County <- as.character(Numbio_Final$County)
Numbio_Final$Level_of_crime <- as.numeric(Numbio_Final$Level_of_crime)
Numbio_Final$past_3_years <- as.numeric(Numbio_Final$past_3_years)
Numbio_Final$things_stolen <- as.numeric(Numbio_Final$things_stolen)
Numbio_Final$mugged_or_robbed <- as.numeric(Numbio_Final$mugged_or_robbed)
Numbio_Final$car_stolen <- as.numeric(Numbio_Final$car_stolen)
Numbio_Final$things_from_car_stolen <- as.numeric(Numbio_Final$things_from_car_stolen)
Numbio_Final$attacked <- as.numeric(Numbio_Final$attacked)
Numbio_Final$being_insulted <- as.numeric(Numbio_Final$being_insulted)
Numbio_Final$physical_attack <- as.numeric(Numbio_Final$physical_attack)
Numbio_Final$dealing_drugs <- as.numeric(Numbio_Final$dealing_drugs)
Numbio_Final$vandalism_and_theft <- as.numeric(Numbio_Final$vandalism_and_theft)
Numbio_Final$assault_and_armed_robbery <- as.numeric(Numbio_Final$assault_and_armed_robbery)
Numbio_Final$corruption_and_bribery <- as.numeric(Numbio_Final$corruption_and_bribery)
Numbio_Final$Safety_during_daylight <- as.numeric(Numbio_Final$Safety_during_daylight)
Numbio_Final$Safety_during_night <- as.numeric(Numbio_Final$Safety_during_night)
Numbio_Final$Total_Crime_Record <- as.numeric(Numbio_Final$Total_Crime_Record)
str(Numbio_Final)

# Records are for 2018, so add a Year column
Numbio_Final$Year <- "2018"

# Position the columns
Numbio_Final <- Numbio_Final[, c(18,1:17)]

# Write final output file
write.csv(Numbio_Final, file = "C:/Users/MOLAP/Desktop/R Project/Updated/Numbio_Data_Scrapping_OP.csv", row.names=FALSE)

2. Wikipedia Web Page Scraping Code

# Data scraping for the project from Wikipedia
# Install packages
install.packages("htmltab", repos = "https://cran.rstudio.com")
install.packages('data.table', repos = "https://cran.rstudio.com")

# Load libraries
library("htmltab")
library('data.table')

# Read URL
url <- "https://en.wikipedia.org/wiki/List_of_Irish_counties_by_population"
wiki <- htmltab(doc=url, which=1)

# Drop column
wiki <- wiki[-c(1)]

# Drop extra rows
wiki <- wiki[-c(10,36,2,4,8,37,6,13,31,37),]

# Rename column
wiki <- setnames(wiki, old=c("Administrative county"), new=c("County"))

# Add extra column
wiki$Year <- "2018"

# Rename column names
setnames(wiki, old=c("County","Population", "Density (/ k m )", "Traditional province","Change since previous census")
Province","Change_since_previous_census"))

# Drop columns
wiki <- wiki[-c(3,4,5)]

# Position the columns
wiki <- wiki[, c(3,1:2)]

# Remove extra characters (comma)
wiki$Population <- gsub(",", "", wiki$Population)

# Write the final output data to a csv file
write.csv(wiki, file = "Wiki_Population_OP.csv", row.names=FALSE)

3. The World Bank - Unemployment Percentage, Basic and Advanced Education

# Install packages
install.packages('openxlsx', repos = "https://cran.rstudio.com")
install.packages('data.table', repos = "https://cran.rstudio.com")

# Load libraries
library('openxlsx')
library('data.table')
setwd("C:/Users/MOLAP/Desktop/R Project/Updated")

# Read file
dataUnemployment <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/Unempl

# Remove rows
dataUnemployment <- dataUnemployment[-c(1:3),]

# Rename columns
dataUnemployment1 <- setnames(dataUnemployment, old=c("Series.Name","Labor.for
"Unemployment.with.basic.education,.male.(%.of.male.labor.force.with.basic.education)","Unemployment.with.basic.
(%.of.female.labor.force.with.basic.education)",
"Unemployment.with.advanced.education,.female.(%.of.female.labor.force.with.advanced.education)","Unemployment.with.advance
(%.of.male.labor.force.with.advanced.education)"), new=c("Year","Total_Labor_
"Unemployment_Basic_Education_Female_Percentage","Unemployment_Advance_Educat
,"Unemployment_Advance_Education_Male_Percentage"))

# Convert to numeric
dataUnemployment1$Total_Labor_Force_in_Million <- as.numeric(dataUnemployment1$Total_Labor_Force_in_Million)
dataUnemployment1$Unemployment_Basic_Education_Male_Percentage <- as.numeric(dataUnemployment1$Unemployment_Basic_Education_Male_Percentage)
dataUnemployment1$Unemployment_Basic_Education_Female_Percentage <- as.numeric(dataUnemployment1$Unemployment_Basic_Education_Female_Percentage)
dataUnemployment1$Unemployment_Advance_Education_Female_Percentage <- as.numeric(dataUnemployment1$Unemployment_Advance_Education_Female_Percentage)
dataUnemployment1$Unemployment_Advance_Education_Male_Percentage <- as.numeric(dataUnemployment1$Unemployment_Advance_Education_Male_Percentage)

# Round to 2 decimal places
dataUnemployment1$Total_Labor_Force_in_Million <- round(dataUnemployment1$Total_Labor_Force_in_Million, 2)
dataUnemployment1$Unemployment_Basic_Education_Male_Percentage <- round(dataUnemployment1$Unemployment_Basic_Education_Male_Percentage, 2)
dataUnemployment1$Unemployment_Basic_Education_Female_Percentage <- round(dataUnemployment1$Unemployment_Basic_Education_Female_Percentage, 2)
dataUnemployment1$Unemployment_Advance_Education_Female_Percentage <- round(dataUnemployment1$Unemployment_Advance_Education_Female_Percentage, 2)
dataUnemployment1$Unemployment_Advance_Education_Male_Percentage <- round(dataUnemployment1$Unemployment_Advance_Education_Male_Percentage, 2)

# Rename the year fields in the first column
dataUnemployment1[1,1] <- "2007"
dataUnemployment1[2,1] <- "2008"
dataUnemployment1[3,1] <- "2009"
dataUnemployment1[4,1] <- "2010"
dataUnemployment1[5,1] <- "2011"
dataUnemployment1[6,1] <- "2012"
dataUnemployment1[7,1] <- "2013"
dataUnemployment1[8,1] <- "2014"
dataUnemployment1[9,1] <- "2015"
dataUnemployment1[10,1] <- "2016"
dataUnemployment1[11,1] <- "2017"
dataUnemployment1[12,1] <- "2018"

# Write the final output data to a csv file
write.csv(dataUnemployment1, file = "Unemployment_Gender_Statistics_OP.csv", row.names=FALSE)

4. Statista - Population Growth Ireland

# Statista data in R
# Install packages
install.packages("openxlsx", repos = "https://cran.rstudio.com")
install.packages("data.table", repos = "https://cran.rstudio.com")

# Load libraries
library('openxlsx')
library('data.table')

# Read raw file
dataStatista <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/population-growth-in-ireland-2018.xlsx", sheet = 2, startRow = 1, colNames = T, skipEmptyRows = TRUE)

# Remove blank rows
dataStatista <- dataStatista[-c(1:2),]

# Drop a column
dataStatista <- dataStatista[,-3]

# Convert data to numeric
dataStatista$X4 <- as.numeric(dataStatista$X4)
dataStatista$X2 <- as.numeric(dataStatista$X2)

# Round to 3 decimal places
dataStatista$X4 <- round(dataStatista$X4, 3)
dataStatista$X2 <- round(dataStatista$X2, 3)

# Rename columns
setnames(dataStatista, old=c("Population.growth.in.Ireland.2017", "X2", "X4"), new=c("Year", "Percentage_Rise", "Population_in_MIllions"))
dataStatista[1,1] <- "2007"
dataStatista[2,1] <- "2008"
dataStatista[3,1] <- "2009"
dataStatista[4,1] <- "2010"
dataStatista[5,1] <- "2011"
dataStatista[6,1] <- "2012"
dataStatista[7,1] <- "2013"
dataStatista[8,1] <- "2014"
dataStatista[9,1] <- "2015"
dataStatista[10,1] <- "2016"
dataStatista[11,1] <- "2017"
dataStatista[12,1] <- "2018"

# Convert data to numeric
dataStatista$Percentage_Rise <- as.numeric(dataStatista$Percentage_Rise)
dataStatista$Population_in_MIllions <- as.numeric(dataStatista$Population_in_MIllions)
str(dataStatista)

# Write the final output data to a csv file
write.csv(dataStatista, file = "Population_Growth_Statista_OP.csv", row.names=FALSE)

5. Central Statistics Office Ireland - Crime Recorded in Ireland

# Install packages
install.packages('openxlsx', repos = "https://cran.rstudio.com")
install.packages('data.table', repos = "https://cran.rstudio.com")

# Load libraries
library('openxlsx')
library('data.table')
setwd("C:/Users/MOLAP/Desktop/R Project/Updated")

# Read raw file
datacrimeyearwise <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/Crime

# Rename first column name
datacrimeyearwise[1,1] <- "Year"

# Convert quarters into years (sum of the four quarters of each year)
datacrimeyearwise$Y2007 <- as.numeric(datacrimeyearwise$X2) + as.numeric(datac
(datacrimeyearwise$X4) + as.numeric(datacrimeyearwise$X5)
datacrimeyearwise$Y2008 <- as.numeric(datacrimeyearwise$X6) + as.numeric(data
(datacrimeyearwise$X8) + as.numeric(datacrimeyearwise$X9)
datacrimeyearwise$Y2009 <- as.numeric(datacrimeyearwise$X10) + as.numeric(da
(datacrimeyearwise$X12) + as.numeric(datacrimeyearwise$X13)
datacrimeyearwise$Y2010 <- as.numeric(datacrimeyearwise$X14) + as.numeric(da
(datacrimeyearwise$X16) + as.numeric(datacrimeyearwise$X17)
datacrimeyearwise$Y2011 <- as.numeric(datacrimeyearwise$X18) + as.numeric(da
(datacrimeyearwise$X20) + as.numeric(datacrimeyearwise$X21)
datacrimeyearwise$Y2012 <- as.numeric(datacrimeyearwise$X22) + as.numeric(da
(datacrimeyearwise$X24) + as.numeric(datacrimeyearwise$X25)
datacrimeyearwise$Y2013 <- as.numeric(datacrimeyearwise$X26) + as.numeric(da
(datacrimeyearwise$X28) + as.numeric(datacrimeyearwise$X29)
datacrimeyearwise$Y2014 <- as.numeric(datacrimeyearwise$X30) + as.numeric(da
(datacrimeyearwise$X32) + as.numeric(datacrimeyearwise$X33)
datacrimeyearwise$Y2015 <- as.numeric(datacrimeyearwise$X34) + as.numeric(da
(datacrimeyearwise$X36) + as.numeric(datacrimeyearwise$X37)
datacrimeyearwise$Y2016 <- as.numeric(datacrimeyearwise$X38) + as.numeric(da
(datacrimeyearwise$X40) + as.numeric(datacrimeyearwise$X41)
datacrimeyearwise$Y2017 <- as.numeric(datacrimeyearwise$X42) + as.numeric(da
(datacrimeyearwise$X44) + as.numeric(datacrimeyearwise$X45)
datacrimeyearwise$Y2018 <- as.numeric(datacrimeyearwise$X46) + as.numeric(da
(datacrimeyearwise$X48) + as.numeric(datacrimeyearwise$X49)

# Label the new yearly columns after converting quarters into years
datacrimeyearwise[1,50] <- "2007"
datacrimeyearwise[1,51] <- "2008"
datacrimeyearwise[1,52] <- "2009"
datacrimeyearwise[1,53] <- "2010"
datacrimeyearwise[1,54] <- "2011"
datacrimeyearwise[1,55] <- "2012"
datacrimeyearwise[1,56] <- "2013"
datacrimeyearwise[1,57] <- "2014"
datacrimeyearwise[1,58] <- "2015"
datacrimeyearwise[1,59] <- "2016"
datacrimeyearwise[1,60] <- "2017"
datacrimeyearwise[1,61] <- "2018"

# Transpose data
final_df <- as.data.frame(t(datacrimeyearwise))
str(final_df)
#colnames(final_df) = final_df[1, ]

# Convert whole data into character
final_df[] <- lapply(final_df, as.character)

# Use the first row as the header
colnames(final_df) <- final_df[1, ]

# Drop first row
final_df <- final_df[-1, ]
str(final_df)

# Convert data into numeric using lapply (function over a list)
final_df[] <- lapply(final_df, as.numeric)

# Remove extra rows
final_df <- final_df[-c(1:48),]

# Sum of all offences to make total offences
crimeoutput <- cbind(final_df, Total_Offences = rowSums(final_df))

# Remove extra columns
crimeoutput <- crimeoutput[-c(2:76)]

# Check column names
colnames(crimeoutput)

# Check structure
str(crimeoutput)

# Write final output file
write.csv(crimeoutput, file = "Crime_Rate_Yearwise_OP.csv", row.names=FALSE)

6. Kaggle - Crimes at Ireland Garda Stations 2007-2017

# Install packages
install.packages("openxlsx", repos = "https://cran.rstudio.com")
install.packages("data.table", repos = "https://cran.rstudio.com")

# Load libraries
library('openxlsx')
library('data.table')

crimerecord <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/Crime_Year_
crimerecord <- crimerecord[-c(2:11,14:23,26:35,38:47,50:59,62:71,74:83,86:95,98

# Copy cell contents from one cell to another (fill NA with the previous value)
crimerecord[] <- lapply(crimerecord, function(x) {
  i1 <- which(is.na(x))
  replace(x, i1, x[i1-1])
})

# Delete rows by condition to get single-year data
crimerecord <- crimerecord[!(crimerecord$Year == "2007"),]

# Rename row names to make the data consistent
crimerecord[4,2] <- "Dublin"
crimerecord <- setnames(crimerecord, old=c('id','Station','Divisions','Year','Number.of.Crime.Record'), new=c('Station_ID','County','Station_Division','Year','Total_Crime_Record'))

# Position the columns
crimerecord <- crimerecord[, c(1,4,2,3,5)]

# Write final output file
write.csv(crimerecord, file = "Crime_Record_OP.csv", row.names=FALSE)

7. Department of Business, Enterprise and Innovation - 4. Employment Permit Statistics

# Install packages
install.packages("openxlsx", repos = "https://cran.rstudio.com")
install.packages("data.table", repos = "https://cran.rstudio.com")

# Load libraries
library('openxlsx')
library('data.table')

# Read raw file
permitData <- read.xlsx("C:/Users/MOLAP/Desktop/R Project/Updated/Employee Permit Countywise.xlsx", sheet = 1, startRow = 1, colNames = T, skipEmptyRows = TRUE)

# Rename for column name
permitData[1,2] = "Total_Work_Permit"

# Drop first row and first column
permitData <- permitData[-1,-1]

# Add extra column
permitData$Year <- "2018"

# Position the columns
permitData <- permitData[, c(7,1:6)]

# Rename column names
permitData <- setnames(permitData, old=c("Year","County/Country","Total"), new=

# Convert data to numeric (typecasting)
permitData$Year <- as.numeric(permitData$Year)
permitData$New <- as.numeric(permitData$New)
permitData$Renewal <- as.numeric(permitData$Renewal)
permitData$Total_Permit <- as.numeric(permitData$Total_Permit)
permitData$Refused <- as.numeric(permitData$Refused)
permitData$Withdrawn <- as.numeric(permitData$Withdrawn)

# Write final output file
write.csv(permitData, file = "Employee_Permitdata_OP.csv", row.names=FALSE)
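
The column-by-column as.numeric() conversions repeated throughout this appendix could probably be written more compactly. The following is only a minimal sketch of that idea, not part of the original scripts: the helper name to_numeric_cols and the toy data frame permit_demo (with made-up values) are assumptions introduced for illustration.

```r
# Hypothetical helper (not in the original scripts): convert the named
# columns of a data frame to numeric in a single step.
to_numeric_cols <- function(df, cols) {
  # Only touch columns that actually exist in the data frame
  cols <- intersect(cols, names(df))
  # Go through character first so that factor columns convert by label
  df[cols] <- lapply(df[cols], function(x) as.numeric(as.character(x)))
  df
}

# Toy data standing in for the permit table; the values are made up
permit_demo <- data.frame(
  County = c("Dublin", "Cork"),
  New = c("10", "4"),
  Renewal = c("3", "1"),
  stringsAsFactors = FALSE
)

# "Refused" is absent from the toy data and is silently skipped
permit_demo <- to_numeric_cols(permit_demo, c("New", "Renewal", "Refused"))
str(permit_demo)
```

Under these assumptions, a call such as to_numeric_cols(permitData, c("Year", "New", "Renewal")) would stand in for several of the individual assignment lines, and the same helper could be reused for the Numbeo and World Bank tables.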