Task: To develop a data warehouse from multiple structured and unstructured sources of data and implement a minimum of three non-trivial business intelligence queries on the data warehouse with the help of visualizations.
Approach: Created a data warehousing project for the Data Warehousing and Business Intelligence module, based on entities affecting the rate of crime in the Republic of Ireland. Built the data warehouse and an automated cube that refreshes the data periodically. Used R to clean the data, SQL Server as the database, SSAS to create the data cube so that users get proper insight into the factors behind crime, and Tableau for the reports.
Tools: RStudio, SQL Server, SSIS, SSAS, Tableau
VIDEO Description: https://www.youtube.com/watch?v=uRdyZQja66M&t=134s
DWBI - Criminalytics: Entities affecting the Rate of Crime in Republic of Ireland
Data Warehousing and Business Intelligence Project on
Criminalytics: Entities affecting the Rate of Crime in Republic of Ireland
Shrikant Uday Samarth
x18129137
MSc/PGDip Data Analytics – 2019/20
Submitted to: Sean Heeney
National College of Ireland
Project Submission Sheet – 2017/2018
School of Computing
Student Name: Shrikant Uday Samarth
Student ID: x18129137
Programme: MSc Data Analytics
Year: 2019/20
Module: Data Warehousing and Business Intelligence
Lecturer: Dr. Simon Caton
Submission Due Date: 12/04/2019
Project Title: Criminalytics: Entities affecting the Rate of Crime in Republic of Ireland
Project Check List
This section captures the core requirements that the project entails, represented as a
check list for convenience.
Used LaTeX template
Three Business Requirements listed in introduction
At least one structured data source
At least one unstructured data source
At least three sources of data
Described all sources of data
All sources of data are less than one year old, i.e. released after 17/09/2017
Inserted and discussed star schema
Completed logical data map
Discussed the high level ETL strategy
Provided 3 BI queries
Detailed the sources of data used in each query
Discussed the implications of results in each query
Reviewed at least 5-10 appropriate papers on topic of your DWBI project
Criminalytics: Different entities affecting the Rate of Crime in Republic of Ireland
Shrikant Uday Samarth
x18129137
April 12, 2019
Abstract
The aim of this paper is to understand the spatial patterns of crimes committed in
Ireland and to develop a better understanding of how factors such as migration,
population and unemployment contribute to rises or falls in the crime rate. According
to the research literature, there have been complaints of various types of violence
and crime in recent years Bacon & O’Donoghue (1975). To address the concern of public
safety, this Data Warehouse and Business Intelligence model has been developed from
several data sources. The analysis draws on a wide range of sources, including
Statista, the Central Statistics Office, Numbeo and others, which helped in
understanding how the factors above relate to the crime rate in the Republic of
Ireland. These findings can be used by the Garda Síochána, the Irish police service,
to understand crime patterns and anticipate increases in crime rates in a specific
county by studying these factors.
1 Introduction
Every country defines laws to maintain order. A law comprises a set of rules that help
both those who enforce it and those who follow it trust that there is harmony and
well-being for everyone. When such laws are broken, there is discomfort, and that
discomfort can be defined as a crime. The Oxford Dictionary of Sociology defines crime
as an offence which goes beyond the personal and into the public sphere, breaking
prohibitory rules or laws, to which legitimate punishments or sanctions are attached,
and which requires the intervention of a public authority Scott & Marshall (2009). A
crime can therefore be described as the outcome of disobeying the set of protocols
known as law. Many notable scholars have offered different definitions of crime, for
example: "An intentional act or omission in violation of criminal law, committed
without defense or justification and sanctioned by law as felony or misdemeanor"
Paul (2016).
Crime exists even in the least corrupt countries, but the aim should be a crime-free
nation. The aim of this project is to understand the patterns of crime in Ireland. The
motivation is to see how data analytics can help reduce crime rates and bring about
new reforms to curb crime in Ireland.
(Req-1) Does the unemployment in the country have any effect on increase or decrease in
rate of crime and population of Ireland?
(Req-2) Does the increase in population of counties cause any effect on the number of
crimes registered in the Garda Stations of the respective counties?
(Req-3) Does the immigration in the country have any effect on increase or decrease in
rate of crime in Ireland?
2 Data Sources
The source of the data plays a crucial role in understanding patterns in crime, and
alongside the quality of the data, its quantity allows a more concrete analysis. The
project used 2 unstructured and 5 structured datasets from different data sources to
compare different parameters of crime and provide conclusive, well-founded analytics
for the theme.
Source | Type | Brief Summary
Numbeo.com | Unstructured | Gives different parameters of crime committed in each county of Ireland.
Wikipedia.org | Unstructured | Gives the county-wise population of Ireland.
databank.worldbank.org | Structured | Gives the unemployment percentage of males and females by basic and advanced education.
Statista.com | Structured | Provides the year-wise (2007-2017) population of Ireland.
cso.ie | Structured | Gives various crime offences as quarterly data for 2007-18, which is used to calculate year-wise total crime offences.
Kaggle.com | Structured | Provides criminal offences committed and recorded year-wise in 563 different Garda stations, by county.
dbei.gov.ie | Structured | Gives county-wise work permits issued by the companies for the year 2018.
Table 2: Summary of sources of data used in the project
2.1 Source 1: Numbeo - Crime in Ireland county-wise:
This is the most important unstructured dataset in the project, as it gives the
different parameters of crime committed in each county of Ireland. It covers all
levels of crime, expressed as percentages, and also indicates the intensity of the
crime rate with labels such as Low, Moderate and High. Using the XPath of the web
page, all levels of crime, together with the crime index, were extracted in RStudio.
In this way, data for 15 counties was extracted from 15 Numbeo web pages.
Figure 2: Source 2
2.3 Source 3: Databank - Unemployment: percentage with basic and advanced education
This dataset covers unemployed males and females in Ireland who have completed a basic
or an advanced education. It is used to understand how unemployment can be a cause of
rising or falling crime rates in Ireland. Unemployment for both males and females is
given as a percentage of the total labor force of Ireland. The parameters mentioned
above were selected on the website linked below, and the data was downloaded in .xlsx
format. The source contains 5 columns of information for the years 2007-2018. The
World Bank website does not state when the data was created or updated; however, the
source contains data for 2018, which falls within the project's one-year requirement.
https://databank.worldbank.org/data/source/world-development-indicators
2.4 Source 4: Statista - Population Growth in Ireland:
This is an important dataset from Statista which shows the population growth in
Ireland over the past 10 years. It is important for understanding the pattern of
population growth and its effect on the rise or fall of criminal activity in the
state. The dataset was released in November 2018 and gives the population growth
percentage compared to the previous year. For uniformity with the other datasets, the
2018 figure was added from the source mentioned in the details section of the linked
page. Moreover, the percentage data was converted into population in millions, taking
the 2006 population in millions Irish population analysis (2019) as the base, with the
calculations done in Excel. Data cleaning was then done in RStudio.
https://www.statista.com/statistics/376895/population-growth-in-ireland/
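The conversion from year-on-year growth percentages to absolute population can be sketched as a compounding calculation. The report did this step in Excel; the Python version below is only an illustration of the arithmetic, and the base population and growth rates are placeholder numbers, not the Statista figures. The function name is hypothetical.

```python
def population_from_growth(base_population, growth_percentages):
    """Compound year-on-year growth percentages onto a base population.

    base_population: population in millions for the year before the first
        growth figure (the report uses 2006 as this base year).
    growth_percentages: list of (year, percent growth vs. previous year).
    Returns a list of (year, population in millions rounded to 2 decimals).
    """
    population = base_population
    result = []
    for year, pct in growth_percentages:
        # Each year's population is the previous year's scaled by its growth.
        population = population * (1 + pct / 100.0)
        result.append((year, round(population, 2)))
    return result


# Placeholder inputs for illustration only (not the real Statista values).
example_growth = [(2007, 2.0), (2008, 1.0), (2009, -0.5)]
example_series = population_from_growth(4.0, example_growth)
for year, pop in example_series:
    print(year, pop)
```

Rounding is applied only to the reported values, so rounding error does not accumulate from one year to the next.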
Figure 3: Source 4
2.5 Source 5: Central Statistics Office - Crime Recorded in Ireland:
The third dataset, used together with the two datasets above, covers the crimes
recorded in Ireland over the last decade. The CSO has released this data publicly for
the awareness of citizens, which makes it very suitable for understanding how crimes
are registered each year and how they depend on the unemployment rate and population
growth. From the CSO link below, the key table CJQ01 was used, which gives quarterly
data for 75 types of recorded crime offence. The data was published on 22/03/2019
11:00:00 by the CSO website. The quarterly data was then converted to yearly data in
RStudio to maintain uniformity.
https://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/Define.asp?MainTable=CJQ01&TabStrip=Select&PLanguage=0&FF=1
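The quarter-to-year conversion amounts to grouping the quarterly counts by year and summing them. The project did this in R; the following Python sketch shows the same idea under the assumption that quarters are labelled like "2007Q1" (the label format and the counts are placeholders, not taken from the CSO table).

```python
from collections import defaultdict


def quarterly_to_yearly(quarterly_counts):
    """Sum quarterly offence counts into yearly totals.

    quarterly_counts maps labels such as "2007Q1" to a count. The label
    format is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    for label, count in quarterly_counts.items():
        year = int(label[:4])  # the first four characters hold the year
        totals[year] += count
    return dict(sorted(totals.items()))


# Placeholder counts for illustration only (not CSO figures).
example_quarters = {
    "2007Q1": 10, "2007Q2": 12, "2007Q3": 9, "2007Q4": 11,
    "2008Q1": 8, "2008Q2": 7,
}
example_yearly = quarterly_to_yearly(example_quarters)
print(example_yearly)
```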
Figure 4: Source 5
2.6 Source 6: Kaggle - Crimes at Ireland Garda Stations 2007-2017
This file consists of the original variables for the criminal offences committed and
recorded year-wise in 563 different Garda stations. The dataset helps us understand
what types of crime were committed, with the station ID where they were reported, over
the decade from 2007 to 2017. The data was created on 26/3/2019 and contains 4 columns
of information for each county, with recorded offences per year.
https://www.kaggle.com/johnpwatson/crimesatirelandgardastations20072017
Figure 5: Source 6
2.7 Source 7: Department of Business Enterprise and Innovation - Employment Permit
Statistics 2018
This is statistical data collected from the Irish Government on the county-wise work
permits issued by the companies for the year 2018. It is part of an open publication
by the Republic of Ireland for public access. The dataset helps us understand
immigration into each county of Ireland, including a detailed picture of the permits
granted and not granted to individuals. The data was released on 16/01/2019 and
contains 5 columns of information for each county.
https://dbei.gov.ie/en/Publications/Employment-Permit-Statistics-2018.html
Figure 6: Source 7
3 Related Work
According to research on crime records in Ireland, an increase in population has a
significant relationship with an increase in the number of crimes. In their paper The
Economics of Crime in Republic of Ireland: An exploratory paper, Peter Bacon and
Martin O'Donoghue explore the possibility of applying models developed elsewhere to an
analysis of rising crime rates in Ireland. Their analysis also indicates that rising
unemployment is associated with an increase in crimes against property with violence,
and with a decrease in crimes against property without violence. This motivates the
question of whether unemployment is a cause of increasing crime rates in the Republic
of Ireland Bacon & O’Donoghue (1975).
Taking this as a hypothesis, other parameters that may also contribute to rising crime
rates in Ireland were examined. One of the factors considered here is the rate of
unemployment in Ireland over ten years. In the paper The Irish labour market and the
great recession, Alan Barrett and Seamus McGuinness found that the recession of
2008-09 caused a drastic drop in employment, and the crime dataset showed that crime
rates were rising during the recession period. The data suggested that, to earn a
livelihood during the recession, people resorted to criminal activity in this period
more than in any other year Barrett & McGuiness (2012). Moreover, in The Impact of
Ireland's Recession on the Labour Market Outcomes of its Immigrants, Alan Barrett and
Elish Kelly report that employment permits plummeted during this period, which
motivated me to research this area in order to understand the patterns in the
mentality of criminals before they commit crimes Barrett & Kelly (2012).
To check these findings for the last decade, the Databank source was used for
unemployment, the crime dataset was taken from the CSO, and the Statista source was
used to check for any major change in population. Barrett and Kelly also discuss the
decline in the immigrant population, which motivated me to examine the population of
the Irish counties, taken from Wikipedia, and the county-wise total employment
permits, taken from the DBEI government website, and to compare these with the county
crime index dataset, which is based on people's perceptions and was scraped from the
Numbeo website. Furthermore, the CSO's review of the quality of crime statistics
explains how statistics on recorded crime play a vital role in informing society of
the level and types of offence CSO Review of the Quality of Crime Statistics 2016
(2016); this motivated me to examine the Garda station records dataset, which was
sourced from Kaggle.
4 Data Model
Data warehousing is a technique for collecting and managing data from varied sources
to provide meaningful business insights. It is a blend of technologies and components
which allows the strategic use of data. It is the electronic storage of a large amount
of information by a business, designed for query and analysis instead of transaction
processing. It is a process of transforming data into information and making it
available to users in a timely manner to make a difference What Is Data Warehousing?
Types, Definition Example (n.d.).
There are 2 approaches that can be followed when building a data warehouse:
1. Inmon 2. Kimball
When it comes to designing a data warehouse for a business, the two most commonly
discussed methods are the approaches introduced by Bill Inmon and Ralph Kimball. In
Bill Inmon's enterprise data warehouse approach (the top-down design), a normalized
data model is designed first; the dimensional data marts, which contain the data
required for specific business processes or departments, are then created from the
data warehouse. In Ralph Kimball's dimensional design approach (the bottom-up design),
the data marts facilitating reports and analysis are created first; these are then
combined to create a broad data warehouse George (2019).
In this project, Kimball's approach was used to design the dimensional model for the
data warehouse. The reason is that it occupies less space, is easier to manage and is
faster than Inmon's approach. Ralph Kimball always supported the inclusion of end
users in the process throughout his work Chhabra & Pahwa (2014). This approach suits
my research project, as the queries are intended to help make society a better place
to live. If the fact table were expected to change in future, Inmon's approach would
be preferable, but in our situation no such changes are anticipated. Kimball's
methodology is therefore the better alternative for this project.
To accomplish the goals of the project, the datasets were joined on year (2007-2018),
which is common to the Databank, CSO and Statista datasets, whereas the unstructured
datasets from Wikipedia and Numbeo were joined with the DBEI and Kaggle datasets,
which are based on counties for the year 2018. From these datasets, 2 dimensions were
derived: DimYear and DimCounty. Both dimensions are discussed below:
DimYear: The year dimension was created because year is a parameter common to all
datasets. The CSO dataset contains crime offences by year, Databank contains year-wise
unemployment percentages for males and females by basic and advanced education, and
Statista contains the year-wise population data. Hence, to relate the unemployment
percentage to crime offences, year is the common parameter, and Year is the primary
parameter in this data warehouse project. Using SSIS, YearId was generated and
assigned as the primary key of the year dimension.
DimCounty: The county dimension was created because county is common to Numbeo and
Wikipedia, the unstructured datasets, and also to Kaggle and DBEI, the structured
datasets. County is the common parameter on which the crime rate from Numbeo and the
employment permits are compared. Hence, County is also a primary key parameter in this
data warehouse. Using SSIS, CountyID was generated and assigned as the primary key of
the county dimension.
The figure below illustrates the star schema for this project, which is further used
in the Tableau visualization:
Figure 7: Star Schema
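A star schema of this shape could be declared roughly as follows. This is a sketch in Python using an in-memory SQLite database rather than the SQL Server and SSIS setup used in the project. DimYear, DimCounty, YearId and CountyID come from the text above; the fact table name FactCrime and its two measure columns are assumptions chosen for illustration, since the project's actual table definitions are not shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE DimYear (
    YearId INTEGER PRIMARY KEY,
    Year   INTEGER NOT NULL UNIQUE
);
CREATE TABLE DimCounty (
    CountyID INTEGER PRIMARY KEY,
    County   TEXT NOT NULL UNIQUE
);
-- Hypothetical fact table: each row references one year and one county.
CREATE TABLE FactCrime (
    YearId       INTEGER NOT NULL REFERENCES DimYear(YearId),
    CountyID     INTEGER NOT NULL REFERENCES DimCounty(CountyID),
    TotalPermit  INTEGER,
    LevelOfCrime REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created, in alphabetical order.
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(table_names)
```

The two dimension tables hold the descriptive keys, and the fact table holds only foreign keys plus numeric measures, which is what gives the schema its star shape.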
5 Logical Data Map
Logical Data Map: The dimensions and the facts of all the datasets are explained below:
Table 3: Sources, destinations and transformations for all components in the Logical Data Map
Source | Source Column | Destination | Destination Column | Type | Transformation
1 | Year | DimYear | Year | Dimension | Primary key; year 2018 added to the table, as given on the website
1 | County | DimCounty | County | Dimension | Primary key; county names renamed for consistency
1 | Level of crime | FactTable | Level of crime | Fact | Rounded to 2 decimal places
1 | Safety during daylight | FactTable | Safety during daylight | Fact | Rounded to 2 decimal places
1 | Total Crime Record | FactTable | Total Crime Record | Fact | Rounded to 2 decimal places
2 | Year | DimYear | Year | Dimension | Primary key; year 2018 added to the table, as given on the website
2 | Administrative County | DimCounty | County | Dimension | Primary key; county names renamed for consistency with the other datasets
2 | Population | FactTable | Population | Fact | Commas removed from the values
3 | Year | DimYear | Year | Dimension | Primary key; format matched with the other datasets (originally in "2007 [YR2007]" format)
3 | Labor force, total | FactTable | Total Labor Force in Million | Fact | Rounded to 3 decimal places
3 | Unemployment with basic education, male (% of male labor force with basic education) | FactTable | Unemployment Basic Education Male Percentage | Fact | Rounded to 2 decimal places
3 | Unemployment with basic education, female | FactTable | Unemployment Basic Education Female Percentage | Fact | Rounded to 2 decimal places
3 | Unemployment with Advance education, female | FactTable | Unemployment Advance Education Female Percentage | Fact | Rounded to 2 decimal places
3 | Unemployment with Advance education, male | FactTable | Unemployment Advance Education Male Percentage | Fact | Rounded to 2 decimal places
4 | Year | DimYear | Year | Dimension | Primary key; year 2018 added from the source mentioned in the description
4 | Population growth compared to previous year | FactTable | Percentage Rise | Fact | Rounded to 2 decimal places
4 | Population Million | FactTable | Population in Million | Fact | Rounded to 2 decimal places
5 | Year (Quarter-wise) | DimYear | Year | Dimension | Quarterly periods converted to years, then transposed to match the other datasets
5 | Total Offences | FactTable | Total Offences | Fact | Offence-count columns summed into Total Offences
5 | Revenue | FactTable | Revenue | Fact | Rounded to nearest million
6 | Year | DimYear | Year | Dimension | Year 2018 taken from the table; column repositioned
6 | Station | DimCounty | County | Dimension | Taken as is from the source
6 | Number_of_Crime_Record | FactTable | Total Crime Record | Fact | Commas removed from the values
7 | Year | DimYear | Year | Dimension | Taken as is from the source
7 | County/Country | DimCounty | County | Dimension | Extra spaces removed from the column
7 | New | FactTable | New | Fact | Taken as is from the source
7 | Renewal | FactTable | Renewal | Fact | Taken as is from the source
7 | Total | FactTable | Total Permit | Fact | Extra commas removed from the values
7 | Refused | FactTable | Refused | Fact | Taken as is from the source
7 | Withdrawn | FactTable | Withdrawn | Fact | Taken as is from the source
6 ETL Process
ETL stands for Extract, Transform and Load. It is considered the foundation of a
database or data warehouse, helping to reduce errors and minimize data loss. A
high-level perspective of the system can be visualized through conceptual modelling of
the ETL process, which brings advantages such as system error identification, cost
minimization, and risk and scope assessment Biswas et al. (2019). In the ETL process,
the data has to be cleaned and transformed according to the requirements, taking care
to remove all redundancy in the information. The ETL procedure used in this project is
as follows:
Figure 8: ETL Process
6.1 Extraction:
The structured datasets were downloaded from the CSO, Statista, Databank, DBEI and
Kaggle websites in Excel format and cleaned in R, whereas the unstructured datasets
were scraped with R from the Wikipedia and Numbeo websites.
The first structured dataset was the CSO data on recorded crime in Ireland. It had a
total of 48 quarterly columns for 2007-2018, which were converted to years by summing
the quarters. The second dataset, from Statista, gave Ireland's population growth from
2007-2017. The third structured source was Databank, with unemployment percentages by
gender. The extracted Excel file contains 6 columns: Year (2007-2018), labor force,
unemployment percentage with basic education (male and female) and unemployment
percentage with advanced education (male and female). The fourth structured dataset
was taken from the DBEI website in .xlsx format. It covers county-wise employment
permits by company for the year 2018 and contains 27 rows of counties and 7 columns,
such as year, county, new permits, renewed permits, total permits, and refused and
withdrawn permits from the Irish embassy. The fifth dataset, from Kaggle, holds Garda
station crime records for each county. The Excel file initially had three sheets; the
third sheet, which gives the year-wise county station records, was used. The table
contains 5 columns: station id, station, division, year and number of crime records.
For the unstructured data, Numbeo was the first source. Web scraping was done on 15
web pages, as each web page gives the crime levels for one county. Data for the
remaining counties could not be scraped because no information was available for them.
Several R packages and functions were used to extract the data for the 15 counties.
The data was selected using the XPath from the website. All the pages were then
appended into a data frame, and unwanted columns and rows were deleted. For
consistency, county names were renamed to match the other datasets.
For the second unstructured dataset, population data was extracted from Wikipedia. The
dataset contains 6 columns: Rank, Administrative County, Population, Density,
Traditional province and change since the previous census. Several R libraries were
used for the web scraping, and the data was extracted from Wikipedia using htmltab.
Unwanted columns were then removed and column names were renamed for consistency
across the datasets.
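The per-county XPath extraction and the step of appending every page into one table can be illustrated as below. The project used R packages for this; the sketch instead uses the limited XPath subset in Python's standard xml.etree.ElementTree on an inline, well-formed stand-in snippet. The markup, the class name and the numbers are invented for illustration and do not reflect Numbeo's actual page structure or values, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for one scraped page. The real Numbeo
# markup is different and is not reproduced here.
PAGE = """
<html><body>
  <table class="crime">
    <tr><td>Level of crime</td><td>40.00</td><td>Moderate</td></tr>
    <tr><td>Safety during daylight</td><td>75.00</td><td>High</td></tr>
  </table>
</body></html>
"""


def extract_rows(page_source, county):
    """Select table rows with an XPath expression and tag them with a county."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for tr in root.findall(".//table[@class='crime']/tr"):
        cells = [td.text for td in tr.findall("td")]
        rows.append({
            "county": county,
            "measure": cells[0],
            "value": float(cells[1]),
            "level": cells[2],
        })
    return rows


# One call per county page; the results are appended into a single list,
# mirroring the step of appending all pages into one data frame.
all_rows = []
for county_name in ["Dublin", "Cork"]:
    all_rows.extend(extract_rows(PAGE, county_name))
print(len(all_rows), all_rows[0])
```

Real web pages are rarely well-formed XML, so an actual scraper would need an HTML-tolerant parser; the sketch only shows the select-then-append pattern.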
6.2 Transformation:
After extraction and before loading into the data warehouse, the data should be made
fit to meet the business requirements. Data transformation may include activities such
as cleaning, joining and generating calculated values from existing ones. This part of
the ETL procedure is the most critical and laborious and takes a great deal of time,
since the cleanest possible data is needed to deliver an accurate business solution.
All the structured datasets were in .xlsx format. I first identified the fields
required from these datasets and those that needed to be modified for my BI queries,
and removed extra columns and spaces. The CSO dataset was given in quarters, so it was
converted to years in R for consistency. The data also had 76 rows of offence counts,
which were summed to find the total offences per year. The table was then transposed
with the t() function, so that years become rows, to maintain consistency across all
datasets. All the quarters and types of offence that were not required were then
dropped from the table.
For Statista, for uniformity with the other datasets, the 2018 figure was added from
the source mentioned in the details section of that web page. Moreover, the percentage
data was converted into population in millions, taking the 2006 population in millions
Irish population analysis (2019) as the base, with the calculations done in Excel.
Data cleaning, such as removing extra spaces and columns, was then done in RStudio.
For the third dataset, from Databank, cleaning and formatting were done in RStudio to
maintain consistency. In the DBEI dataset, columns were renamed, extra columns and
spaces were removed, and some null values were removed as well.
The Kaggle dataset contained many null values, and to get the desired 2018 data for
the BI query, unwanted rows and columns were removed. The is.na(x) function was used
for this, with the not-equal operator used in the condition for removing the unwanted
rows. Crime records for 2018 were extracted from the table by removing the rows for
the unwanted years in RStudio. Extra spaces and columns were then removed from the
table to maintain consistency across the datasets.
The challenging task was scraping data from the Numbeo website: each page represents
one county and data was available for 15 counties, so 15 pages had to be scraped. This
was made easier by R packages such as rvest, magrittr, RSelenium, httr, dplyr and
data.table. A function was used to extract the data for the 15 counties. The data was
selected using the XPath from the website. All the pages were then appended into a
data frame, unwanted columns and rows were deleted, and county names were renamed to
match the other datasets.
For Wikipedia, the population data was scraped using htmltab and data.table. The
dataset contains 6 columns; the unwanted columns, namely Rank, Density, Traditional
province and change since the previous census, were dropped, and the remaining column
names were renamed for consistency across the datasets.
All the above changes were made in RStudio, and all the code used to obtain the
results is given in the Appendix. Indexing was done in SQL Server Management Studio
(SSMS) for the database, which was named Ireland Crime for this project. In each
table, I chose an attribute with a unique value for each row and assigned it as the
primary key for that table. By default, SSMS assigns varchar(50) as the data type for
each attribute; for year, which is a numeric value, I changed the data type to int
(integer).
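The cleaning steps described in this section (dropping rows with missing values, keeping only the required year, trimming extra spaces and removing thousands separators) can be sketched as follows. The project's own code is in R; this Python version is only an illustration, and the field names and rows are placeholders rather than the real Kaggle or DBEI data.

```python
def clean_rows(raw_rows, keep_year):
    """Apply the cleaning steps described above to a list of row dicts."""
    cleaned = []
    for row in raw_rows:
        # Drop rows with a missing value (the role is.na() plays in R).
        if row["county"] is None or row["records"] is None:
            continue
        # Keep only the year needed for the query.
        if row["year"] != keep_year:
            continue
        cleaned.append({
            "county": row["county"].strip(),  # remove extra spaces
            "year": int(row["year"]),
            # Remove thousands separators, e.g. "1,234" becomes 1234.
            "records": int(str(row["records"]).replace(",", "")),
        })
    return cleaned


# Placeholder rows for illustration only.
raw = [
    {"county": " Dublin ", "year": 2018, "records": "1,234"},
    {"county": "Cork", "year": 2017, "records": "900"},
    {"county": None, "year": 2018, "records": "50"},
    {"county": "Galway", "year": 2018, "records": None},
]
cleaned_rows = clean_rows(raw, keep_year=2018)
print(cleaned_rows)
```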
6.3 Loading:
The last step in the ETL process is loading the transformed data into the end target.
This process varies widely depending on the requirements of the organization. Some data
warehouses overwrite existing information with cumulative information, and extracted data
is frequently updated on a daily, weekly, or monthly basis (Extract, transform, load 2019).
After creating the Ireland Crime database, all the data had to be loaded into the staging
area of the SQL Server database (managed through SSMS) using SSIS. A data flow task was
used to complete this process. The Flat File Source component read each csv file, and the
OLE DB Destination component created the table and loaded the data into the database. In
the Flat File Source, we need to give the name of the table to be created and the
transformed csv output file. The text qualifier is then set to inverted commas (quotation
marks) so that quoted values are read as single fields. The preview section shows the file
in tabular format. The Advanced section provides options to change the data types, which
is important for loading the data correctly. In the connection manager of the OLE DB
Destination, we can compose a SQL query to create the new table, or SSIS can suggest a
suitable query based on the flat file source. This query can be modified from the New
option in the OLE DB Destination component, and any errors are shown in the error section.
In all I had 7 flat files, so 7 data flow tasks were created.
After the files were transferred to the database, the dimension tables had to be created.
An Execute SQL Task was used for this step. A SQL script written in the task itself
creates two dimension tables from the raw tables that were created by the data flow tasks.
Dimension tables provide the context for all the measurements held in the fact tables of
the data warehouse. Although dimension tables are usually much smaller than fact tables,
they are the heart of the data warehouse because they provide the entry points to the data
(Kimball & Ross 2011).
Next, another SQL task was created to populate the fact table. Its SQL script inserts the
data from the raw tables using inner joins on the dimension tables. This was quite a
challenging task, because populating the fact table with the proper values and joining the
tables correctly required careful thought. After considering the various types of join, I
found that the inner join was the most suitable for my data.
After the fact table was populated, the cube was deployed, which produced the star schema.
I connected this cube to SSIS for automation. After successful deployment, I checked that
all the values in the cube were correct using the explore option. With the values
confirmed, the data was ready to be visualized in a visualization tool, which for this
project is Tableau.
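The dimension and fact loading steps described above can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite database, not the SSIS Execute SQL Task script used in the project. The table and column names (raw_crime, dim_county, dim_year, fact_crime) and the sample rows are hypothetical.

```python
import sqlite3


def build_star_schema(raw_rows):
    """Load hypothetical raw rows, then build two dimensions and one fact table."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Staging table, standing in for a raw table created by a data flow task.
    cur.execute("CREATE TABLE raw_crime (county TEXT, year INTEGER, offences INTEGER)")
    cur.executemany("INSERT INTO raw_crime VALUES (?, ?, ?)", raw_rows)

    # Dimension tables: one row per distinct county and per distinct year.
    cur.execute("CREATE TABLE dim_county (county_id INTEGER PRIMARY KEY, county TEXT UNIQUE)")
    cur.execute("CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, year INTEGER UNIQUE)")
    cur.execute("INSERT INTO dim_county (county) SELECT DISTINCT county FROM raw_crime ORDER BY county")
    cur.execute("INSERT INTO dim_year (year) SELECT DISTINCT year FROM raw_crime ORDER BY year")

    # Fact table: the inner joins replace natural keys with surrogate keys.
    cur.execute(
        "CREATE TABLE fact_crime ("
        "county_id INTEGER REFERENCES dim_county(county_id), "
        "year_id INTEGER REFERENCES dim_year(year_id), "
        "offences INTEGER)"
    )
    cur.execute(
        "INSERT INTO fact_crime (county_id, year_id, offences) "
        "SELECT c.county_id, y.year_id, r.offences "
        "FROM raw_crime AS r "
        "INNER JOIN dim_county AS c ON c.county = r.county "
        "INNER JOIN dim_year AS y ON y.year = r.year"
    )
    conn.commit()
    return conn


# Made-up sample rows for illustration only; not data from the project.
SAMPLE_ROWS = [("Dublin", 2017, 10), ("Dublin", 2018, 12), ("Cork", 2018, 5)]

if __name__ == "__main__":
    connection = build_star_schema(SAMPLE_ROWS)
    query = (
        "SELECT c.county, y.year, f.offences FROM fact_crime AS f "
        "JOIN dim_county AS c USING (county_id) "
        "JOIN dim_year AS y USING (year_id) "
        "ORDER BY c.county, y.year"
    )
    for row in connection.execute(query):
        print(row)
```

Because both dimensions are derived from the same raw table in this sketch, the inner joins match every raw row. With dimensions built from a different source, an inner join would silently drop unmatched rows, so comparing row counts before and after the load is a useful check.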
7 Application
To make them easier to understand, the Business Intelligence queries derived from the
business requirements discussed in Section 1 are visualized with the help of Tableau by
mapping the 3 entities separately in a graph to reveal the patterns in the data.
7.1 BI Query 1: Does the unemployment in the country have
any effect on increase or decrease in rate of crime and
population of Ireland?
To answer this query, it is necessary to understand how unemployment has affected
Ireland in recent years. For that, the data used should run from 2007 to the present.
The factors taken into consideration are as follows:
- Unemployment in Ireland among working adults (men and women) in the last decade
(2007-2017)
- Total population across the last decade (2007-2017)
- Total number of offenses that occurred in the last decade (2007-2017)
1. Unemployment in Ireland among working adults: The unemployment data has been
extracted from The World Bank, which gives complete details of unemployment among
citizens who have received basic education relative to those who have received advanced
education, across the whole labor force available in Ireland. This helps in filtering
the non-working population from the year 2007 till 2018 and in identifying any variation
in unemployment across those years.
2. Total population across last decade: The cumulative population has been extracted
from Wikipedia for the years 2007 till 2017 to understand the growth of the population
of Ireland. This shows by how much the population has increased or decreased, so that
it can be compared with the total number of crimes and the unemployment in Ireland.
3. Total number of offenses occurred in the decade: This dataset comes from the Central
Statistics Office as a structured dataset that contains all the offenses recorded in
the past decade (2007 till 2017). It helps to understand the number of crimes occurring
with respect to the population and unemployment in Ireland.
The data was extracted, presented and visualized with the help of Tableau. Below is
the visualization received from execution of the first BI query.
Figure 9: 1st BI Query
7.2 BI Query 2: Does the increase in population of counties
cause any effect on the number of crime rates registered in
the Garda Stations of the respective counties?
To understand this, the datasets that have been used are as follows,
- Garda station crime record (2018)
- Population of Ireland based on counties (2018)
1. Garda station crime records: This dataset was obtained from Kaggle and lists, county
by county, all the Garda Stations of Ireland that had criminal cases registered for the
year 2018. It is useful for understanding the crimes registered across the counties,
which are then compared with the population of the counties in Ireland.
2. Population of Ireland based on counties: This dataset was scraped from Wikipedia,
which has been updated with the county-wise population of Ireland for the year 2018.
It helps to understand the pattern of population density across Ireland.
The data was extracted, presented and visualized with the help of Tableau. Below is
the visualization received from execution of Second BI query.
Figure 10: 2nd BI Query
7.3 BI Query 3: Does the immigration in the country have any
effect on increase or decrease in rate of crime in Ireland?
Immigration helps any country to develop in various fields such as technology,
science and infrastructure. It is nevertheless a matter of debate among residents, as
some people think that immigration takes jobs away from the local people of
the country. This can be tested with the help of the facts and data available. To verify
the validity of this point, three datasets have been used, namely,
- Population of Ireland based on county.
- Crimes in Ireland based on county.
- Employee permits to the immigrants provided in every county.
1. Population of Ireland based on county: Earlier, the population was considered year
by year for the whole country. Here the population of each county is used as a
comparative factor against the crimes occurring in Ireland. This dataset was scraped
from Wikipedia, which has been updated with the county-wise population of Ireland for
the year 2018. It helps to understand the pattern of population density across Ireland.
2. Crimes in Ireland based on county: The dataset for crimes in Ireland was scraped
from Numbeo.com, which covers a wide range of crimes committed in Ireland for the year
2018. This data makes it possible to break the crime figures for a specific year down
to county level and to examine how crime varies between counties.
3. Employee permits provided to the immigrants on county level for 2018: The dataset
gives the details of the permits given to immigrants in Ireland. It helps to understand
how many permits were granted by the Irish government for every county, thus giving an
overall picture of immigration.
The data was extracted, presented and visualized with the help of Tableau. Below is
the visualization received from execution of Third BI query.
Figure 11: 3rd BI Query
These BI queries have been detailed further in the Discussion section.
7.4 Discussion
From the 1st BI query, it can be seen that from 2007 to 2010 there was a steep increase
in the rate of unemployment in Ireland, accompanied by an increase in crime rates. This
can be linked with the global recession that hit the whole world (Great Recession n.d.).
From 2007 to 2010, the recession affected Ireland as well and caused widespread
joblessness. The available data shows a steep increase in crime rates during the same
period. For the years 2010-2014, there is no notable change in unemployment or in crimes
committed, but from 2015 till 2018 there was a large decrease in the unemployment rate,
which was matched by a decrease in the rate of crimes committed. It can therefore be
concluded that the rate of crime varies in direct proportion to unemployment.
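The strength of the relationship described above could be checked numerically with a Pearson correlation coefficient. The sketch below is in Python and is not part of the original Tableau analysis. The function name pearson_r and the two short series are assumptions for illustration, and the numbers are placeholders, not values from the World Bank or Central Statistics Office datasets.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("both series need the same length, with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder yearly series for illustration only; not project data.
unemployment_rate = [1.0, 2.0, 3.0, 4.0]
recorded_offences = [10.0, 20.0, 30.0, 40.0]

if __name__ == "__main__":
    print(round(pearson_r(unemployment_rate, recorded_offences), 3))
```

A coefficient close to 1 would support the reading that the two series move together over the years, although with only about a dozen yearly points it indicates association rather than cause.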
The population increased steadily over the period from 2007 to 2018, but there does
not seem to be any relation between the overall population of the country and the
crimes committed. This is examined further in the 2nd BI query, where the population of
each county is compared with the crimes registered in the Garda Stations of that
county. The datasets show that as the population of a county increases, the number of
crimes registered in that county increases as well. Thus, there is a relation between
the population of a county and the crimes committed or registered in it.
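One way to separate the size of a county from its level of crime is to express registered crimes per 1,000 residents. The following is a minimal sketch in Python, assuming two hypothetical dictionaries keyed by county name. The function name crimes_per_1000, the county labels and the figures are placeholders rather than values from the Kaggle or Wikipedia datasets.

```python
def crimes_per_1000(population_by_county, crimes_by_county):
    """Return registered crimes per 1,000 residents for counties present in both inputs."""
    rates = {}
    for county, population in population_by_county.items():
        # Skip counties with no crime record or an unusable population figure.
        if county in crimes_by_county and population > 0:
            rates[county] = 1000 * crimes_by_county[county] / population
    return rates


# Placeholder figures for illustration only; not project data.
population = {"County A": 200000, "County B": 50000}
crimes = {"County A": 4000, "County B": 1500, "County C": 10}

if __name__ == "__main__":
    for county, rate in sorted(crimes_per_1000(population, crimes).items()):
        print(county, rate)
```

In this made-up example the larger county has more registered crimes in total but a lower rate per resident, which is the kind of distinction a raw count comparison cannot show.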
The 3rd BI query deals with the effect of immigration on the rate of crime in Ireland.
When the datasets were mapped in Tableau, an interesting point emerged: Dublin, the
capital of Ireland, has the highest number of immigration filings, yet fewer crimes are
committed in the county. Compared with Donegal, which has the highest number of criminal
cases registered even though it has fewer immigration cases filed, this indicates that
there is no significant relation between immigration in Ireland and the crimes reported
in Ireland. On the contrary, it can be inferred that the greater security in Dublin
keeps its crime rates lower than those of Donegal, which is a remote county and is less
secure than Dublin and developed counties like Cork, Galway and Limerick.
8 Conclusion and Future Aspects
At the start of the project, the premise was that although crime rates in Ireland are
low, there is still a small amount of unrest that keeps Ireland from being one of the
least corrupt countries in the world. With crime as one of the factors, the sub-factors
that affect crime have been examined and verified against the data. Unemployment,
immigration and population growth were considered as factors that may pave the way
towards criminal activity in Ireland. Unemployment and population influenced the crime
rate, but immigration was not found to contribute to its rise. This shows that even
though Ireland has its own problems to deal with, such as border issues with Northern
Ireland or business continuity with Britain after Brexit, it is still a peace-loving
country that encourages a positive attitude towards immigration and development.
This project is helpful for future work because it deals with criminal analytics. Every
country is affected by crime, and Ireland is no different. Recently, UK police
collaborated with Accenture to perform criminal analytics to understand the patterns of
crime in the UK. This helped the UK considerably, as the system could give rough
information about the culture of criminals and identify how and when attacks would
occur in disturbed areas. This project is termed "The Enterprise Approach to Law
Enforcement" (Accenture Police Center of Excellence n.d.). The same concept could be
implemented in Ireland and pave the way towards a safer and crime-free country.
References
Accenture Police Center of Excellence (n.d.).
URL: https://www.accenture.com/gb-en/insight-enterprise-approach-to-law-enforcement
Bacon, P. & O’Donoghue, M. (1975), ‘The economics of crime in the Republic of Ireland:
An exploratory paper’, Economic and Social Review 7(1), 19.
Barrett, A. & Kelly, E. (2012), ‘The impact of Ireland’s recession on the labour market
outcomes of its immigrants’, European Journal of Population/Revue européenne de
Démographie 28(1), 91–111.
Barrett, A. & McGuiness, S. (2012), ‘The Irish labour market and the great recession’,
CESifo DICE Report 10(2), 27–33.
Biswas, N., Chattapadhyay, S., Mahapatra, G., Chatterjee, S. & Mondal, K. C. (2019),
‘A new approach for conceptual extraction-transformation-loading process modeling’,
International Journal of Ambient Computing and Intelligence (IJACI) 10(1), 30–45.
Chhabra, R. & Pahwa, P. (2014), ‘Data mart designing and integration approaches’,
International Journal of Computer Science and Mobile Computing 3(4), 74–79.
CSO Review of the Quality of Crime Statistics 2016 (2016).
URL: http://www.cso.ie/en/media/csoie/releasespublications/documents/crimejustice/2016/reviewo
Extract, transform, load (2019).
URL: https://en.wikipedia.org/wiki/Extract,_transform,_load#Transform
George, S. (2019), ‘Inmon or Kimball: Which approach is suitable for your data
warehouse? 2019’.
Great Recession (n.d.).
URL: https://en.wikipedia.org/wiki/Great_Recession
Irish population analysis (2019).
URL: https://en.wikipedia.org/wiki/Irish_population_analysis
Kimball, R. & Ross, M. (2011), The data warehouse toolkit: the complete guide to
dimensional modeling, John Wiley & Sons.
Paul, T. (2016).
URL: https://www.cliffsnotes.com/study-guides/criminal-justice/crime/definitions-of-crime
Scott, J. & Marshall, G. (2009), A dictionary of sociology, OUP Oxford.
What Is Data Warehousing? Types, Definition Example (n.d.).
URL: https://www.guru99.com/data-warehousing.html
Appendix
R code used for cleaning and extraction:
1. Numbeo web scraping code
# Install packages
install.packages('rvest', repos = "https://cran.rstudio.com")
install.packages('magrittr', repos = "https://cran.rstudio.com")
install.packages('RSelenium', repos = "https://cran.rstudio.com")
install.packages('httr', repos = "https://cran.rstudio.com")
install.packages('dplyr', repos = "https://cran.rstudio.com")
# dplyr is the next iteration of plyr, focussed on tools for working with data
install.packages('data.table', repos = "https://cran.rstudio.com")
# Load packages
library('rvest')
library('magrittr')
library('RSelenium')
library('httr')
library('dplyr')
library('data.table')
# Define the web scraping function, used to scrape multiple pages
get_countdata <- function(keyword)
{
  url <- paste('https://www.numbeo.com/crime/in/', keyword, sep = "")
  # Read the HTML code from the website
  webpage <- read_html(url)
  # Get the name of the county
  County_crime <- html_nodes(webpage, '.columnWithName')
  # Convert the name nodes to text
  Countyname_data <- html_text(County_crime)
  # Have a look at the names
  head(Countyname_data)
  # Get the crime numbers
  Crimenumber_data <- html_nodes(webpage, '.indexValueTd')
  # Convert the crime number nodes to text
  Crimenumber_data <- html_text(Crimenumber_data)
  # Have a look at the crime numbers
  head(Crimenumber_data)