Database and Analytics Programming - Project report

Database and Analytics Programming
Sarthak Khare
School of Computing
National College of Ireland
Dublin, Ireland
Student ID: x18180485
Jayanta Behera
School of Computing
Dublin, Ireland
Darshana Gowda
School of Computing
Dublin, Ireland
Samruddhi Kanhere
School of Computing
Dublin, Ireland
Abstract—Crimes threaten social peace and also create panic
amongst the society. It is not only the responsibility of law
enforcement agencies to maintain law and order but also of
civilians to remain vigilant and report any unlawful activities in
their vicinity. In order to find a relationship between complaints
lodged at the stations, the number of arrests and court summons
and the prison admissions in the city of New York, we have
performed analysis on data for the year 2018. We have created
visualizations based on features that were common to all the 4
datasets. It has been observed that the number of complaints
lodged is the highest, followed by arrests, while the numbers of
court summons and prison admissions are significantly lower.
Analysis based on Age and Gender for all the 5 boroughs of New
York showed that there was a greater number of males as opposed
to females at every stage and that most of the alleged criminals
fell under the 25-44 age category. A comparative analysis of
the count of crimes per capita for all the boroughs revealed that
the highest crime rate occurred in the Bronx, followed by
Manhattan, Brooklyn, Staten Island and Queens.
Index Terms—Crime, New York, database, visualizations
I. INTRODUCTION
Regulation of crime rates and assurance of appropriate
justice is not only essential for the victims but also for
the society altogether. If justice is to prevail, and criminals
punished, crimes need to be reported in forms of complaints
and the same needs to be worked upon by the law enforcement
department to bring justice to the victims. As a part of this
project, our objective is to gain insights from the patterns of
the crimes that are accounted for, in the city of New York. We
will further explore the complaints, arrests, court summons and
prison admissions data and perform a comparative analysis
to understand the relationship between them. We will also
investigate the crime rates for the 5 boroughs in New York
City based on features like Gender, Age Group etc. Analysis
of crimes is essential in helping the law enforcement agen-
cies take effective measures for prevention and reduction of
crimes. It will allow them to gain a better perspective and
boost pre-emptive actions such as increased patrolling and
surveillance which would help reduce criminal activities. The
choice of data resonates with the objective of comparing and
individually analyzing the crime rate, the arrests and court
summons for the crimes as well as the incarcerations at a
given borough. The research question that we aim to answer is:
Are the number of prison admissions commensurate with the
number of complaints registered for various crimes committed
across the administrative districts of New York City?
II. RELATED WORK
Several studies have been performed on crime, and
visualizations have been created to find patterns and trends in
crimes based on location, time and type of crime, which help in
predicting whether a crime could happen at a certain location or
at a certain time of day or week. This has aided law enforcement
personnel in being more vigilant and taking preventive measures
to reduce the number of crimes. We have studied a variety of
such research work and attempted to find studies that are
similar or related to our objectives.
An analysis has been performed where clustering techniques
are used on Stop, Question and Frisk dataset. The analysis and
prediction done as a part of this research helped in identifying
locations which require higher amount of police patrolling [1].
Visualizations have been used by Bayoumi et al. to identify
the most common location, day of week and time of day for
different categories of crime. As per their analysis, crimes
against people occur more frequently at night as compared to
any other time of the day. Crimes against properties take place
late in the morning or early in the afternoon. These insights
were provided to the law enforcement personnel which could
help them take quick decisions [2].
Using big data analytics, Feng et al. discovered discerning
facts and patterns from the criminal data of three major cities
in the United States. Their aim was to help the police depart-
ment to understand crime in a better way, knowledge of which
can be used for crime detection and for undertaking preventive
measures [3]. Clustering techniques and Association rules
are also used in [4] for investigating crime data and finding
means to prevent the same. Formal concept analysis was
performed on crime data for different geographical locations.
Crimes were split into different categories based on common
attributes. This helped build a more defined model for crime
analysis based on geographical distribution [5].
Text analytics has also been used to perform crime analysis
by Ku, Nguyen and Leroy [6]. An efficient decision control
system was developed for the lesser trained security personnel
using natural language processing that provided an efficient
way to investigate crime with better accuracy.
Analysis was performed using geospatial data from the
years 2003 to 2015 for San Francisco [7], and it was observed
that the western coast of San Francisco was far safer than
the east coast. As per the analysis of crime over a
period, it was observed that crime occurred mostly during the
weekend. The authors identified the three most unsafe regions
of the city using Hotspot technology.
In order to deal with the crime rates, Shah et al. [8]
proposed a framework which would take crime related data
and transform it into visual reports. They have used graphical
representations for summarizing their findings. Live heatmaps
of locations with a high density of crimes were created, and
clustering algorithms were implemented on geographical
locations to identify patterns in crimes committed. These
assisted in taking proactive measures for reduction of crimes.
Several studies have also used predictive analysis to predict
future occurrences of crimes [9]. Based on the history of
criminal incidents, Sivanagaleela and Rajesh performed
clustering of criminal activities. They generated a pattern to
identify crime-prone areas based on data gathered prior to
occurrence, which would eventually help reduce incidents
related to crime.
All of these have been used for crime detection, where the
importance of big data analysis and data mining methods has
been emphasized and prediction of crime has been carried
out. However, a comparative analysis of the successive legal
steps has not been carried out as such. From our comparison of
the proportions of complaints and arrests against court summons
and imprisonment, we can see that the proportions vary by
a considerable amount. These results should help the law
enforcement and judiciary authorities to look back and identify
whether the reasons behind this need to be worked upon.
III. METHODOLOGY
To achieve the objectives of the project, a series of steps
has been followed. A diagrammatic representation of the
process flow can be seen in Fig. 1.
Fig. 1. Process Flow diagram
A. Data Collection
The first step in this process is gathering appropriate data.
The data for New York City for the year 2018 has been
extracted for analysis. An outline of the 4 related datasets
used for the project is as follows:
• Complaints: This dataset has all the criminal complaints
lodged by victims and witnesses in New York City. It has
about 450k records and 35 features.
• Arrests: This dataset contains information of all the
arrests that took place in the selected year of interest.
It has about 250k rows and 18 attributes.
• Court Summons: The dataset includes information of all
the criminal summons that happened. It has about 89k
rows and 16 features.
• Prison Admissions: This dataset contains information
about all the prison admissions. It has a little over 19k
rows and 9 attributes.
All the four datasets in JSON format are programmatically
extracted using open APIs. The first three datasets are obtained
from the New York open data (https://data.cityofnewyork.us/)
whereas the fourth dataset is from the data.gov website. Also,
the population data of New York for the year 2018 is web
scraped.
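The paged extraction described above might be sketched as follows.
This is only an illustration: the portal exposes Socrata-style JSON
endpoints, but the dataset identifier `abcd-1234`, the date column
`report_date`, the row total and the page size are hypothetical
placeholders, not the values used in the project.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource"  # portal named in the text


def page_urls(dataset_id, year, date_column, total_rows, page_size=50000):
    """Build one request URL per page for a single year of a dataset.

    dataset_id and date_column are hypothetical placeholders; $where,
    $limit and $offset are standard Socrata (SODA) query parameters.
    """
    where = (f"{date_column} >= '{year}-01-01T00:00:00' "
             f"AND {date_column} < '{year + 1}-01-01T00:00:00'")
    urls = []
    for offset in range(0, total_rows, page_size):
        query = urlencode({"$where": where,
                           "$limit": page_size,
                           "$offset": offset})
        urls.append(f"{BASE_URL}/{dataset_id}.json?{query}")
    return urls


urls = page_urls("abcd-1234", 2018, "report_date", total_rows=120000)
print(len(urls))   # one URL per page of results
print(urls[0])
```

Each URL could then be fetched with an HTTP client (for example
`requests.get(url).json()`) and the resulting list of records
passed on to the storage step.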
B. Unstructured Data Storage
As the collected data is in JSON format, a MongoDB database
has been used for its storage. The data gathered has been
split and pushed into MongoDB in the form of documents.
MongoDB is an open-source document database that is well
suited to storing JSON-like structures.
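The "split and push" step could look roughly like the sketch below.
It assumes only that the target object offers `insert_many`, as a
pymongo collection does; the batch size and the in-memory stand-in
class are illustrative assumptions so that the example runs without
a MongoDB server.

```python
def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def store_records(collection, records, batch_size=1000):
    """Insert JSON records into a collection batch by batch.

    `collection` is assumed to behave like a pymongo Collection, i.e.
    to offer insert_many(list_of_dicts). Returns the number sent.
    """
    sent = 0
    for batch in chunked(records, batch_size):
        collection.insert_many(batch)
        sent += len(batch)
    return sent


class InMemoryCollection:
    """Stand-in used only so the sketch runs without a MongoDB server."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


demo = InMemoryCollection()
count = store_records(demo, [{"id": i} for i in range(2500)], batch_size=1000)
print(count, len(demo.documents))
```

With a running server, `demo` would be replaced by a real collection
such as `pymongo.MongoClient()["crime"]["complaints"]`, where the
database and collection names are again hypothetical.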
C. Data Preprocessing
This is the most important step of the end-to-end process.
In this step, the records have been fetched from MongoDB
and converted to a dataframe for cleaning. All the
preprocessing and transformation has been done using pandas
dataframes.
• Feature Selection: All columns other than those required
for the analysis are dropped. Columns such as Gender,
Borough (Administrative District) and Age Group have
been selected for the analysis.
• Feature Calculation: In the Prison Admissions dataset,
the age data present is continuous in nature whereas the
other three datasets contain categorical age data. A new
column has been added to the dataset to capture the age
in the form of categorical values that match the other
3 datasets. Borough column has been introduced and
boroughs corresponding to the county data present in the
dataset have been populated. The complaints, arrests and
court summons datasets have a date column, which has
been used to calculate the day and month.
• Missing Data and NA Values: For features containing a
higher proportion of missing values, data has been
imputed based on the normal distribution, as indicated by
distribution plots, to avoid loss of essential data. Rows
have been dropped for features containing a smaller
proportion of NA or missing values.
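The feature calculation step can be illustrated with a short pandas
sketch. The column names (`age`, `county`, `arrest_date`) and the
age bands are assumptions; the text only confirms that a 25-44
category exists. "Day" is taken here to mean day of the week.

```python
import numpy as np
import pandas as pd

# Assumed age bands; only the "25-44" band is confirmed by the text.
AGE_BINS = [-np.inf, 17, 24, 44, 64, np.inf]
AGE_LABELS = ["<18", "18-24", "25-44", "45-64", "65+"]

# The five New York City counties and their boroughs.
COUNTY_TO_BOROUGH = {
    "NEW YORK": "Manhattan",
    "KINGS": "Brooklyn",
    "QUEENS": "Queens",
    "BRONX": "Bronx",
    "RICHMOND": "Staten Island",
}


def add_age_group_and_borough(prison):
    """Return a copy with categorical age group and borough columns."""
    out = prison.copy()
    out["age_group"] = pd.cut(out["age"], bins=AGE_BINS, labels=AGE_LABELS)
    out["borough"] = out["county"].str.upper().map(COUNTY_TO_BOROUGH)
    return out


def add_day_and_month(df, date_column):
    """Return a copy with day of week and month taken from a date column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_column])
    out["day"] = dates.dt.day_name()
    out["month"] = dates.dt.month
    return out


prison = pd.DataFrame({"age": [17, 18, 30, 44, 45, 70],
                       "county": ["Kings", "Bronx", "New York",
                                  "Queens", "Richmond", "Kings"]})
prison = add_age_group_and_borough(prison)
arrests = add_day_and_month(pd.DataFrame({"arrest_date": ["2018-05-14"]}),
                            "arrest_date")
print(prison[["age", "age_group", "borough"]])
print(arrests)
```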

D. Structured Data Storage
In this step, the pre-processed data in the dataframe has
been converted to CSV. To store this clean data, which is
in a structured format, a PostgreSQL database has been used.
PostgreSQL is an open-source relational database that is well
suited to storing structured data.
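A minimal sketch of the dataframe to CSV to relational table round
trip is given below. To keep it runnable without a database server,
an in-memory SQLite connection is used as a stand-in for PostgreSQL;
the table name `arrests` and the sample rows are made up for
illustration.

```python
import io
import sqlite3

import pandas as pd

clean = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Bronx"],
    "gender": ["M", "F", "M"],
    "age_group": ["25-44", "18-24", "25-44"],
})

# Step 1: serialise the cleaned dataframe to CSV (a buffer stands in
# for a file on disk).
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Step 2: load the CSV into a relational table. SQLite is a stand-in
# here; for PostgreSQL, `conn` would instead be a SQLAlchemy engine
# pointing at that server.
conn = sqlite3.connect(":memory:")
pd.read_csv(buffer).to_sql("arrests", conn, index=False, if_exists="replace")

# Step 3: read the structured data back into pandas for analysis.
result = pd.read_sql_query(
    "SELECT borough, COUNT(*) AS n FROM arrests "
    "GROUP BY borough ORDER BY borough",
    conn,
)
print(result)
conn.close()
```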
E. Visualizations and Analysis
In this step, the data has been extracted into pandas
dataframe from PostgreSQL database. This data is used for
further analysis and visualizations. Various visualizations are
created such that they answer the proposed objectives and
research question. All the steps are carried out using the
Python programming language. Python is an open-source,
easy-to-use language that provides a variety of packages for
analyzing and visualizing data. To create the visualizations,
Python packages such as Matplotlib, Seaborn and Altair have
been used. Seaborn is built on top of Matplotlib. Altair is
another user-friendly Python package used for visualizations.
The process has been programmed to accommodate user input:
the year of interest is taken as an input from the user. The
code has been written to accommodate any data with the same
structure. GitHub has been used by the entire team as a
version control tool for sharing and maintaining the code,
data and visualization results throughout the project.
IV. RESULTS
This section covers the visualizations and the results
obtained from the analysis described above.
From Fig. 2, we can observe that the number of complaints
received by the New York Police Department is the highest,
followed by the number of arrests made by the department.
However, the numbers of court summons and incarcerations
are significantly lower than the other two.
Fig. 2. Monthly Crime Count
As the area chart in Fig. 2 gives just an overall trend, we
have plotted the individual trends for all the 4 datasets for
a more detailed analysis. The line chart in Fig. 3
helps us deduce the trend for the year 2018. Here, we can see
an overall decrease in counts as the year progresses
for all the 4 categories. However, complaints are at their
highest during the months of May to August and then decline
towards the end of the year.
Fig. 3. Monthly Crime Count - Individual Analysis
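The monthly counts behind charts of this kind could be computed as
in the sketch below. It assumes each dataset already carries a
numeric `month` column (1 to 12); the two small dataframes are
made-up stand-ins for the real data.

```python
import pandas as pd


def monthly_counts(stages):
    """Count records per month for each stage.

    `stages` maps a stage name to a dataframe with a "month" column
    (1-12). Returns one row per month and one column per stage.
    """
    counts = {name: df["month"].value_counts() for name, df in stages.items()}
    table = pd.DataFrame(counts).reindex(range(1, 13)).fillna(0).astype(int)
    table.index.name = "month"
    return table


stages = {
    "complaints": pd.DataFrame({"month": [1, 1, 5, 5, 5, 12]}),
    "arrests": pd.DataFrame({"month": [1, 5, 5, 12]}),
}
table = monthly_counts(stages)
print(table.loc[[1, 5, 12]])
```

With Matplotlib installed, `table.plot.area()` and `table.plot()`
would give an area chart and a line chart of the same table.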
As observed in Fig. 4, the top 10 crimes for complaints and
arrests are very similar. ‘Petit Larceny’ is at the top in
complaints and takes the 3rd spot in arrests. Similarly,
‘Assault 3’ appears in the top 3 in both categories. However,
if we look at the court summons and prison categories, we can
see that the top 10 crimes are very dissimilar to those for
complaints and arrests. Court summons are dominated by crimes
like ‘Motor vehicle Safety Regulations’ and ‘Marijuana
Possessions’, while incarcerations are mostly for violent
categories of crimes such as ‘Possession of Weapons’ and
‘Robbery’.
Fig. 4. Top 10 Crimes
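A ranking of this kind, and the overlap between two rankings, can
be obtained with a few lines of pandas. The column name `offence`
and the sample records are illustrative assumptions.

```python
import pandas as pd


def top_offences(df, column, n=10):
    """Return the n most frequent values of `column` with their counts."""
    return df[column].value_counts().head(n)


def shared_offences(top_a, top_b):
    """Offence names present in both rankings, sorted alphabetically."""
    return sorted(set(top_a.index) & set(top_b.index))


complaints = pd.DataFrame({"offence": [
    "PETIT LARCENY", "ASSAULT 3", "PETIT LARCENY",
    "HARASSMENT 2", "PETIT LARCENY", "ASSAULT 3"]})
arrests = pd.DataFrame({"offence": [
    "ASSAULT 3", "PETIT LARCENY", "ASSAULT 3",
    "ROBBERY", "ASSAULT 3", "PETIT LARCENY"]})

top_complaints = top_offences(complaints, "offence", n=2)
top_arrests = top_offences(arrests, "offence", n=2)
print(top_complaints)
print(shared_offences(top_complaints, top_arrests))
```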
Fig. 5 gives the total count in each of the categories
by different boroughs of NYC. Here we can see that Brooklyn
has the highest number of complaints and arrests, whereas
Manhattan leads in court summons and prison admissions.
Staten Island appears to be the safest of all the boroughs,
having the lowest counts in each of the categories.
Fig. 5. Count by Borough
The above analysis does not give an accurate picture of
the proportions, as the population of the boroughs has not
been accounted for. Hence, we plotted the same chart taking
into consideration the population of the boroughs. The count
per capita has been calculated by dividing the individual
count by the population of each borough. The updated
plot can be seen in Fig. 6. We can now notice that the Bronx
actually has the highest number of complaints and arrests per
capita, although Manhattan still leads the way in court summons
and incarcerations. Earlier, we had deemed Staten Island to be
the safest borough. However, we can now see that Staten Island
is the 2nd safest and Queens takes its place as the safest
borough in NYC.
Fig. 6. Count per Capita by Borough
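The per capita calculation, and the way it can reorder a ranking,
is shown in the sketch below. The borough names, counts and
populations are deliberately made up; they are not the 2018 values.

```python
import pandas as pd

# Made-up counts and populations, chosen only to show the effect of
# scaling by population.
counts = pd.Series({"Borough A": 900, "Borough B": 600, "Borough C": 100})
population = pd.Series({"Borough A": 3000, "Borough B": 1500, "Borough C": 500})

per_capita = counts / population          # aligned on the borough name
per_1000 = (per_capita * 1000).round(1)   # easier to read than fractions

print(counts.sort_values(ascending=False).index.tolist())
print(per_1000.sort_values(ascending=False).index.tolist())
```

Borough A has the largest raw count, but Borough B comes first once
the counts are divided by population.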
The stacked bar charts in Fig. 7 and Fig. 8 show an analysis
of the crimes committed by age group and gender in each of
the categories by borough.
It can be inferred from these 2 figures that people belonging
to the age group ‘25-44’ commit the highest number of crimes
in every category and in every borough, and that more of the
crimes are committed by the male members of the society
as compared to females.
Fig. 7. Count by Age
Fig. 8. Count by Gender
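Counts of this shape (borough against age group or gender) are a
cross-tabulation. A small sketch with made-up records and assumed
column names follows.

```python
import pandas as pd

arrests = pd.DataFrame({
    "borough": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "age_group": ["25-44", "25-44", "18-24", "25-44", "45-64"],
    "gender": ["M", "M", "F", "M", "F"],
})

# Rows are boroughs, columns are categories, cells are record counts.
by_age = pd.crosstab(arrests["borough"], arrests["age_group"])
by_gender = pd.crosstab(arrests["borough"], arrests["gender"])

# Share of each gender within a borough (each row sums to 1).
gender_share = pd.crosstab(arrests["borough"], arrests["gender"],
                           normalize="index")
print(by_age)
print(gender_share.round(2))
```

With Matplotlib installed, `by_age.plot.bar(stacked=True)` would
draw a stacked bar chart from such a table.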
Fig. 9. Heat Map of Arrests
We have also plotted a heatmap, as seen in Fig. 9, to capture
the areas where arrests were high in number. It shows that
the areas along the borders have fewer arrests, whereas
higher numbers of arrests are concentrated in the city centers.

V. CONCLUSIONS AND FUTURE WORK
We visualized all the datasets collected from multiple sources
and tried to find a relationship between criminal complaints
and judicial conviction. The visualizations provided a clear
picture of the proportion of each of the individual entities.
We found that the number of arrests made amounts to around
two-thirds of the number of criminal complaints lodged. It
could be inferred that an average of 2 arrests
were made by New York police for every 3 complaints lodged
and the trend remains similar throughout the year. The most
common types of crimes for the two entities are observed to
be violence and harassment. Both complaints
and arrests dropped by the end of the year.
However, we find a steep drop in the numbers when it
comes to court proceedings. Court summons amount to less
than 30% of the arrests. It could be inferred
that the cases might have been withdrawn by the victims or that
the arrests did not go to court. In the court summons dataset,
the most frequent crime types differ from those in the
complaints and arrests.
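Stage-to-stage proportions of this kind are simple ratios of
consecutive counts. The sketch below applies the calculation to the
approximate raw row counts quoted in the Data Collection section.
Those are rounded, pre-cleaning figures, so the resulting ratios are
only indicative and need not coincide with the proportions read from
the charts.

```python
def stage_ratios(counts):
    """Ratio of each stage's count to the count of the stage before it.

    `counts` is an ordered list of (stage_name, count) pairs.
    """
    ratios = {}
    for (prev_name, prev), (name, current) in zip(counts, counts[1:]):
        ratios[f"{name}/{prev_name}"] = round(current / prev, 3)
    return ratios


# Approximate raw row counts from the Data Collection section.
raw_counts = [
    ("complaints", 450_000),
    ("arrests", 250_000),
    ("court summons", 89_000),
    ("prison admissions", 19_000),
]
ratios = stage_ratios(raw_counts)
print(ratios)
```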
From the statistical graphs, it was also inferred that around
70% of the summons result in conviction by the court. When
it comes to the profile of offenders, it was observed that males
commit more crimes than females across all the
boroughs. Most arrests and complaints involve
young adults and middle-aged people (age group 25-44 years).
Of the 5 boroughs in New York, the Bronx and Manhattan are
comparatively unsafe, as their per capita crime rates are higher
than those in Queens and Staten Island.
However, our research was limited to New York City,
where we analyzed data only for the year 2018. Additional
datasets are required to draw a firm conclusion about the crime
pattern and criminal justice across the US. Data explaining
why arrests were not made for the complaints
lodged, why arrests were not taken to court or why the accused
was not imprisoned, along with details of date and time of entry,
could help gain insights into what factors specifically affect the
proportions. Considering all the information available, a more
accurate crime prediction could be performed using machine
learning methods.
REFERENCES
[1] A. A. Alkhaibari and Ping-Tsai Chung. Cluster analysis for reducing
city crime rates. In 2017 IEEE Long Island Systems, Applications and
Technology Conference (LISAT), pages 1–6, May 2017.
[2] S. Bayoumi, S. AlDakhil, E. AlNakhilan, E. A. Taleb, and H. AlShabib.
A review of crime analysis and visualization. Case study: Maryland State,
USA. In 2018 21st Saudi Computer Society National Computer Conference
(NCC), pages 1–6, April 2018.
[3] M. Feng, J. Zheng, J. Ren, A. Hussain, X. Li, Y. Xi, and Q. Liu. Big data
analytics and mining for effective visualization and trends forecasting of
crime data. IEEE Access, 7:106111–106123, 2019.
[4] Hossein Hassani, Xu Huang, Emmanuel Silva, and Mansi Ghodsi. A
review of data mining applications in crime. Statistical Analysis and
Data Mining, 9, 04 2016.
[5] Quist-Aphetsi Kester. Visualization and analysis of geographical crime
patterns using formal concept analysis. International Journal of
Remote Sensing and Geoscience (IJRSG), 2, 07 2013.
[6] C. Ku, J. H. Nguyen, and G. Leroy. Tasc - crime report visualization
for investigative analysis: A case study. In 2012 IEEE 13th International
Conference on Information Reuse and Integration (IRI), pages 466–473, Aug
2012.
[7] Darshan Shah and Ryan Leonard. San Francisco crime visualization.
International Journal of Computer Applications, 181:13–19, 07 2018.
[8] Samiullah Shah, Vijdan Khalique, Salahuddin Saddar, and Naeem Ma-
hoto. A framework for visual representation of crime information. Indian
Journal of Science and Technology, 10:1–8, 12 2017.
[9] B. Sivanagaleela and S. Rajesh. Crime analysis and prediction using fuzzy
c-means algorithm. In 2019 3rd International Conference on Trends in
Electronics and Informatics (ICOEI), pages 595–599, April 2019.
