Internship Report: Sourabh Gujar
Name: Sourabh Gujar
Position: Database Research Analyst
Company: Portland Cement Association
Name of supervisor: William Jay Hall
Data was long undervalued. Machines around the globe generated millions of logs, which were stored for a few months and then deleted. Soon, people began realising that this data could hold insights that were going largely unexplored. Fast forward to today, and we are all about data. Organisations have recognised the potential that lies within it, and multiple technologies are being developed to master the art of data exploration and mining. I happened to learn one of them, namely SQL.
During my undergraduate years, I focused on data-oriented courses, and my senior-year project cemented my inclination towards databases. That is when I decided I would primarily focus on databases. In my Master's program, the Database Design course in the Fall of 2015 equipped me well for this internship. Further on, with Advanced Databases under my belt from the Winter quarter, I had a clear vision of, and a strong grasp of, how to manipulate databases with SQL and PL/SQL. These two courses from DePaul University played an instrumental role in securing this internship.
As an intern at the Portland Cement Association, my primary role as a Data Analyst was to analyse large amounts of data and generate easy-to-read graphs. The data was fire data from all over the United States, recorded yearly. The National Fire Incident Reporting System, abbreviated as NFIRS, is an application developed by the U.S. Fire Administration as a means of assessing the nature and scope of the fire problem in the United States. It had its inception in early 1976 and has been growing in participation ever since. The gist of NFIRS is describing the nature of a call, the actions taken by firefighters in response to the call, and the end results. The end results are further divided into the number of casualties, both civilian and firefighter, along with the loss of property estimated in dollars. This data was filled in manually on paper or using an application. Other agencies gathered and compiled this information into an annual report, which was distributed on DVDs on request. The amount of data generated every year by NFIRS is incredible; it is nothing like I have worked on before:
● Year started: 1976
● Fire departments participating: 23,000
● Number of years of data in hand: 15 years (2000-2014)
● Number of modules each year: 11
● Number of incidents per year: more than 2 million
As the aforementioned numbers signify, the data is staggering in terms of size. Each year's data is around 1.60 GB, and there are 15 years of it. Given the size of the data, generating the end results involved multiple steps. The following cycle illustrates those steps:
Step 1: Obtaining the data from NFIRS.
The data is generated and published yearly by the National Fire Incident Reporting System in the form of DVDs containing all 11 modules for the given year. These DVDs can be requested on their website:
https://www.usfa.fema.gov/data/statistics/order_download_data.html#tools
Step 2: Conversion of data.
The data obtained from the discs is in DBF format, which stands for Database File. The development environment I used for firing SQL queries, Oracle SQL Developer, does not support DBF. This was a challenge, because converting the DBF files into Excel sheets had its limitations: an Excel sheet cannot hold more than 1,048,576 rows, fewer than the 2 million or more incidents recorded each year, and the conversion was proving inconsistent. To resolve this issue, I relied on a freeware tool called Delimit, which converted the DBF files into CSV (comma-separated values) files that could fit all the records in each module.
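Delimit handled the conversion for me, but the underlying transformation is simple enough to sketch. Below is a minimal, hypothetical Python reader for dBase III-style files containing only character fields; the field names in the test data are illustrative, and a real conversion would use a dedicated tool or library rather than this sketch:

```python
import csv
import io
import struct

def dbf_to_csv(data: bytes) -> str:
    """Convert a dBase III .dbf byte string with character fields to CSV text."""
    # Header: bytes 4-7 = record count, 8-9 = header size, 10-11 = record size.
    n_records, header_size, record_size = struct.unpack_from("<IHH", data, 4)

    # Field descriptors are 32 bytes each, starting at offset 32,
    # and the descriptor block is terminated by a 0x0D byte.
    fields, pos = [], 32
    while data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00")[0].decode("ascii")
        length = data[pos + 16]          # field width in bytes
        fields.append((name, length))
        pos += 32

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _ in fields])

    pos = header_size
    for _ in range(n_records):
        record = data[pos:pos + record_size]
        if record[:1] != b"*":           # '*' flags a deleted record
            offset, row = 1, []          # skip the 1-byte deletion flag
            for _, length in fields:
                row.append(record[offset:offset + length].decode("ascii").strip())
                offset += length
            writer.writerow(row)
        pos += record_size
    return out.getvalue()
```

This only illustrates the format; numeric, date, and memo fields, code pages, and malformed headers are exactly the edge cases that made a tested tool like Delimit the safer choice.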
Step 3: Loading of data.
Once the data was converted to CSV, it was imported into the development environment. Prior to importing, tables were created as per the requirements of each module. I maintained a naming convention for each schema in Oracle SQL Developer so as to avoid confusion, since each of the 15 years has 4 or more modules. For instance, the fire incident module, which holds the details of each fire incident in the year 2010, has the schema name fireincident2010.
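The loading step can be sketched as follows, using Python's built-in sqlite3 as a stand-in for Oracle SQL Developer; the column names and sample rows are illustrative, but the per-year table naming convention is the one described above:

```python
import csv
import io
import sqlite3

# A small stand-in for one year's converted CSV (columns are illustrative).
csv_text = "INC_NO,STATE,AES_OPER\n00001,IL,1\n00042,DC,4\n"

conn = sqlite3.connect(":memory:")
year = 2010

# One table per module per year: fireincident2010, fireincident2011, ...
table = f"fireincident{year}"
conn.execute(f"CREATE TABLE {table} (INC_NO TEXT, STATE TEXT, AES_OPER TEXT)")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the CSV header row
conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", reader)
conn.commit()

count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Keeping the year in the table name is what later lets a single query join or union the same module across all 15 years without ambiguity.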
Step 4: Data Cleansing.
In my opinion, this is one of the major steps in making sure that the end results are accurate. Without removing the inaccuracies in the data, the results would be inconsistent and could have repercussions. I found the NFIRS fire data to be one of the more widely maintained datasets, although it still needed to be checked for inconsistencies. There was a very prominent erratum in the dataset having to do with the number of states in the United States of America. There are 50 states, yet when I filtered the NFIRS data by unique states, it showed 56. When I cross-checked the names of the states, it turned out that some of them were recorded as '1', '2', or some other random number, which was clearly incorrect. The reason for this cannot be established, since it may have been caused by the data conversion or by incomplete data itself.
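The state-code check described above can be sketched as a simple filter; the sample rows here are illustrative, and real NFIRS data also carries codes for D.C. and territories, which is part of why more than 50 values appear:

```python
# Postal abbreviations for the 50 states, plus DC (which NFIRS reports).
VALID_STATES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID",
    "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS",
    "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV",
    "WI", "WY", "DC",
}

def clean_rows(rows):
    """Drop rows whose STATE field is not a recognised code (e.g. '1', '2')."""
    return [row for row in rows if row["STATE"] in VALID_STATES]

rows = [
    {"INC_NO": "00001", "STATE": "IL"},
    {"INC_NO": "00002", "STATE": "1"},   # artifact: numeric "state" code
    {"INC_NO": "00003", "STATE": "DC"},
]
cleaned = clean_rows(rows)
```

In practice the same filter was expressed as a WHERE clause against the loaded schemas rather than in application code.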
Step 5: Firing queries.
Oracle SQL Developer allows me to manipulate the data using Structured Query Language. The results we needed required firing queries across multiple schemas, which proved to be a challenge at times because of the sheer size of the data and the number of modules for each of those 15 years. The wait time for running each query was significant compared to smaller datasets, and the queries tended to get complex because multiple schemas were involved in the same query. It was quite a brush-up on the joins concept from my database courses. My supervisor would phrase a question in layman's terms, and I would fetch him the relevant results from the National Fire Incident Reporting System database in Oracle SQL Developer. One such question was as follows:
Number of fire incidents that took place in the United States over the past 15 years where the Automatic Extinguishing System failed.
Query:
SELECT COUNT(DISTINCT INC_NO) AS NUMBER_OF_INCIDENTS, STATE
FROM (
    SELECT INC_NO,
           TO_DATE(TO_CHAR(INC_DATE, '09099999'), 'MM/DD/YYYY') AS INCIDENTDATE,
           STATE, FDID, AES_OPER
    FROM FIREINCIDENT2010
    WHERE STRUC_TYPE = '1'
      AND BLDG_ABOVE > '0'
      AND STRUC_STAT = '2'
      AND (AES_PRES = '1' OR AES_PRES = '2')  -- AES present or partial system present
    GROUP BY INC_NO, TO_DATE(TO_CHAR(INC_DATE, '09099999'), 'MM/DD/YYYY'),
             STATE, FDID, AES_OPER
)
WHERE AES_OPER = '2'   -- AES partially worked
   OR AES_OPER = '4'   -- AES failed
   OR AES_OPER = '0'
   OR AES_OPER = 'U'
GROUP BY STATE
ORDER BY STATE;
Graph generated from the results using Tableau:
From the graph, it can be observed that for the years 2000-2014 the number of incidents where the Automatic Extinguishing System failed was highest in Washington D.C.
Step 6: Generating graphs.
Apart from my major work in SQL Developer, I was keen on working with Tableau, a graph-generating application. I also tried Microsoft's Power BI, which is a great tool for generating and publishing graphs for businesses. The results generated by the queries were schemas holding millions of records, which I could not possibly have exported into Excel. Instead, I extracted the count for every state across all 15 years, which was much easier to graph. Below are other graphs which I generated using Tableau:
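The summarisation idea just described, exporting one count per state instead of millions of raw rows, can be sketched as follows (the rows are illustrative):

```python
from collections import Counter

# Instead of exporting millions of raw incident rows, summarise them first:
# one (state -> incident count) pair per state is all a graph needs.
incidents = [
    {"INC_NO": "00001", "STATE": "DC"},
    {"INC_NO": "00002", "STATE": "DC"},
    {"INC_NO": "00003", "STATE": "IL"},
]
counts = Counter(row["STATE"] for row in incidents)
```

In my workflow the aggregation happened in SQL (COUNT ... GROUP BY STATE), so only the small summary table ever left the database for Tableau or Power BI.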
These graphs conveyed a large amount of information extracted from an even bigger dataset of more than 30 GB. They were used by my organisation, the Portland Cement Association, in advocating the use of cement, by providing them to various state legislatures around the United States of America.
The courses I undertook at DePaul University equipped me well for my internship, which was primarily based on Structured Query Language. I feel that this internship has proved extremely beneficial, not only in terms of adding stars to my resume but also in the sheer amount of experience it has added to my career.
Tools that I used during my internship at Portland Cement Association:
● Oracle SQL Developer
● Tableau
● Microsoft’s Power BI
● Delimit
Apart from work-related experience, the office also held a weekly interns' meeting, attended by all the interns, including myself, and a guest speaker. Multiple points were discussed, encompassing career goals, motivational stories, and resume tips. Every week a different speaker would talk about their personal experience and help newcomers like us grow in such a dynamic environment. In my opinion, these sessions were very insightful and will help us in the long run.
It has always been a dream of mine to do my best for the betterment of humanity. My internship at the Portland Cement Association has in a way helped me move a step forward: my graphs may in some small way help reduce the number of fires and eventually save lives by changing laws and advocating the use of cement in building houses. I understand that this is somewhat far-fetched, but it is a step towards my dream.