Name: Sourabh Gujar 
Position: Database Research Analyst 
Company: Portland Cement Association  
Name of supervisor: William Jay Hall 
 
Data was long undervalued. Machines around the globe generated millions of logs, which were stored for a few months and then deleted. Soon, people started realising that this data could hold insights that were going unexplored. Fast forward to today, and we are all about data: organisations have realised the potential that lies within it, and multiple technologies are being developed for mastering the art of data exploration and mining. I happened to learn one of them, namely SQL.
During my undergraduate years, I focused on data-oriented courses, and my senior year project cemented my inclination towards databases. That is when I decided to focus primarily on databases. In my Master's program, the Database Design course in the Fall of 2015 equipped me well for this internship; further, with Advanced Databases under my belt from the Winter quarter, I had a clear vision of, and a strong grasp of, how to manipulate databases with SQL and PL/SQL. These two courses from DePaul University played an instrumental role in securing this internship.
As an intern at Portland Cement Association, my primary role as a Data Analyst was to analyse large amounts of data and generate easy-to-read graphs. The data in question was fire data from all over the United States, recorded yearly. The National Fire Incident Reporting System, abbreviated as NFIRS, is an application developed by the U.S. Fire Administration as a means of assessing the nature and scope of the fire problem in the United States. It had its inception in early 1976 and has been growing in terms of participation ever since. The gist of NFIRS is describing the nature of each call, the actions taken by firefighters in response to the call, and the end results. The end results are further divided into the number of casualties, both civilians and firefighters, along with the loss of property estimated in dollars. This data is filled in manually on paper or using an application. Other agencies gather and compile this information into an annual report, which is distributed on DVDs on request. The amount of data generated every year by NFIRS is incredible; it is nothing like anything I have worked on before:
 
Year started: 1976 
Fire departments participating: 23,000 
Number of years of data in hand: 15 years (2000-2014)
Number of modules each year: 11 
Number of incidents per year: More than 2 million  
 
As the numbers above indicate, the data is staggering in terms of size: each year's data is around 1.60 GB, and there are 15 years of it.
Given the size of the data, generating the end results involved multiple steps. The following cycle illustrates them:
 
Step 1: Obtaining the data from NFIRS.
The data is generated and published yearly by the National Fire Incident Reporting System in the form of DVDs, which contain all 11 modules for the given year. These DVDs can be requested on their website:
https://www.usfa.fema.gov/data/statistics/order_download_data.html#tools 
 
Step 2: Conversion of data.
The data obtained from the discs is in the DBF format, which stands for Database File. The development environment I used for firing SQL queries, Oracle SQL Developer, does not support DBF. This was a challenge, because converting the DBF files into Excel sheets had its limitations: an Excel sheet could not accommodate the more than 2 million records per module and was proving inconsistent. To resolve this issue, I relied on a freeware tool called Delimit, which converted the DBF files into CSV (comma-separated values) files that could fit all the records in each module.
 
Step 3: Loading of data.
Once the data was converted into CSV, it was imported into the development environment. Prior to importing, tables were created as per the requirements of each module. I maintained a naming convention for each schema in Oracle SQL Developer so as to avoid confusion, since each of the 15 years has 4 or more modules. For instance, the fire incident module, which holds the details of each fire incident in the year 2010, has its schema named fireincident2010.
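
As an illustration, here is a minimal sketch of the kind of table definition created before import. The column list is abbreviated and the data types are my assumptions, based on the fields referenced in the query later in this report:

CREATE TABLE fireincident2010 (
    INC_NO      VARCHAR2(12),  -- incident number
    INC_DATE    NUMBER(8),     -- incident date, stored numerically as MMDDYYYY
    STATE       VARCHAR2(2),   -- state code
    FDID        VARCHAR2(5),   -- fire department identifier
    STRUC_TYPE  VARCHAR2(1),   -- structure type
    STRUC_STAT  VARCHAR2(1),   -- structure status
    BLDG_ABOVE  VARCHAR2(3),   -- storeys above grade
    AES_PRES    VARCHAR2(1),   -- automatic extinguishing system presence
    AES_OPER    VARCHAR2(1)    -- automatic extinguishing system operation
);

Each yearly CSV could then be loaded into its matching table (SQL Developer provides an import wizard for CSV files).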
 
 
Step 4: Data Cleansing.
In my opinion, this is one of the major steps in making sure that the end results are accurate: without removing the inaccuracies in the data, the results would be inconsistent, and there could be repercussions. In the NFIRS data, I found that the fire data was one of the more widely maintained datasets, although it still needed to be checked for inconsistencies. One very prominent error in the dataset had to do with the number of states in the United States of America. There are 50 states in the USA, yet when I filtered the NFIRS data by unique states, 56 showed up. When I cross-checked the names of the states, I noticed that some of them were recorded as '1', '2', or some other number, which was clearly incorrect. The reason for this cannot be established, since it may have been caused by the data conversion or by incomplete data itself.
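
A minimal sketch of the kind of check that surfaces such bad codes; it assumes the fireincident2010 table from Step 3, and the second statement is a hypothetical way to exclude the offending rows from analysis:

-- List every distinct state code with its record count;
-- purely numeric values stand out immediately as invalid.
SELECT STATE, COUNT(*) AS RECORD_COUNT
FROM FIREINCIDENT2010
GROUP BY STATE
ORDER BY STATE;

-- Keep only rows whose state code is not purely numeric.
SELECT *
FROM FIREINCIDENT2010
WHERE NOT REGEXP_LIKE(STATE, '^[0-9]+$');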
 
Step 5: Firing queries.
The development environment, Oracle SQL Developer, allowed me to manipulate the data using Structured Query Language. The required results called for firing queries on multiple schemas, which proved to be a challenge at times because of the sheer size of the data and the number of modules for each of those 15 years. The wait time for running each query was significant compared to smaller datasets, and the queries tended to get complex because of the multiple schemas involved in the same query; it was quite a brush-up of the joins concept from databases. My supervisor would ask me a question in layman's terms and I would fetch him the relevant results from the National Fire Incident Reporting System database in Oracle SQL Developer. One such question is as follows:
Number of fire incidents which took place in the United States over the past 15 years where the Automatic Extinguishing System failed.
 
Query: 
SELECT COUNT(DISTINCT INC_NO) AS NUMBER_OF_INCIDENTS, STATE
FROM (
    SELECT INC_NO,
           TO_DATE(TO_CHAR(INC_DATE, '09099999'), 'MM/DD/YYYY') AS IncidentDate,
           STATE, FDID, AES_OPER
    FROM FIREINCIDENT2010
    WHERE STRUC_TYPE = '1'
      AND BLDG_ABOVE > '0'
      AND STRUC_STAT = '2'
      AND (AES_PRES = '1' OR AES_PRES = '2')  -- system present or partial system present
    GROUP BY INC_NO, TO_DATE(TO_CHAR(INC_DATE, '09099999'), 'MM/DD/YYYY'),
             STATE, FDID, AES_OPER
)
WHERE AES_OPER = '2'  -- AES partially worked
   OR AES_OPER = '4'  -- AES failed
   OR AES_OPER = '0'
   OR AES_OPER = 'U'
GROUP BY STATE
ORDER BY STATE;
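
The query above covers the 2010 schema; the same filter was run against each year's table. As a minimal sketch (not the exact production query), two of the yearly tables can be stacked with UNION ALL and the pattern extended through all 15 years:

SELECT COUNT(DISTINCT INC_NO) AS NUMBER_OF_INCIDENTS, STATE
FROM (
    SELECT INC_NO, STATE, AES_OPER FROM FIREINCIDENT2010
    UNION ALL
    SELECT INC_NO, STATE, AES_OPER FROM FIREINCIDENT2011
    -- ...one SELECT per year, 2000 through 2014
)
WHERE AES_OPER IN ('2', '4', '0', 'U')  -- partially worked, failed, or unknown
GROUP BY STATE  -- assumes incident numbers do not repeat across years
ORDER BY STATE;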
 
 
Graph generated from the results using Tableau:
 
 
 
From the above graph, it can be observed that for the years 2000-2014, the number of incidents where the Automatic Extinguishing System failed was highest in Washington, D.C.
 
Step 6: Generating graphs.
Apart from my major work in SQL Developer, I was keen on working with Tableau as well, a graph-generating application. I also tried Microsoft's Power BI, which is a great tool for generating and publishing graphs for businesses. The results generated by the queries lived in schemas holding millions of records, which I could not possibly have exported into Excel. Instead, I extracted the count for every state across all 15 years, which was much easier to graph; a sketch of that extraction appears below, followed by the other graphs I generated using Tableau:
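
A minimal sketch of the per-state count extraction, assuming the yearly tables from Step 3; tagging each row with its year as a literal yields one small, graph-ready result set for all 15 years:

-- Per-state incident counts, one row per state per year.
SELECT 2010 AS FIRE_YEAR, STATE, COUNT(DISTINCT INC_NO) AS NUMBER_OF_INCIDENTS
FROM FIREINCIDENT2010
GROUP BY STATE
UNION ALL
SELECT 2011 AS FIRE_YEAR, STATE, COUNT(DISTINCT INC_NO) AS NUMBER_OF_INCIDENTS
FROM FIREINCIDENT2011
GROUP BY STATE
-- ...repeated for each year, 2000 through 2014
ORDER BY FIRE_YEAR, STATE;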
 
 
  
 
These graphs conveyed a large amount of information extracted from an even larger dataset of more than 30 GB. My organisation, Portland Cement Association, used these graphs to advocate the use of cement, providing them to various state legislatures around the United States of America.
The courses I undertook at DePaul University equipped me well for my internship, which was primarily based on Structured Query Language. I feel that this internship has proved extremely beneficial, not only in terms of adding stars to my resume but also in the sheer amount of experience it has added to my career.
Tools that I used during my internship at Portland Cement Association: 
● Oracle SQL Developer 
● Tableau 
● Microsoft’s Power BI 
● Delimit 
 
Apart from the work-related experience, the office also held a weekly interns' meeting, attended by all the interns, including myself, and a guest speaker. Multiple topics were discussed, encompassing career goals, motivational stories, and resume tips. Every week a different speaker would talk about their personal experience and help newcomers like us grow in such a dynamic environment. In my opinion, these sessions were very insightful and will help us in the long run.
It has always been a dream of mine to do my best for the betterment of humanity. My internship at Portland Cement Association has, in a way, helped me take a step forward: my graphs may in some small way help reduce the number of fires and eventually save lives, by changing laws and advocating the use of cement in building houses. I understand that this is somewhat far-fetched, but it is a step towards my dream.
 
 
