Analyze open data of Chicago city data portal
1. COMP 7/8150
Data Science I
Sorting out Leaky Records in Payments Log
Kishor Datta Gupta
Computer Science
Goal/Scope
Develop a classifier for leaky and non-leaky data.
The Chicago City Council releases its operational logs daily.
I will try to identify leaky and non-leaky records in its daily purchase
logs.
Lit Review
“The Government should provide opportunities for citizens to
participate in decision-making processes by harnessing collective
knowledge of the society”
“A primary goal of open government, transparency means
disclosure of information about official decisions and activity in
forms that citizens can easily read and use”
In this respect, the Chicago open data web portal has released more than
800 data sets in various machine-readable formats, such as tables, plain
text, or maps, covering various activities of the city authorities.
Leaky Records
Example Daily Payments Record
“Day 06-09-2017 For Invoice PVCI17CI018100 paid 2884$ from DEPT OF GENERAL SERVICES reference contract no 26775”
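A record in this sentence form can be split back into named fields before any classification. The following is a minimal sketch, assuming every record follows exactly the wording of the example above; the pattern `RECORD_RE` and the function `parse_payment` are hypothetical names, not part of the project. The date is kept as a string because the slide does not say whether it is day-month or month-day.

```python
import re

# Hypothetical pattern: assumes every record uses the exact wording of the example.
RECORD_RE = re.compile(
    r"Day (?P<date>\d{2}-\d{2}-\d{4}) "
    r"For Invoice (?P<invoice>\S+) "
    r"paid (?P<amount>[\d.]+)\$ "
    r"from (?P<department>.+?) "
    r"reference contract no (?P<contract>\d+)"
)


def parse_payment(text):
    """Split one sentence-form payment record into a dict of typed fields."""
    match = RECORD_RE.search(text)
    if match is None:
        raise ValueError("record does not match the expected wording")
    return {
        "date": match.group("date"),  # kept as text: the day/month order is not stated
        "invoice": match.group("invoice"),
        "amount": float(match.group("amount")),
        "department": match.group("department"),
        "contract": int(match.group("contract")),
    }


example = ("Day 06-09-2017 For Invoice PVCI17CI018100 paid 2884$ "
           "from DEPT OF GENERAL SERVICES reference contract no 26775")
record = parse_payment(example)
print(record)
```

Raising on a non-matching record, rather than returning partial fields, keeps malformed log lines from silently entering the data set.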
A leaky record is one that contains information that could support an attack on
Chicago city infrastructure, or whose release would violate HIPAA, CIPA, or other privacy laws. For example:
It reveals information about police and emergency response teams, such as the
Chicago police weapon inventory.
It contains details of a restricted place, such as an airport runway electronic signal
system.
It contains cybersecurity information about the city's day-to-day work, especially in the
medical area, such as data storage facility information.
Raw Data Sample
Daily Purchase Log Example
Invoice | Amount | Date | Department | Contract Number | Vendor Name
PVCI17CI018100 | 2884 | 06-09-2017 | DEPT OF GENERAL SERVICES | 26775 | SOUTHWEST INDUSTRIES
PVCI17CI028443 | 659.56 | 06-09-2017 | DEPT OF GENERAL SERVICES | 30559 | CUMMINS N POWER, LLC
PVCI17CI087902 | 59.58 | 06-09-2017 | DEPT OF GENERAL SERVICES | 33233 | OFFICE DEPOT, INC.
PVCI17CI087954 | 58450 | 06-09-2017 | DEPARTMENT OF POLICE | 25150 | ALLIED SERVICES GROUP, INC.
Data points: 1.24M, updated every day
Raw Data Sample
Contract Information Example
Description | Spec | Rv | Vendor ID | Type | Total Amount | Prc Type
OMP - South Airfield Runway 10R-28L - Site Preparation | 26117 | 119 | 99339 | CONSTRUCTION-AVIATION | 179643.4 | BID
Airfield Lighting Control Vault Improvements - MDW, Spec# 115950, Req# 79785 | 28241 | 17 | 115950 | CONSTRUCTION-AVIATION | 24940 | PRC
Phase 16 Residential Sound Insulation Program ORD -Bid Pkg #2 (200 Homes), Spec# 117222, Req# 81468 | 29398 | 2 | 117222 | CONSTRUCTION-AVIATION | -69837.1 | BID
Data points: 131K, updated every day
Raw Data Sample
Vendor Information Example
Rv | Vendor ID | Type | Vendor Name | Address1 | Address2 | City | State | Zip
119 | 99339 | CONSTRUCTION-AVIATION | TURNER-CONCRETE STRUCTURES-LINDAHL TRI VENTURE | 55 E MONROE ST | | CHICAGO | IL | 60603
17 | 115950 | CONSTRUCTION-AVIATION | DIVANE BROTHERS. ELECTRIC CO. | 2424 N 25TH AVE | | FRANKLIN PARK | IL | 60131
2 | 117222 | CONSTRUCTION-AVIATION | ASBACH & VANSELOW INC | 1000 BROWN STREET | EFT | WAUCONDA | IL | 60084
Data points: 4,989, updated every month (expected)
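In the samples shown, the contract rows and the vendor rows share the same Rv and Vendor ID values, which suggests the two tables can be linked on that pair. The sketch below joins them that way using plain dictionaries. Treating (Rv, Vendor ID) as the join key is an assumption drawn only from the three sample rows, and `join_on_vendor_key` is a hypothetical helper, not the project's code.

```python
# Sample rows copied from the contract and vendor examples; only a few columns kept.
contracts = [
    {"rv": 119, "vendor_id": 99339, "prc_type": "BID", "total_amount": 179643.4},
    {"rv": 17, "vendor_id": 115950, "prc_type": "PRC", "total_amount": 24940.0},
    {"rv": 2, "vendor_id": 117222, "prc_type": "BID", "total_amount": -69837.1},
]
vendors = [
    {"rv": 119, "vendor_id": 99339, "zip": "60603",
     "vendor_name": "TURNER-CONCRETE STRUCTURES-LINDAHL TRI VENTURE"},
    {"rv": 17, "vendor_id": 115950, "zip": "60131",
     "vendor_name": "DIVANE BROTHERS. ELECTRIC CO."},
    {"rv": 2, "vendor_id": 117222, "zip": "60084",
     "vendor_name": "ASBACH & VANSELOW INC"},
]


def join_on_vendor_key(contract_rows, vendor_rows):
    """Inner join on the assumed key (rv, vendor_id)."""
    # Index vendors by key so each contract is matched with one dictionary lookup.
    index = {(v["rv"], v["vendor_id"]): v for v in vendor_rows}
    joined = []
    for contract in contract_rows:
        vendor = index.get((contract["rv"], contract["vendor_id"]))
        if vendor is not None:  # contracts without a matching vendor are dropped
            joined.append({**contract, **vendor})
    return joined


for row in join_on_vendor_key(contracts, vendors):
    print(row["vendor_name"], row["prc_type"], row["zip"])
```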
Data corpus
Population: all data available in the Chicago city data portal until 30
November 2018
Purchase log data points: 1.24M
Cleansed data points: 119,277
I1 I2 I3 I4 Amm Contract Dep Sprc PRC ZIP Speccode SpecNum Result
22 174 101 17 2237.08 33697 16 6 0 60660 65 14 1
127 183 0 0 181184.8 8363 18 7 0 50266 422 34 1
11 181 86 495 59.96 33233 0 0 0 60000 300 10 1
11 181 17 156 3465.25 24932 3 5 1 60101 364 73 1
138 174 0 0 1566.21 19550 12 7 0 60062 263 72 1
188 179 0 0 13720.86 28002 11 2 2 20151 7 90 0
11 181 19 315 9523.5 33233 0 0 0 60000 300 10 1
Data Corpus (continued)
Curated data: 30,000 data points (based on published contract documents
plus manual curation)
Unobserved data: 89,277 data points
Training data: 10,000 data points (7,603 leaky and 2,397 non-leaky)
Testing data: 20,000 data points (15,336 leaky and 4,664 non-leaky)
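The corpus figures can be checked with a few lines of arithmetic. This sketch only recomputes numbers stated on the slides (cleansed, curated, training, and testing counts); the variable and function names are illustrative.

```python
# Counts as stated on the data corpus slides.
cleansed = 119277
curated = 30000
training = {"leaky": 7603, "non_leaky": 2397}
testing = {"leaky": 15336, "non_leaky": 4664}

# Records that were cleansed but never curated.
unobserved = cleansed - curated


def leaky_share(split):
    """Fraction of a split that is labeled leaky."""
    return split["leaky"] / sum(split.values())


print("unobserved:", unobserved)
print("training size:", sum(training.values()))
print("testing size:", sum(testing.values()))
print("leaky share (training):", round(leaky_share(training), 4))
print("leaky share (testing):", round(leaky_share(testing), 4))
```

Both splits are roughly 76% leaky, so a classifier that always answers "leaky" would already score about 76% accuracy on them; this is worth keeping in mind when reading any accuracy figure for this corpus.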
Ontology
Invoice number: Each daily log entry has an invoice number that serves as
the payment description reference.
Examples: CV50165009685, PV85168550294
Specification Code: Every purchase order has a specification code
based on purchase type and description.
Specification Type: Every purchase order has a specification type
based on purchase type and description. There are 49 unique
specification types.
Department: Purchases are made under different departments; there are
56 departments.
Procurement: The way a vendor gets the work order. There are 14
different types, such as BID, Sole source, Joint, etc.
Vendor Specification: Vendor type code.
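Fields such as department and procurement type are categorical, while the cleansed table holds them as integers, so some mapping from names to codes is implied. The slides do not say how the codes were assigned; the sketch below assumes the simplest scheme, numbering each distinct value in order of first appearance. `fit_codes` is a hypothetical helper and its codes will not match the ones in the cleansed table.

```python
def fit_codes(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return codes


# Department and procurement values taken from the raw data samples.
departments = [
    "DEPT OF GENERAL SERVICES",
    "DEPT OF GENERAL SERVICES",
    "DEPT OF GENERAL SERVICES",
    "DEPARTMENT OF POLICE",
]
procurements = ["BID", "PRC", "BID"]

dep_codes = fit_codes(departments)
prc_codes = fit_codes(procurements)

encoded_departments = [dep_codes[d] for d in departments]
encoded_procurements = [prc_codes[p] for p in procurements]
print(encoded_departments)
print(encoded_procurements)
```

Codes produced this way are labels, not magnitudes: a department coded 16 is not "more" than one coded 3, which matters for any model or statistic that treats the column as numeric.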
Case Study
Application developed using public records from the Chicago City
Council
Chicago City Crime is an Android application that gives users a simple, useful tool for
instantly getting crime data based on their current position in Chicago. By becoming more
aware of how and what kinds of crimes have been perpetrated in their area, users can make
informed judgments and take actions that help them and their neighborhood.
Application developed to analyze financial data
Chicago TIF Viewer is a map viewer allowing free access to data and services with
three features: Tax Increment Financing District Information, Ward Contact Information, and US
Census 2010 Unemployment Rates.
[1] Kassen, M. (2013). A promising phenomenon of open data: A case study of the Chicago
open data project. Government Information Quarterly, 30(4), 508-513.
Correlation
library(GGally)  # provides ggpairs()
ggpairs(data=dataa, columns=c("Result","Sprc","PRC","Dep","SPECcode","SpecNum"), title="payment data")
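A pairs plot of this kind reports, among other things, the pairwise correlation between the selected columns. As a rough illustration of that one number, the sketch below computes the Pearson coefficient between Result and three feature columns, using only the seven sample rows shown on the data corpus slide. It is not the author's analysis, seven rows say nothing about the full corpus, and because Sprc, PRC, and Dep are category codes the coefficients are descriptive at best.

```python
import math

# (Sprc, PRC, Dep, Result) for the seven sample rows of the cleansed table.
rows = [
    (6, 0, 16, 1),
    (7, 0, 18, 1),
    (0, 0, 0, 1),
    (5, 1, 3, 1),
    (7, 0, 12, 1),
    (2, 2, 11, 0),
    (0, 0, 0, 1),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


result = [r[3] for r in rows]
correlations = {
    name: pearson([r[i] for r in rows], result)
    for i, name in enumerate(["Sprc", "PRC", "Dep"])
}
for name, value in correlations.items():
    print(f"{name} vs Result: {value:+.3f}")
```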
Deliverables
A data classifier that labels each purchase record in the
purchase logs from the Chicago City Council as
Leaky
Non-Leaky
Evaluation results of performance
I calibrate the model against different classifiers; the accepted accuracy
threshold is 80%, and the F1 score must be > 0.85.
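The acceptance check can be written down directly from a confusion matrix. The sketch below computes accuracy and F1 with "leaky" as the positive class and compares them with the stated thresholds. The counts in the usage lines are made-up illustration values, not results from this project, and reading "80%" as "at least 0.80" is an assumption.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 score from confusion-matrix counts (positive class: leaky)."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


def meets_thresholds(accuracy, f1, min_accuracy=0.80, min_f1=0.85):
    """True when accuracy is at least min_accuracy and F1 exceeds min_f1."""
    return accuracy >= min_accuracy and f1 > min_f1


# Hypothetical counts, for illustration only.
acc, f1 = accuracy_and_f1(tp=90, fp=10, fn=10, tn=40)
print(f"accuracy={acc:.3f} f1={f1:.3f} accepted={meets_thresholds(acc, f1)}")
```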