Conflation, Data Quality and MADness (David Smith)
Published in: Technology
  • DS will provide words for this slide
  • Clean & validate source data; geocode. Integrate: unique facility ID linkages, matching. Select best pick data: locational & non-locational. Discuss business rules and precedence. Note: we don’t change program data.
  • This slide is meant for viewing in slideshow mode. When in slideshow mode, click on the titles (Data Collection, Q/A Enhancement and Data Publishing) to open and close the relevant box of text.
  • In addition to cross-media integration, FRS has a core mission of locational data improvement, and it has a framework for iteratively improving locational data. To assess locational data, it is important to understand what the data represents, so that the best-quality data and best representation rise to the top. A key piece is the LRT, which collects and houses all of the locational data about a given facility. To make sense of the various locations, the LRT also tracks record-level metadata called “MAD codes”, and FRS uses a “best pick” algorithm to sort through the LRT data. These are briefly discussed in the following slides.
  • As mentioned, LRT is the set of database tables that house all locational data associated with a given facility. Shown here is DuPont’s Spruance facility near Richmond, VA, which has 48 LRT records coming from several programs. The table shows the data within the LRT; several of the columns deal with MAD codes.
  • MAD codes are “method accuracy description” codes. They help us understand whether we are dealing with something like a high-precision GPS coordinate collected in the field or a very vague location (a dart thrown at a map). They also help us understand what the location represents, whether air stack locations, water outfalls, front gate, plant centroid, or another type of feature. These may be very important to program offices, but for getting a general location of where the main facility is, they may in some cases be off by quite a bit. For example, a large industrial campus might cover many square miles.
  • MAD codes are a component of EPA’s Latitude/Longitude Data Standard, with some key pieces being information about the locational accuracy estimate, code sets for method of collection, reference point code (descriptions of what the point represents), and spatial reference system (FRS uses the federal NAD83 standard, versus WGS84 or state plane coordinates)
  • Once the earlier data fields have been checked, each record is scored by its similarity to other records in the following manner. The score determines whether or not it becomes a new record.
  • The first version of the EPA Latitude/Longitude Data Standard was developed back in 1998. This current version dates from 2006 but in reality, little has changed (except for the code values used) since 1998.
  • Are we talking about developing a business case or communicating a business case, or both? How about: “The FRS community of interest can facilitate a discussion among EN partners to review, improve and communicate the business case for FRS.” I don’t think the “FRS can work with….” phrase is clear. An FRS data steward can work with partners, but FRS itself isn’t capable of working with anyone. Does that mean that EN partners can use FRS to share data and improve its quality? This slide suggests there is some dialogue about FRS involving EN partners. Aside from the technical aspects of flowing data, I don’t think that’s happening. We aggregate data on facilities dealing with industrial classifications, points of contact, organizational affiliation and other information to provide a holistic view of the facility across programs. For data enhancements, we are now indexing facilities to census block, HUC12s, congressional district and other geographies. We are doing a number of validations to check addresses against NAVTEQ streets data and USPS postal databases, and we validate lat/long values as well as doing many other checks.
  • Transcript

    • 1. Conflation, Data Quality and MADness
      ESRI Developer Meetup, June 7th, 2011
      U.S. Environmental Protection Agency
      USEPA Office of Environmental Information
      David G Smith, PE, PLS, 202-566-0797
      Smith.DavidG@epa.gov
      Twitter: @DruidSmith
    • 2. Metadata??
    • 3. FRS Overview
      Facility Registry System
      FRS is a data aggregator
      FRS performs integration, validation and QA across over 30 federal databases and over 50 state, territory and tribal databases
      FRS contains information on nearly 2.8 million facilities
      > 80% of facilities have lat/long information
    • 4. What FRS Does
      FRS improves program facility data validity from 40% to 95% by selecting the best contact and location information from multiple data sources
      Allows EPA, public, academic, and investment communities to evaluate compliance with environmental regulations
      Provides robust, complete view of facility information, facilitating cross-media analyses:
      Community-based initiatives
      Environmental justice analyses
      NEPA assessments
      Emergency response
      Other mission needs (TMDL program, climate change analysis, etc.)
    • 5. FRS Features
      Provides a more complete, holistic, cross-media view of key facility information through verification and data management procedures
      Incorporates layers of quality control: the FRS record is checked for completeness, consistency, and validity and is owned by FRS
      Integrates information from program national systems, state master facility records, tribal partners, and other federal agencies
      Supported by a network of data stewards covering both geographic and programmatic areas of expertise
      Fully integrated with the Locational Data and the Integrated Error Correction Process (IECP)
    • 6. FRS Features
      Provides essential support for applications that rely on integrated views of facilities
      GIS applications (EnviroMapper, MyEnvironment)
      Public access applications (Envirofacts, Cleanups in My Community (CIMC))
      Enforcement systems and applications (IDEA, OTIS, ECHO, ICIS)
      Offers specialized services to applications in need of accurate facility information
      Emergency Response
      TRI-ME web
      DMR Loadings Tool
      Provides web services, enabling data exchanges with state partners on the Environmental Exchange Network
    • 7. FRS Scope
      Major Programs Represented in FRS
      Air: AFS, AQS, CAMDBS, EGRID, NEI, RBLC, RFS (Ethanol)
      Water: PCS, ICIS-NPDES, SDWIS, CWNS
      Chemical Releases: TRIS, RMP, TSCA, SSTS, FRP, BRAC
      Hazardous Waste: ACRES, CERCLIS, RCRAINFO, RADINFO
      Enforcement/Compliance: ICIS, ECRM, NCDB
      Schools: NCES, GNIS, BIA INDIAN SCHOOL
      Other: LANDFILL
      http://www.epa.gov/enviro/html/frs_demo/new_crosswalks.html
    • 8. FRS Data Model
      High Level Data Model (diagram; entities shown: Facility/Site, Environmental Interest, Supplemental Interest, Organization, Individual, Affiliation, Industrial Classification, Mailing Address, Alternative Name, Geospatial)
    • 9. FRS Data Pipeline
    • 10. QA Process
    • 11. Integration?
      Air Permit Coordinate
      Water Permit Coordinate
      Toxics Permit Coordinate
      Best Facility Coordinate?
    • 12. FRS Processing
      Q/A Enhancement
      Data Collection
      Data Publishing
      • At the publishing stage, extracts of the FRS geospatial database are provided as geospatial downloads
      • 13. The FRS geospatial database provides web services, database connections and spatial queries for a wide variety of web mapping applications, for example MyEnvironment, Cleanups in My Community, IDEA/ECHO/OTIS and many others
      • 14. For Title 40 regulated programs, CDX collects locational and parametric data for the program offices, and facility data goes to FRS.
      • 15. Several program offices have their own systems that collect and manage locational and parametric data – Envirofacts pulls data from these, and FRS serves as the locational component for Envirofacts
      • 16. FRS contains many data enhancement, lookup and validation services that aid and assist other CDX flows.
      • 17. FRS receives locational data updates and edits from regional data stewards as needed.
      • 18. Envirofacts pulls data from the program offices, taking in parametric data and sending locational data to FRS. FRS serves as the geospatial component of Envirofacts
    • Locational Data Accuracy and Best Pick
      FRS utilizes the EPA Lat/Long Data Standard
      Locational Reference Tables (LRT)
      Method Accuracy Description (MAD)
      Best Pick
    • 19. Locational Reference Table
      All underlying information from programs is retained, to include locational data
      For any given facility, there may be multiple individual locations that have been gathered, e.g. an associated air stack location, water outfall location, front gate location, et cetera
      MAD Codes help us to assess how to handle locational data quality, as well as to understand what the data represents
      http://www.epa.gov/enviro/html/locational/lrt_viewer.html
    • 20. MAD Codes
      MAD Codes help us to assess how to handle locational data quality, as well as to understand what the data represents
    • 21. MAD Codes
      http://www.exchangenetwork.net/standards/Lat_Long_Standard_08_11_2006_Final.pdf
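The notes for the MAD Codes slides describe four kinds of record-level metadata: an accuracy estimate, a method of collection, a reference point (what the coordinate represents), and a spatial reference system. A minimal sketch of how one such record might be modeled follows. All field names and sample values here are hypothetical, chosen for illustration; they are not the element names or code values from the EPA Latitude/Longitude Data Standard or the actual LRT columns.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocationRecord:
    # All field names are hypothetical, not actual LRT column names.
    program: str                  # source program, e.g. "TRIS"
    latitude: float
    longitude: float
    accuracy_m: Optional[float]   # accuracy estimate in meters, if supplied
    collection_method: str        # how the coordinate was collected
    reference_point: str          # what the coordinate represents
    datum: str = "NAD83"          # the notes say FRS uses NAD83


def count_by_reference_point(records):
    """Count how many records describe each kind of feature."""
    return Counter(rec.reference_point for rec in records)


# Made-up sample values, for illustration only.
sample = [
    LocationRecord("AFS", 37.0, -77.0, 25.0, "GPS", "air stack"),
    LocationRecord("PCS", 37.01, -77.01, None, "map interpolation", "water outfall"),
    LocationRecord("TRIS", 37.0, -77.0, 100.0, "address match", "front gate"),
    LocationRecord("AFS", 37.002, -77.001, 25.0, "GPS", "air stack"),
]
counts = count_by_reference_point(sample)
```

Grouping by reference point is only one way to inspect a facility's records; it shows why a single facility can legitimately carry many different coordinates.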
    • 22. Match & Integrate Facility Data
      Scoring method: to determine if two records are the same facility
      25 points, parsed street number
      50 points, matching standardized city name, standardized county name, state and zip
      Score 100: an environmental interest is created for the source, and associated to the matched FRS record
      Score 50 to 100: FRS creates a new record and a new associated environmental interest, the new record is identified as having possible matches
      Score <30: FRS creates a new record with a new interest
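The scoring rules on this slide can be sketched as follows. This is an illustrative sketch, not the FRS implementation: the slide lists only two point awards (which sum to 75, so other awards presumably exist but are not shown), and it does not say what happens for scores from 30 up to 50. The record field names, the function names, and the choice to treat "Score 100" as 100 or more are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacilityRecord:
    # Hypothetical field names; the slide does not name the FRS columns.
    street_number: str
    city: str      # assumed already standardized
    county: str    # assumed already standardized
    state: str
    zip_code: str


def partial_score(a, b):
    """Sum only the two point awards that are listed on the slide."""
    score = 0
    if a.street_number and a.street_number == b.street_number:
        score += 25  # parsed street number matches
    if (a.city, a.county, a.state, a.zip_code) == (
        b.city, b.county, b.state, b.zip_code
    ):
        score += 50  # city, county, state and zip all match
    return score


def action_for(score):
    """Map a total score to the outcome described on the slide."""
    if score >= 100:
        return "attach new interest to matched record"
    if 50 <= score < 100:
        return "new record, flagged with possible matches"
    if score < 30:
        return "new record"
    return "not specified on the slide"  # scores from 30 up to 50


# Made-up records for illustration.
a = FacilityRecord("100", "SPRINGFIELD", "EXAMPLE", "VA", "00000")
b = FacilityRecord("200", "SPRINGFIELD", "EXAMPLE", "VA", "00000")
```

With these two awards alone, `partial_score(a, a)` is 75 and `partial_score(a, b)` is 50, so both fall into the "possible matches" band.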
    • 23. Select the “Best Pick” Information
      FRS maintains a database table of manual verifications in the LRT.
      EPA/Regional verifications trump State verifications.
      Manually verified locations trump all the rest regardless of calculated accuracy or QA checks.
      In automated processing, Superfund NPL Site locations trump everything
      Our “normal” process is based on supplied or implied accuracy and QA checks performed (MAD codes).
      EPA Latitude/Longitude Data Standard (http://www.exchangenetwork.net/standards/Lat_Long_Standard_08_11_2006_Final.pdf)
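The precedence rules on this slide can be read as a tiered sort. The sketch below is one possible reading, not the FRS best pick algorithm: manual verifications come first (EPA/Regional ahead of State), then Superfund NPL site locations, then everything else. Within a tier it breaks ties by the smallest accuracy estimate, which is an assumption standing in for the "normal" MAD-code-based process that the slide does not spell out. All field and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LrtRecord:
    # Hypothetical fields, not actual LRT column names.
    source: str
    verified_by: Optional[str] = None   # "EPA", "STATE", or None
    is_npl: bool = False                # Superfund NPL site location
    accuracy_m: Optional[float] = None  # smaller is better


def precedence_key(rec):
    """Lower tuples win: tier first, then accuracy estimate."""
    if rec.verified_by == "EPA":
        tier = 0   # EPA/Regional manual verification
    elif rec.verified_by == "STATE":
        tier = 1   # State manual verification
    elif rec.is_npl:
        tier = 2   # NPL wins within automated processing
    else:
        tier = 3   # "normal" process
    # Assumption: a missing accuracy estimate sorts last within a tier.
    accuracy = rec.accuracy_m if rec.accuracy_m is not None else math.inf
    return (tier, accuracy)


def best_pick(records):
    """Return the record with the highest precedence."""
    return min(records, key=precedence_key)


# Made-up candidates for illustration.
candidates = [
    LrtRecord("program GPS point", accuracy_m=5.0),
    LrtRecord("NPL site", is_npl=True, accuracy_m=200.0),
    LrtRecord("state steward", verified_by="STATE", accuracy_m=50.0),
    LrtRecord("regional steward", verified_by="EPA", accuracy_m=100.0),
]
chosen = best_pick(candidates)
```

Here `chosen` is the regionally verified record even though the GPS point has a better accuracy estimate, which mirrors the rule that manual verification outranks calculated accuracy.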
    • 24. Business Case
      Users benefit from high-quality, integrated locational data for facilities in support of enforcement, compliance, analysis, assessment and community impact
      Being able to assess and manage large amounts of data of varying quality, e.g. VGI
    • 25. Thank You - URLs
