Data Normalization and Alignment in Heterogeneous Data Sets
Transcript

  • 1. Data Normalization and Alignment: Tales from the Data Crypt
      WWW.DATA–TACTICS.COM © 2012 Data Tactics ARCHITECT – ENGINEER – INTEGRATE – SOLUTIONS
  • 2. Data – The Hard Part
  • 3. Data Exploitation
      Information (Data) = Access + Understanding + Normalization
      Raw Data (ISOs, partitions, EnCase) → Recovery → Usable Data (Structured Tabular, Functional Databases) → Interpretation → Exploitable Information → Analysis → Knowledge
  • 4. What Day is it?
      • Problem:
          – 12/1/2002: Dec 1st or Jan 12th?
          – 10K+ spreadsheets
          – Date column with a wide mix of formats
      • Approach:
          – Define some rules for a best guess
          – Apache POI to access Excel data
          – Use Java Date routines to attempt to parse the data
          – Use statistical analysis to determine the most used formats
          – Look for nonsensical dates (e.g. months > 12 or years out of range)
          – Last-ditch heuristic: the date column appeared to be basically in date order, so look to nearby rows to determine the likely value
      • Result: 95%+ population of the date field.
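The last-ditch heuristic on this slide can be sketched roughly as follows. This is a minimal Python illustration, not the Java/Apache POI code the slide refers to: the two candidate formats, the year range, and the function names `candidates` and `resolve_column` are all assumptions made for the example, and the statistical ranking of formats is left out.

```python
from datetime import datetime

# Assumed candidate formats: month-first and day-first.
FORMATS = ["%m/%d/%Y", "%d/%m/%Y"]
# Assumed plausible year range for the data set (hypothetical values).
MIN_YEAR, MAX_YEAR = 1990, 2012


def candidates(text):
    """Return the distinct in-range dates that `text` could represent."""
    found = []
    for fmt in FORMATS:
        try:
            d = datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # nonsensical under this format, e.g. month > 12
        if MIN_YEAR <= d.year <= MAX_YEAR and d not in found:
            found.append(d)
    return found


def resolve_column(cells):
    """Resolve a date column that is assumed to be roughly in date order.

    Unambiguous cells are taken as-is. An ambiguous cell takes whichever
    reading lies closest to the nearest unambiguous row. Cells that cannot
    be parsed, or that have no unambiguous neighbor, stay None.
    """
    options = [candidates(c) for c in cells]
    resolved = [opts[0] if len(opts) == 1 else None for opts in options]
    anchors = [(i, d) for i, d in enumerate(resolved) if d is not None]
    for i, opts in enumerate(options):
        if len(opts) > 1 and anchors:
            _, near = min(anchors, key=lambda a: abs(a[0] - i))
            resolved[i] = min(opts, key=lambda d: abs((d - near).days))
    return resolved


column = ["11/25/2002", "12/1/2002", "12/15/2002"]
print(resolve_column(column))
```

With the neighbors dated late November and mid December 2002, the ambiguous `12/1/2002` resolves to December 1st rather than January 12th.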
  • 5. Where is it?
      • Problem:
          – Databases exported to spreadsheets (two data sets: 119K+ and 18K+)
          – Multiple coordinate fields (Lat/Lon, MGRS)
          – Some records with multiple locations
          – No format checking on field entries
      • Examples
          – WELL AND WATER SYSTEM: 41R QQ 30961 96855 SEPTIC TANK: 41R QQ 30946 96869
          – 41R QQ 30990 90370, 41R QQ 31005 90337, 41R QQ 31017 90341, 41R QQ 30998 90378
          – GR 41R QQ 31 93
          – 41R QQ 32123 96814 41R QQ 32003 97004 41R QQ 32053 97204
          – GR 41R QQ 3238 9227 TO GR 41 R QQ 3238 9229. CULVERT GR 41R QQ 3250 9238
  • 6. Where is it?
      • Approaches
          – Use NGA's GEOTRANS software for conversions
          – Apply rules to determine the canonical location:
              • If Lat/Lon is present and parsable, use that
              • If MGRS is parsable, use that
              • Use crude NLP (regex) to extract candidate MGRS coordinates
              • Use the 1st valid coordinate found
              • Check validity against Province/District bounds
          – Supplement with an intern
              • Faster, more accurate
              • Leveraged the power of Excel
          – Considered: implementing multi-point objects for rows
      • Results:
          – 119K-row data set: 86.5% bad → 82.0% bad (4.5-point improvement)
          – 18K-row data set: 62.4% bad → 7.5% bad (54.9-point improvement)
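The "crude NLP (regex)" step might look something like the sketch below. The pattern, the equal-digit-count check, and the names `extract_mgrs` and `first_mgrs` are assumptions for illustration. It only finds strings shaped like MGRS references in free text; conversion (GEOTRANS on the slide) and the Province/District bounds check are not covered.

```python
import re

# Assumed pattern: zone number, latitude band, 100 km square letters, then
# easting and northing separated by whitespace. I and O are excluded because
# MGRS does not use them.
MGRS_RE = re.compile(
    r"\b(\d{1,2})\s?([C-HJ-NP-X])\s?([A-HJ-NP-Z]{2})\s?(\d{1,5})\s+(\d{1,5})\b",
    re.IGNORECASE,
)


def extract_mgrs(text):
    """Return every MGRS-shaped candidate in `text`, in compact form."""
    results = []
    for zone, band, square, east, north in MGRS_RE.findall(text):
        if not 1 <= int(zone) <= 60:
            continue  # UTM zones run from 1 to 60
        if len(east) != len(north):
            continue  # easting and northing must share one precision
        results.append(f"{int(zone)}{band.upper()}{square.upper()}{east}{north}")
    return results


def first_mgrs(text):
    """Apply the 'use the 1st valid coordinate found' rule."""
    found = extract_mgrs(text)
    return found[0] if found else None


field = "GR 41R QQ 3238 9227 TO GR 41 R QQ 3238 9229. CULVERT GR 41R QQ 3250 9238"
print(extract_mgrs(field))
# ['41RQQ32389227', '41RQQ32389229', '41RQQ32509238']
```

Requiring whitespace between easting and northing is a deliberate simplification that fits the examples on the previous slide; run-together digit strings would need an extra even-split rule.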
  • 7. Not all districts are equal
      • Problem:
          – Multiple data sets spanning 8 years
              • Spreadsheets, DBMSs
          – Looking to do analysis/stats by Province/District
          – Common enumeration problems
              • Multiple spellings/transliterations
              • Punctuation
              • Strange formatting (alternate names in parens)
          – District names/boundaries changed multiple times over the data span
      • Examples:
          – Eshkashim vs Ishkashiem, Dehdadi vs Dihdadi
          – Pul-i-khumri vs Puli Khumri
          – Darwaz-i-bala (nesay) vs Darwazbala
  • 8. Not all districts are equal
      • Approaches:
          – Find a current canonical set of district names and boundaries
          – Fix provinces 1st: manual, using Excel
          – If a geocoord is present, use that
          – Check names both inside and outside parens
          – Look at two phonetic encodings: Soundex and Double Metaphone
          – Look at lexical distance
      • Results
          – Soundex could help resolve partial Double Metaphone matches
          – A small lexical distance is a good indicator of a match, but a large distance does not rule a match out
              • Estalef vs Istalif (2)
              • Shigal Wa Sheltan vs Shaygal wa shital (7)
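The parenthesis check and the lexical-distance test can be sketched as below. This is an illustrative Python version under stated assumptions: "lexical distance" is taken to be plain Levenshtein edit distance, which reproduces the two distances quoted on the slide when computed case-sensitively on the raw strings. The helper names and the choice of which spelling counts as canonical are hypothetical, and the phonetic encodings (Soundex, Double Metaphone) are not implemented here.

```python
import re


def name_variants(raw):
    """Normalized forms of the text outside and inside parentheses."""
    inside = re.findall(r"\(([^)]*)\)", raw)
    outside = re.sub(r"\([^)]*\)", " ", raw)
    variants = []
    for part in [outside] + inside:
        key = re.sub(r"[^a-z]", "", part.lower())  # drop case, spaces, punctuation
        if key:
            variants.append(key)
    return variants


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def best_match(raw, canonical_names, max_dist=2):
    """Closest canonical name within `max_dist` edits, or None."""
    best = None
    for cand in canonical_names:
        for cand_key in name_variants(cand)[:1]:
            for key in name_variants(raw):
                d = levenshtein(key, cand_key)
                if d <= max_dist and (best is None or d < best[1]):
                    best = (cand, d)
    return best


print(levenshtein("Estalef", "Istalif"))                      # 2
print(levenshtein("Shigal Wa Sheltan", "Shaygal wa shital"))  # 7
print(best_match("Darwaz-i-bala (nesay)", ["Darwazbala"]))    # ('Darwazbala', 1)
```

Normalizing before measuring distance removes the punctuation-only differences, so `Pul-i-khumri` and `Puli Khumri` reduce to the same key and only genuine spelling variation is left for the distance threshold.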
  • 9. Not all districts are equal
      • Representative Data Set Statistics
          – No geocoords available in data set.
          – 59.5K+ Rows

      After Test                  Correct (%)   Incorrect (%)
      Canonical Lookup            47.48%        52.52%
      Double-Metaphone/Soundex    92.12%        7.88%
      Lexical Dist <=2            94.01%        5.99%

      After Test                  Unique Misses   % Unique Fixed
      Canonical Lookup            234
      Double-Metaphone/Soundex    42              82.05%
      Lexical Dist <=2            33              21.43%
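The "% Unique Fixed" column appears to be the share of the previous step's unique misses that the current step resolved. That reading is an inference from the numbers rather than something the slide states; the short check below, with a hypothetical helper name, reproduces both figures under it.

```python
def percent_fixed(misses_before, misses_after):
    """Share of the previous step's unique misses resolved by this step."""
    return 100.0 * (misses_before - misses_after) / misses_before


print(round(percent_fixed(234, 42), 2))  # 82.05
print(round(percent_fixed(42, 33), 2))   # 21.43
```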
  • 10. Would a Babel Fish even help?
      • Problem:
          – Foreign language SMS messages
          – Short, not always well-structured sentences
          – Afghanistan is a polyglot country:
              • Pashto
              • Dari/Urdu
              • Farsi
              • Arabic
          – Add to the problem
              • Abbreviations
              • Slang
      • Approaches
          – Automated translation services (Google & others)
  • 11. Would a Babel Fish even help?

      Language   Value
      Original   (not captured in the transcript)
      Arabic     [mwn in him [stEEsw] [G[ lw[ w] [mqaamaatw] [th] Warsaw[ staasw] [d] [s[ wl] [dywaal] [jw[ y] his kindnesses
      Farsi      The bottom in the Olympic Games described her years national
      Pashto     We will voice higher education authorities to you to the wall thanks
      Urdu       If you run school.

      skwal : (page 610)
          • Pl. skārah.
          • skwal
            skwal, s.m. (2nd) Shearing, clipping, cutting off wool, hair, nap, etc. by shears. Pl. skwalūnah. skwal kawul, verb trans. To shear, to clip. See
          • sʿkawul
            sʿkawul verb trans. (caus.) To cause to drink, imbibe, drink up, to water, as a horse, cattle, etc. To draw out, to unsheath. See
  • 12. DBMS Data Recovery
      • Problem:
          – The data is a database that must be reconstructed
      • Primary Issues:
          – Media recovery (unusual volume or partition schemes or formats)
          – Interrogation of backups to determine platform, version, and backup or export flavor
          – Establish a database server for the correct database platform and version, accounting for database physical layout and sizing
          – Character set encoding
          – Database administration for performance tuning or version upgrades to enable advanced features
  • 13. DBMS Interpretation
      • Problem:
          – What is the data structure, and what do the components mean?
      • Primary Issues:
          – Determine database schema entry points
              • SME knowledge necessary for denormalized data projections
              • Primary key and foreign key recovery
          – Examine metadata for data type distribution and possible embedded structure
              • E.g., XML nested in CLOBs
          – Data statistics and quality metrics: data size, density
              • Focus on populated data structures
          – Temporal data analysis: time hack distribution for all date/time cells
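A per-column profile of the kind this slide describes (density, plus a first look at the type distribution) could be sketched as follows. The function name `column_profile`, the dict-of-rows representation, and the rule that `None` and the empty string count as unpopulated are assumptions for the example; a real profile would be computed inside the database rather than in memory.

```python
def column_profile(rows, columns):
    """Density, distinct count and observed value types for each column."""
    total = len(rows)
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumption: None and the empty string both count as "not populated".
        populated = [v for v in values if v is not None and v != ""]
        profile[col] = {
            "density": len(populated) / total if total else 0.0,
            "distinct": len(set(populated)),
            "types": sorted({type(v).__name__ for v in populated}),
        }
    return profile


rows = [
    {"a": 1, "b": "x"},
    {"a": 2, "b": ""},
    {"a": 2, "b": None},
    {"a": None, "b": "x"},
]
for name, stats in column_profile(rows, ["a", "b"]).items():
    print(name, stats)
```

Sorting columns by density is one simple way to "focus on populated data structures" before spending SME time on interpretation.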
  • 14. Scale of the structure problem
      • Example:
          – Undocumented schemas with over 7,000 tables and 160,000 columns, with minimal foreign key relationship definitions (the relationships between tables are not defined), and over 1 billion data rows
          – Only approximately 70% of primary keys for tables are defined
      • Primary Issues:
          – Need to reverse engineer missing primary keys and foreign keys, which represent a portion of SME knowledge of the data structures
          – Implement algorithms to extract missing foreign key relationships within each schema
              • http://liris.cnrs.fr/Documents/Liris-3034.pdf
              • http://www.comp.nus.edu.sg/~zmeihui/vldb10.pdf
              • http://www.cs.toronto.edu/dcs/theses/MSc/2002-03/Vilarem.msc.pdf
              • http://webdb09.cse.buffalo.edu/papers/Paper30/rostin_et_al_final.pdf
          – Complicating Matters
              • Artificial/pseudo keys (e.g. one-up numbers)
              • Compound keys (Column A + Column B = Column C)
  • 15. Primary Key Discovery
      Vilarem approximation for primary key discovery. Complexity is given by the equation:
          Complexity(ExtractKeys) = O(nKeyCands × p log p)
      where
          nKeyCands = number of key candidates,
          p = number of tuples (rows),
      and the number of key candidates is dependent upon the number of columns for a given table.
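To make the cost terms concrete, here is a brute-force sketch of key discovery. It is not the algorithm from the cited thesis: it simply enumerates column combinations up to an assumed `max_width` (this is where nKeyCands grows with the number of columns) and runs a sort-based uniqueness test on each (the p log p term). The function names are hypothetical, and values are assumed to be non-NULL and mutually comparable within each column.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Sort-based uniqueness test over the projection onto `cols`."""
    projected = sorted(tuple(row[c] for c in cols) for row in rows)
    # After sorting, any duplicate tuples are adjacent.
    return all(a != b for a, b in zip(projected, projected[1:]))


def candidate_keys(rows, columns, max_width=2):
    """Minimal unique column sets of at most `max_width` columns."""
    keys = []
    for width in range(1, max_width + 1):
        for cols in combinations(columns, width):
            # A superset of a known key is unique but not minimal: skip it.
            if any(set(k) <= set(cols) for k in keys):
                continue
            if is_unique(rows, cols):
                keys.append(cols)
    return keys


rows = [
    {"a": 1, "b": "x", "c": 1},
    {"a": 2, "b": "x", "c": 2},
    {"a": 3, "b": "y", "c": 1},
    {"a": 4, "b": "y", "c": 2},
]
print(candidate_keys(rows, ["a", "b", "c"]))  # [('a',), ('b', 'c')]
```

Uniqueness in the observed rows only makes a column set a candidate: a one-up number will always pass this test, which is why artificial keys complicate the interpretation.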
  • 16. Foreign Key Discovery
      Foreign key discovery without pruning approximation. Complexity is given by the equation:
          Complexity(ExtractUINDs) = O((nUindCands + nFKCands) × join(p))
      where
          nUindCands = key-based unary inclusion dependencies,
          nKeyCands = number of key candidates,
          p = number of tuples (rows),
      and the number of key candidates is dependent upon the number of columns for a given table.
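A key-based unary inclusion dependency says that every value of one column also appears in some key column, which makes the pair a foreign key candidate. The sketch below tests this without pruning, using in-memory set containment in place of the join in the complexity expression. The table layout, names, and sample data are hypothetical.

```python
def unary_inclusion_candidates(tables, key_columns):
    """Find (dependent, referenced) column pairs where dependent ⊆ referenced.

    tables: {table_name: {column_name: [values]}}
    key_columns: [(table_name, column_name)] of known or discovered keys
    """
    key_values = {(t, c): set(tables[t][c]) for t, c in key_columns}
    found = []
    for t, cols in tables.items():
        for c, values in cols.items():
            dependent = {v for v in values if v is not None}  # NULLs are ignored
            if not dependent:
                continue
            for (kt, kc), referenced in key_values.items():
                if (kt, kc) != (t, c) and dependent <= referenced:
                    found.append(((t, c), (kt, kc)))
    return found


tables = {
    "district": {"district_id": [1, 2, 3], "name": ["A", "B", "C"]},
    "event": {"event_id": [10, 11, 12, 13], "district_id": [1, 1, 3, None]},
}
keys = [("district", "district_id"), ("event", "event_id")]
print(unary_inclusion_candidates(tables, keys))
# [(('event', 'district_id'), ('district', 'district_id'))]
```

Containment alone over-generates when tables use one-up numbers as keys, since small integer ranges tend to include each other by accident, so the output is best treated as a list of candidates for review.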
  • 17. Horizontal Data Integration
      • Definition: Horizontal Integration
          – Multiple heterogeneous data resources become aligned in such a way that search and analysis procedures can be applied to their combined content as if they formed a single resource
      • Challenges
          – Quantity and variety
              • Need to do justice to radical heterogeneity in the representation of data and semantics
          – Dynamic environments
              • Need agile support for retrieval, integration and enrichment of data
          – Emergence of new data resources
              • Need an agile, flexible, and incremental integration approach
  • 18. Unified DataSpace + Semantic Enhancement
      The Wild
          • Data sources with rich data & semantic context locked in domain silos
          • Data tightly coupled to data-models
          • Data-models tightly coupled to storage models
          • Silos isolated by
              – Implementation technology
              – Storage structure
              – Data representation
              – Data modality
      [Diagram labels]
          • Segment 3 - Model Description: Data Models, rich semantic context
          • Segment 2 - Data Description: Structured Data
          • Segment 1 - Artifact Description: Unstructured Data, rich data context
          • Integration, Enrichment, Exploitation, Exploration across all sources
  • 19. Unified DataSpace
      • Segment 0 is an artifact store (i.e., binary representation of artifacts).
      • Segment 1 represents artifact semantics and includes artifact metadata and associations between the artifacts. Indexing of Segment 1 supports search on text content, geospatial, and artifact metadata.
      • Segment 2 represents data and semantics of structured data elements extracted from artifacts. Indexing of Segment 2 supports search on entities (e.g., Person, Location) based on their properties and relationships.
      • Segment 3 represents data-models extracted from artifacts and models used for aligning, disambiguating, and enriching the elements of Segments 1 and 2.
      [Figure: High-Level Conceptual Model of the DataSpace and Ingest/Extraction Flows. Labels: Segment 3 - Model Semantics (CONCEPT, CONCEPT_ASSOCIATION, PREDICATE, PREDICATE_ASSOCIATION); Segment 1 - Artifact Semantics (SOURCE, ARTIFACT, ARTIFACT_ASSOCIATION); Segment 2 - Data Semantics (TERM, STATEMENT); Segment 0 - Artifacts; flows: Ingest, Extraction, Uses, Metadata.]
  • 20. Semantic Enhancement
      • Requirements for Horizontal Integration
          – The ontologies must be linked together through logical definitions to form a single, nonredundant and consistently evolving integrated network
          – The ontologies must be capable of evolving in an agile fashion in response to new sorts of data and new analytical and warfighter needs
      • Creating Ontology Modules
          – Incremental distributed ontology development
              • Based on Doctrine
              • Involves SMEs in label selection and definition
          – Ontology development rules and principles
              • A shared governance and change management process
              • A common ontology architecture incorporating a common, domain-neutral, upper-level ontology (BFO)
          – An ontology registry
          – A simple, repeatable process for ontology development
          – A process of intelligence data capture through 'annotation' or 'tagging' of source data artifacts
          – Feedback between ontology authors and users
      • SE Architecture
          – The Upper Level Ontology (ULO) in the SE hierarchy must be maximally general (no overlap with domain ontologies)
          – The Mid-Level Ontologies (MLOs) introduce successively less general and more detailed representations of types which arise in successively narrower domains until we reach the Lowest Level Ontologies (LLOs)
          – The LLOs are maximally specific representations of the entities in a particular one-dimensional domain
  • 21. Modular Hierarchy
  • 22. References
      • Salmen et al. Integration of Intelligence Data through Semantic Enhancement, STIDS 2011
          – Strategy for developing an SE suite of orthogonal reference ontology modules
      • Smith et al. Ontology for the Intelligence Analyst, CrossTalk: The Journal of Defense Software Engineering, November/December 2012, 18-25
          – Shows how the SE approach provides immediate benefits to the intelligence analyst
      • Smith et al. Horizontal Integration of Warfighter Intelligence Data - A Shared Semantic Resource for the Intelligence Community
          – Describes a strategy that is being used for the horizontal integration of warfighter intelligence data within the framework of the US Army's Distributed Common Ground System Standard Cloud (DSC) initiative
          – Strategy rests on the development of a set of ontologies that are being incrementally applied to bring about what we call the 'semantic enhancement' of data models used within each intelligence discipline