Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes

As part of the 2018 HPCC Systems Summit Community Day event:

Up first, Farah Alshanik of Clemson University briefly discusses her poster, Equivalence Terms of Text Search Bundle.

Next, Lili Xu and Gus Reyna present their breakout session in the Machine Learning track.

Incorporating public records data into business processes is challenging because states describe similar events in disparate ways, and a standard is needed that gives each event one consistent meaning. This session tells the story of how HPCC Systems Machine Learning addressed the problem of mapping thousands of disparate public record data descriptions to a corresponding set of standard codes, and the future direction for this approach.

Lili Xu is a PhD candidate in the DICE lab, directed by Dr. Apon, in the School of Computing at Clemson University. This is her third internship with the HPCC Systems team, working on machine learning applications. Her research areas are machine learning, natural language processing and high performance computing. She speaks only three languages, but she can program in more than three.

Gus Reyna is a Director with LexisNexis Risk Solutions where he leads the engineering team for the Motor Vehicle Report (MVR) data products. He has been working at LexisNexis for 9 years building data solutions on the HPCC Systems platform.

Published in: Data & Analytics

  1. Innovation and Reinvention Driving Transformation. 2018 HPCC Systems® Community Day, October 9, 2018. Gus Reyna, Lili Xu. Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes
  2. Introduction: Using HPCC Systems Machine Learning to map thousands of violation descriptions to Standard Violation Codes
     • Background
     • Approach
     • Exploratory Analysis
     • Next steps
  3. Background
     • Public records data: birth/marriage/death certificates, business/professional/contractor licenses, foreclosures and tax liens, etc.
     • Businesses that use this data in their information technology processes must account for state variations of similar events.
     • LexisNexis maps public record data from different states to standard categories.
     • Businesses use the standard categories to create one system that can be used in all states.
  4. Problem
     • A team of Subject Matter Experts (SMEs) maps public record data to standardized categories.
     • Data grew faster than the team's mapping capacity.
     • Time to market for data products increased.
  5. Solution
     • Grow the team, train the team.
     • BUT, trainers are SMEs who are not mapping when they're training new team members
     • AND, it takes time to learn how to map public records data to standard categories
  6. Solution
     • Use HPCC Systems machine learning to generate 3 recommended standard categories for public record data …
     • Which shortens the time it takes new team members to become effective mappers and …
     • Reduces the time required for SMEs to train new team members
  7. Approach
     • Workflow: Categorized Data → Clean & Build Vocabulary → Training Data and Validation Data → Build Models → Models**
     • Validation Data and New Data → Run Through Model → Top 3 Category Recommendations → Mappers
     • ** Models: Support Vector Machine (SVM), Naïve Bayes
     • Categorized data example (Category: Description):
         Category 1: DOGS RUNNING AT LARGE
         Category 1: CANINE RUNNING AT LARGE PROHIBITED
         Category 2: RUNNING A RED LIGHT
         Category 3: FISHING WITHOUT LICENSE
  8. Build Vocabulary
     • Workflow step: Categorized Data → Clean & Build Vocabulary
     • Categorized data (Category: Description):
         Category 1: DOGS RUNNING AT LARGE
         Category 1: CANINE RUNNING AT LARGE PROHIBITED
         Category 2: RUNNING A RED LIGHT
         Category 3: FISHING WITHOUT LICENSE
     • Vocabulary (WORD: COUNT):
         RUN: 3
         LARG: 2
         CANIN: 1
         …
     • http://textanalysisonline.com/nltk-porter-stemmer
  9. Build & Validate Model
     • Workflow step: Training Data → Build Models → Models*; Validation Data → Run Through Model → Recommendations
     • * Models: Support Vector Machine (SVM), Naïve Bayes
     • Categorized data (Category: Description):
         Category 1: DOGS RUNNING AT LARGE
         Category 1: CANINE RUNNING AT LARGE PROHIBITED
         Category 2: RUNNING A RED LIGHT
         Category 3: FISHING WITHOUT LICENSE
     • Training data after stemming (Category: Description):
         Category 1: CANIN RUN AT LARG PROHIBIT
         Category 2: RUN A RED LIGHT
         Category 3: FISH WITHOUT LICENS
     • Validation data after stemming (Category: Description):
         Category 1: DOG RUN AT LARG
     • Recommendations for DOGS RUNNING AT LARGE (Category 1): Category 1, Category 2, Category 3
  10. Process New Data
     • Workflow step: New Data → Models* → Top 3 Category Recommendations → Mappers
     • * Models: Support Vector Machine (SVM), Naïve Bayes
     • Training data (Category: Description):
         Category 1: CANINE RUNNING AT LARGE PROHIBITED
         Category 2: RUNNING A RED LIGHT
         Category 3: FISHING WITHOUT LICENSE
     • New data (Description):
         CATS RUNNING AT LARGE
         HUNTING WITHOUT LICENSE
     • Recommended categories (Recommended Categories: Description):
         1, 2, 3: CATS RUNNING AT LARGE
         3, 1, 2: HUNTING WITHOUT LICENSE
  11. Approach (workflow diagram repeated from slide 7)
  12. Outcome
     • Backlog of public record data to standard category mapping eliminated.
     • Time to market for data products shortened, no more delays from data mapping to categories
     • Happy mapping team – could work on data enhancement projects.
  13. Exploratory Analysis
  14. NLP Toolkits on HPCC Systems Platform
     • Stages: Tokenizer, Semantic Analyzer, Stemmer, Stop-Words Remover, N-Gram
     • Example record (DESCRIPTION field) at each stage:
         Input: OPERATING/VEH/OVER MAX HGT
         Tokenizer: OPERATING VEH OVER MAX HGT
         Semantic Analyzer: OPERATING VEHICLE OVER MAX HEIGHT
         Stemmer: OPER VEHICL OVER MAX HEIGHT
         Stop-Words Remover: OPER VEHICL MAX HEIGHT
         N-Gram: OPER VEHICL, VEHICL MAX, MAX HEIGHT
  15. Latent Dirichlet Allocation - Topic Model
     • Unsupervised Natural Language Processing (NLP) algorithm
     • Explores the topics in documents
     • Each topic is a distribution over words
     • Each document is a mixture of topics
     • Each word is drawn from one of the document's topics
     • [Figure: LDA Topic Model]
  16. Scalable LDA on HPCC Systems Platform
     • Massively Parallel Topic Modeling
     • Flexible Hyper-Parameter Setup
     • Experiments Topic Range [10 – 103]
     • [Table: LDA TOPIC MODEL RESULT, columns TOPIC 1 through TOPIC 10. Stemmed terms in the order extracted, column assignment not recoverable: OPER UNLAW PROOF DRIVE IMPROP PARK VEHICL SPEED INSUR MOTOCYCL INSUR REQUIR SPEED DRIVE UNLAW SPEED SPEED SPEED IMPROP REQUIR PROOF VEHICL PROOF OPER OPER UNLAW VEHICL EY PARK IMPROP INSUR REQUIR UNLAW REQUIR UNLAW SPEED VEHICL EY UNLAW VEHICL PROOF INSUR MOTORCYCL PARK UNLAW INSUR REQUIR IMPROP MOTORCYCL FAIL EY PROTECT FAIL MOTORCYCL IMPROP OPER]
  17. Next Steps
     • Continue exploratory analysis
     • Additional algorithms
     • Automatically map data to categories, not just make recommendations
     • Refine existing and build new models
     • Solve other business problems with HPCC Systems Machine Learning
     • Uniform language
     • Ease of data access
     • High productivity
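
The vocabulary step on slide 8 (Build Vocabulary) can be sketched in a few lines of Python. This is only an illustration, not the presenters' HPCC Systems code: `toy_stem` and `build_vocabulary` are hypothetical names, and `toy_stem` is a deliberately tiny suffix stripper standing in for the Porter stemmer the slide links to. It happens to produce the stems shown on the slides for these four descriptions, but it is not a general stemmer.

```python
from collections import Counter

# Categorized descriptions copied from slide 8.
DESCRIPTIONS = [
    "DOGS RUNNING AT LARGE",
    "CANINE RUNNING AT LARGE PROHIBITED",
    "RUNNING A RED LIGHT",
    "FISHING WITHOUT LICENSE",
]


def toy_stem(word):
    """Strip a few common suffixes (assumption: a stand-in for Porter stemming)."""
    w = word.upper()
    for suffix in ("ING", "ED"):
        # Only strip when at least three letters remain, so RED stays RED.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            if w[-1] == w[-2]:  # RUNN -> RUN
                w = w[:-1]
            break
    else:
        # No ING/ED suffix was removed: try a plural S (DOGS -> DOG).
        if w.endswith("S") and not w.endswith("SS") and len(w) > 3:
            w = w[:-1]
    if w.endswith("E") and len(w) > 3:  # LARGE -> LARG
        w = w[:-1]
    return w


def build_vocabulary(descriptions):
    """Count how often each stemmed word occurs across all descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(toy_stem(token) for token in description.split())
    return counts


vocabulary = build_vocabulary(DESCRIPTIONS)
for word in ("RUN", "LARG", "CANIN"):
    print(word, vocabulary[word])
```

For the four sample descriptions the counts for RUN, LARG and CANIN agree with the WORD/COUNT table on the slide.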
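
Slide 9 (Build & Validate Model) names Naïve Bayes as one of the two model types. The sketch below is a minimal multinomial Naïve Bayes with add-one smoothing in plain Python, trained on the three stemmed training rows from the slide and checked against the single validation row. The function names, the smoothing choice, and the decision to ignore unseen words are assumptions made for illustration; the session itself used HPCC Systems Machine Learning, whose implementation is not shown in the slides.

```python
import math
from collections import Counter, defaultdict

# Stemmed training rows copied from slide 9.
TRAINING = [
    ("Category 1", "CANIN RUN AT LARG PROHIBIT"),
    ("Category 2", "RUN A RED LIGHT"),
    ("Category 3", "FISH WITHOUT LICENS"),
]


def train_naive_bayes(rows):
    """Collect per-category document and word counts plus the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for category, description in rows:
        tokens = description.split()
        class_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def recommend(model, description, k=3):
    """Return the k categories with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for category in class_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(class_counts[category] / total_docs)
        for token in description.split():
            if token not in vocabulary:
                continue  # assumption: words never seen in training are ignored
            # Add-one (Laplace) smoothing keeps unseen pairs from zeroing the score.
            smoothed = (word_counts[category][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


model = train_naive_bayes(TRAINING)
# Validation row from slide 9; its known category is Category 1.
top3 = recommend(model, "DOG RUN AT LARG")
print(top3)
```

On the validation row the known category comes out ranked first. With only three training rows this is a toy check, not evidence about real-world accuracy.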
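
The text-preparation stages on slide 14 (NLP Toolkits on HPCC Systems Platform) can be mimicked with small Python functions chained together. Everything here is a hypothetical stand-in: the abbreviation table, the stop-word set, and the stem lookup contain only the entries needed for the slide's example record, and the stage order is inferred from the intermediate values shown on the slide.

```python
import re

# Hypothetical lookup tables covering only the slide's example record.
ABBREVIATIONS = {"VEH": "VEHICLE", "HGT": "HEIGHT"}
STEMS = {"OPERATING": "OPER", "VEHICLE": "VEHICL"}  # stand-in for a real stemmer
STOP_WORDS = {"OVER"}


def tokenize(text):
    """Split on any run of non-alphanumeric characters, such as '/' or spaces."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text.upper()) if token]


def expand_abbreviations(tokens):
    """Replace known abbreviations with their full words."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def stem(tokens):
    """Map each token to its stem when one is known."""
    return [STEMS.get(token, token) for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning for classification."""
    return [token for token in tokens if token not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Build overlapping n-grams from consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("OPERATING/VEH/OVER MAX HGT")
expanded = expand_abbreviations(tokens)
stemmed = stem(expanded)
filtered = remove_stop_words(stemmed)
bigrams = ngrams(filtered, n=2)
print(bigrams)
```

Keeping each stage as a separate function makes it easy to reorder or skip stages, which matters during exploratory analysis when the best preprocessing is not yet known.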
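
The three statements on slide 15 (each topic is a distribution over words, each document is a mixture of topics, each word is drawn from the topics) describe LDA's generative story. The sketch below samples one synthetic document under that story. It shows the model's assumptions only, not how topics are fitted and not the scalable HPCC Systems implementation. The vocabulary reuses four stems that appear on slide 16, while the topic-word probabilities and the `alpha` value are made up for illustration.

```python
import random

# Four stems taken from the slide 16 results; the probabilities are hypothetical.
VOCABULARY = ["SPEED", "PARK", "INSUR", "PROOF"]
TOPIC_WORD_DISTS = [
    [0.7, 0.1, 0.1, 0.1],  # a topic dominated by SPEED
    [0.1, 0.7, 0.1, 0.1],  # a topic dominated by PARK
    [0.1, 0.1, 0.4, 0.4],  # a topic about INSUR and PROOF
]


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(topic_word_dists, vocabulary, doc_length, alpha, rng):
    """Sample a document: pick a topic mixture, then a topic and a word per position."""
    # Each document is a mixture of topics.
    theta = sample_dirichlet([alpha] * len(topic_word_dists), rng)
    words = []
    for _ in range(doc_length):
        # Each word is drawn from one topic chosen according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # Each topic is a distribution over words.
        word = rng.choices(vocabulary, weights=topic_word_dists[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)
theta, words = generate_document(TOPIC_WORD_DISTS, VOCABULARY, doc_length=8, alpha=0.5, rng=rng)
print(theta)
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word distributions and the per-document mixtures.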
