Text Mining to Correct Missing CRM Information by Jonathan Sedar

486 views
328 views

Published on

Text Mining to Correct Missing CRM Information by Jonathan Sedar

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
486
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Text Mining to Correct Missing CRM Information by Jonathan Sedar

  1. 1. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Text mining to correct missing CRM information A practical data science project LIGHTNING EDITION
  2. 2. In a nutshell ● A CRM dataset (100k business accounts) belonging to a national energy supplier ● A knotty problem: multiple accounts per company, without any grouping ids ● How can we to find groups of accounts (larger company structures), using just the CRM data? ● Machine Learning (ML) and Natural Language Processing (NLP) tools and techniques in Python. ● Import: Scikit Learn and TextBlob (NLTK & Pattern)
  3. 3. Show me the data! Company ID Account name Contact name Premises address lines 1 - 4 Billing address lines 1 - 4 1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford 1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford 1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford 2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford 3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork 3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork … x100,000 This crucial bit of info groups the separately- recorded accounts into companies… and was missing from the dataset
  4. 4. Transforming text into useful structures Account holder Business name Premises address Billing address ... x100,000 TF-IDF 2-D MATRIX Cleaned, parsed, tokenised text strings SVD N-D MATRIX NTLK.PunktTokeniser sklearn.TfidfVectorizer sklearn.TruncatedSVD
  5. 5. Now we can identify similar accounts: 1. Suggest similar accounts to be grouped sklearn.MiniBatchKMeans sklear.AffinityPropagation sklearn.RadiusNeighborsClassifier 2. Human validation & verification 3. Incorporate & propagate valid groupings
  6. 6. A pragmatic process using OOTB Python machine learning and human expertise ● A very quick turnaround from raw data to tagged companies to 93% accuracy ● ~40% of accounts found to belong to a company, ~3.5 accounts per company ● NLP toolkits and scikit-learn allowed rapid development and testing of solution ● Incorporated human identification at critical stages: no ML problem is an island
  7. 7. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Thank you Any questions?

×