Your SlideShare is downloading. ×
0
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text Mining to Correct Missing CRM Information by Jonathan Sedar
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Text Mining to Correct Missing CRM Information by Jonathan Sedar

193

Published on

Text Mining to Correct Missing CRM Information by Jonathan Sedar

Text Mining to Correct Missing CRM Information by Jonathan Sedar

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
193
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Text mining to correct missing CRM information A practical data science project LIGHTNING EDITION
  • 2. In a nutshell ● A CRM dataset (100k business accounts) belonging to a national energy supplier ● A knotty problem: multiple accounts per company, without any grouping ids ● How can we to find groups of accounts (larger company structures), using just the CRM data? ● Machine Learning (ML) and Natural Language Processing (NLP) tools and techniques in Python. ● Import: Scikit Learn and TextBlob (NLTK & Pattern)
  • 3. Show me the data! Company ID Account name Contact name Premises address lines 1 - 4 Billing address lines 1 - 4 1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford 1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford 1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford 2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford 3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork 3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork … x100,000 This crucial bit of info groups the separately- recorded accounts into companies… and was missing from the dataset
  • 4. Transforming text into useful structures Account holder Business name Premises address Billing address ... x100,000 TF-IDF 2-D MATRIX Cleaned, parsed, tokenised text strings SVD N-D MATRIX NTLK.PunktTokeniser sklearn.TfidfVectorizer sklearn.TruncatedSVD
  • 5. Now we can identify similar accounts: 1. Suggest similar accounts to be grouped sklearn.MiniBatchKMeans sklear.AffinityPropagation sklearn.RadiusNeighborsClassifier 2. Human validation & verification 3. Incorporate & propagate valid groupings
  • 6. A pragmatic process using OOTB Python machine learning and human expertise ● A very quick turnaround from raw data to tagged companies to 93% accuracy ● ~40% of accounts found to belong to a company, ~3.5 accounts per company ● NLP toolkits and scikit-learn allowed rapid development and testing of solution ● Incorporated human identification at critical stages: no ML problem is an island
  • 7. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Thank you Any questions?

×