ON

I
EDIT
G

NIN

T
LIGH

A practical data science project
● A CRM dataset (100k business accounts) belonging to
a national energy supplier
● A knotty problem: multiple accounts per...
Company
ID

Account name

Contact
name

Premises address
lines 1 - 4

Billing address
lines 1 - 4

1

Bob’s Pizza

Big Bob...
Account holder
Business name
Premises address
Billing address
...

Cleaned, parsed,
tokenised text
strings

NTLK.PunktToke...
1. Suggest similar
accounts to be grouped
2. Human
validation &
verification

3. Incorporate & propagate
valid groupings

...
● A very quick turnaround from raw data to tagged
companies to 93% accuracy
● ~40% of accounts found to belong to a compan...
Any questions?
Upcoming SlideShare
Loading in …5
×

Text mining to correct missing CRM information: a practical data science project

703 views

Published on

20min talk given at PyData London 2014

A client in the energy sector wanted to create predictive behavioural models of business customers at the company level, but the CRM data was messy, often containing several sub-accounts for each business, without any grouping identifiers, and so aggregation was impossible. In this talk I describe a short project where we used text mining, a handful of unsupervised learning techniques and pragmatic use of human skill, to identify the true company level structures in the CRM data.

Published in: Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
703
On SlideShare
0
From Embeds
0
Number of Embeds
28
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Text mining to correct missing CRM information: a practical data science project

  1. 1. ON I EDIT G NIN T LIGH A practical data science project
  2. 2. ● A CRM dataset (100k business accounts) belonging to a national energy supplier ● A knotty problem: multiple accounts per company, without any grouping ids ● How can we to find groups of accounts (larger company structures), using just the CRM data? ● Machine Learning (ML) and Natural Language Processing (NLP) tools and techniques in Python. ● Import: Scikit Learn and TextBlob (NLTK & Pattern)
  3. 3. Company ID Account name Contact name Premises address lines 1 - 4 Billing address lines 1 - 4 1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford 1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford 1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford 2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford 3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork 3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork This crucial bit of info groups the separatelyrecorded accounts into companies… and was missing from the dataset … x100,000
  4. 4. Account holder Business name Premises address Billing address ... Cleaned, parsed, tokenised text strings NTLK.PunktTokeniser x100,000 sklearn.TfidfVectorizer SVD N-D MATRIX TF-IDF 2-D MATRIX sklearn.TruncatedSVD
  5. 5. 1. Suggest similar accounts to be grouped 2. Human validation & verification 3. Incorporate & propagate valid groupings sklearn.MiniBatchKMeans sklear.AffinityPropagation sklearn.RadiusNeighborsClassifier
  6. 6. ● A very quick turnaround from raw data to tagged companies to 93% accuracy ● ~40% of accounts found to belong to a company, ~3.5 accounts per company ● NLP toolkits and scikit-learn allowed rapid development and testing of solution ● Incorporated human identification at critical stages: no ML problem is an island
  7. 7. Any questions?

×