Three experiments I have done with data science, related to text analysis and integration. The focus is on the learnings rather than on the details of how it was done or the source code. I feel it is important to see this subject in relation to business problems rather than as a pure branch of statistics. Focusing on what had to be done enabled me to find the right solution in a complicated and very interesting subject.
2. About Me
~15 years | ~12 products | Various roles
Name Gaurav Marwaha
Current Associated with Nucleus Software, with complete ownership of new product
development for a loan origination product for banks/NBFCs.
Driving the technology teams to deliver an internally re-usable product
development framework.
Past Have successfully led & contributed to multiple product teams in different
domains (GIS/ Health/ e-Governance)
Technology Java/ big-data/ analytics/ Spring/ ESB (Camel)/ Mobile/ Social
Product Conceptualization, Design, Development, Maintenance, EOL, Strategy &
roadmap
Soft Team building, coaching, mentoring
LinkedIn http://in.linkedin.com/in/gauravmarwaha/
3. Table of Contents
› Introduction
› Assumptions
› Experiment 1: Inferring written text
› Experiment 2: Scoring public data
› Experiment 3: Discovering cross sell opportunities
› Learnings
› Tools & References
5. Introduction
We generated more digital data in 2013 than ever before.
Everyone wants to know more about me, from my bank to the
places I shop, from Google to the mall store owner. Everyone
wants to know what I want before I myself know that I want it.
Quants have tried to predict stock movement from the
history of trades for years now.
Businesses can leverage the abundantly available data from
smart phones, desktops etc. to make critical CAPEX/
marketing decisions. Knowing how to derive value out of
this data is more important today than ever.
6. Assumptions
This short presentation will only focus on problems which I
worked on; it will avoid theoretical aspects of data science.
› Assuming viewers of this have read about:
– Language processing: stemming, tokenization, parts-of-speech tagging
– Basics of machine learning clustering/classification techniques
– Point clouds and dimensional analysis on data using them
– Java/J2EE based web application development
› To my knowledge, none of these experiments became part of a
commercial product.
› I have purposefully kept the presentation focused on learnings,
avoiding the nuts & bolts, to keep it short.
8. Scenarios
Text analysis refers to inferring valuable knowledge from a
given piece of text, which may help in further actions/decisions.
Challenges:
1. Slang – we use a lot of phrases which deviate
from the defined grammar of a language.
2. Ambiguity – there is a lot of ambiguity in some
sentences, where the speaker may be making a
pun or a sarcastic remark.
3. Language – English and Hispanic languages
are not the only ones spoken, and some users
may mix languages (e.g. English + Spanish).
Example applications (from the slide diagram):
– Customer Support: auto-respond bots for text, auto-respond IVR bots, auto responses to email queries
– Text Mining: legal text, medical records, email analysis
– Social Analysis: Facebook page analyzer, Twitter stream analysis, other sentiment analysis
– Computer Games: AI games, betting games
– Decision Support: military use
9. Customer Support
I will limit the discussion to the case where a user writes
in to customer support during off hours and, instead of a
standard reply, the query first goes to a bot which tries to
answer it.
There can be numerous other use cases for this service; the
key elements are:
1. The calling application – this is the consumer of the service
which passes the user query
2. Text parser – this is the engine which receives and parses text
3. Dictionary – a list of phrases/ words of interest, used to map the
query to something that the machine understands.
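A minimal sketch of how these elements could fit together, assuming a hand-built dictionary and a crude token-overlap score. The topic names, words and threshold below are all invented for illustration; the actual experiment used the Stanford parser rather than anything this simple.

```java
import java.util.*;

// Hypothetical sketch of the dictionary element: map a user query to a known
// topic by counting overlaps with per-topic phrase lists, and fall back to a
// canned "we will get back to you" reply when the best score is too low.
public class DictionaryMatcher {
    // Topic -> words of interest (all names here are illustrative).
    private final Map<String, Set<String>> dictionary = new HashMap<>();

    public DictionaryMatcher() {
        dictionary.put("billing", new HashSet<>(Arrays.asList("invoice", "bill", "charge", "refund")));
        dictionary.put("login", new HashSet<>(Arrays.asList("password", "login", "locked", "reset")));
    }

    // Lowercase and split on non-letters; a stand-in for real tokenization/stemming.
    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z]+"));
    }

    // Return the best-matching topic, or null if the score is below threshold.
    public String match(String query, int minScore) {
        List<String> tokens = tokenize(query);
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> e : dictionary.entrySet()) {
            int score = 0;
            for (String t : tokens) if (e.getValue().contains(t)) score++;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return bestScore >= minScore ? best : null;
    }

    public static void main(String[] args) {
        DictionaryMatcher m = new DictionaryMatcher();
        System.out.println(m.match("I was charged twice on my invoice", 1)); // billing
        System.out.println(m.match("hello there", 1));                       // null -> canned reply
    }
}
```

A null result is exactly the "low score" case described on the next slide, where the standard response is sent instead.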
10. Customer Support - How
The user keys in a query on a simple contact-us page. It is first
sent to the parser; if a low-score response is received, it is
discarded in favour of a pre-decided "we will get back to you"
response. The flow runs through a security shell (OAuth) in front
of the web application, text parser and dictionary.
1. Web application – a standard Spring based web application.
2. Security Shell – an OAuth provider shell to help with REST based security.
3. Text Parser – the Stanford NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml) and the CoreNLP package.
4. Notes – dictionary maintenance and finding nouns/subjects are all
covered by the standard documentation/tutorials. The tool also
supports languages other than English.
11. Learnings and Possible Uses
Learnings:
1. The dictionary is a very critical element; a well defined dictionary will
help identify subjects more easily, with the right scores.
2. Quality of data is the second key element: spelling mistakes,
ambiguous sentences and the emotions of the writer all play
different roles. A quick example is Porch/Porsche – a letter or
two, but it changes a lot.
Uses:
Other than customer support, a parser like this can also be used in
sentiment analysis or general text analysis.
13. Scenarios
All of us generate tons of public data, and
businesses can use it for profiling us both as
existing and prospective customers. A better
profiled customer is better served, which can
lead to a longer-term relationship.
Challenges:
1. Privacy – the user has to authorize access to
such data.
2. Authenticity – people may have fake accounts.
3. Volume – the sheer volume of such data may
make it difficult to analyze in a given time.
Public data sources and signals (from the slide diagram):
– LinkedIn: employment verification, type of connections, recommendations
– Facebook: personal nature, interests
– Twitter: following and followed by, tweet sentiment/text analysis, location data
– Blogs: text analysis, knowledge
14. Social/ Public Scores
The experiment is simple, which is to score an individual from
LinkedIn and Twitter data which is further used in employability
checks.
There can be numerous other use cases for this service, the key
elements are:
1. Social Networks – access to an account/user's personal data
2. A learning database that allows the machine to create good/bad/
neutral clusters from existing data
3. Choosing the right algorithm to identify the cluster
Data:
• LinkedIn: Experience, connections, degrees used for scoring
• Twitter: tweets, followers etc. used for personal scoring
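A hedged sketch of what a final score over such data might look like. The input dimensions (years of experience, connections, flagged tweets), the normalization caps and the weights are all assumptions for illustration, not the values used in the experiment.

```java
// Illustrative final-score aggregator: normalize two LinkedIn-style
// dimensions and a Twitter-style "flagged tweet" count into one 0..1
// employability score. Every weight and cap below is an assumption.
public class FinalScoreAggregator {
    public static double score(int yearsExperience, int connections, int flaggedTweets) {
        // LinkedIn side: cap each dimension, weight them equally (0..1 total).
        double linkedin = Math.min(yearsExperience / 15.0, 1.0) * 0.5
                        + Math.min(connections / 500.0, 1.0) * 0.5;
        // Twitter side: a penalty in 0..1 for socially flagged content.
        double twitterPenalty = Math.min(flaggedTweets / 10.0, 1.0);
        return Math.max(0.0, linkedin - 0.5 * twitterPenalty);
    }

    public static void main(String[] args) {
        System.out.println(score(10, 400, 0)); // clean Twitter history
        System.out.println(score(10, 400, 8)); // heavily penalized
    }
}
```

The point of the sketch is only the shape: each network contributes a normalized partial score, and the aggregator combines them.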
15. Social/Public Scores - How
The pipeline: Spring Social feeds the Twitter and LinkedIn parsers
inside the web application; a Twitter Score Engine (backed by a
dictionary) and a LinkedIn Score Engine (backed by a training data
set) feed a final score aggregator.
1. Spring Social – a standard module from Spring that helps us get
data from social networks into Java applications very easily.
2. Parsers – once data is in, we can write parsers/formatters to
cleanse data or move it into application-defined standard structures.
3. Twitter Score Engine – nothing but an extension of the textual
analysis tool, with the dictionary defining words that bring out
substance abuse/gambling and other socially unacceptable
characteristics.
4. LinkedIn Score Engine – the machine was pre-trained on some
sample data using standard dimensions provided by LinkedIn. We
used Encog and Weka.
5. Algorithm – we experimented with some basic machine learning
algorithms, including Bayesian and K-Means; we also tried fuzzy
K-Means.
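The K-Means clustering mentioned above can be illustrated with a minimal one-dimensional implementation (Lloyd's iterations). The real runs used Weka/Encog on multi-dimensional data, so this is only a toy to show the mechanic.

```java
import java.util.*;

// Minimal 1-D K-Means sketch: assign each point to its nearest centroid,
// then move each centroid to the mean of its assigned points, and repeat.
public class KMeans1D {
    public static double[] cluster(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: nearest centroid wins the point.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                sum[best] += p;
                count[best]++;
            }
            // Update step: centroid moves to the mean of its cluster.
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        // Converges to centroids 2.0 and 11.0 for these two obvious groups.
        System.out.println(Arrays.toString(cluster(points, new double[]{0, 5}, 10)));
    }
}
```

Fuzzy K-Means differs only in the assignment step, giving each point a graded membership in every cluster instead of a hard nearest-centroid assignment.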
16. Learnings and Possible Uses
Learnings:
1. Privacy laws across countries do not allow access to such data, but
companies are circumventing this by launching mobile apps which
have access to everything on your smartphone.
2. To make a machine take sane decisions it is critical to have the right
training data; this data becomes all the more critical for qualitative
attributes.
3. If you do not have a data scientist/statistician, you can play with
different algorithms. Genetic and neural algorithms may sound cool,
but they may not give the desired results.
4. Weka is a good tool to visualize the execution and also to help
select the right algorithm.
Uses:
This is a very generic public-data profiling application; it can have uses in
banks, HR departments and many other places.
18. Scenarios
This is the most complicated of the three scenarios.
Large corporations have hundreds of different
products, millions of customers and thousands of
salesmen across geographies. What is it that an
existing customer will buy next, especially in an
enterprise product environment?
"Say a sales person is visiting a customer and he/she
quickly wants to see what can be sold to this
customer."
Challenges:
1. Aggregation – data is being aggregated from
public and private data stores.
2. Time – the opportunity presentment window is
very short and a lot of data has to be crunched.
3. Availability – any of the services may be down at
any time.
Signals from the slide diagram: customer contact (inclination,
connections, common friends, decision authority, personal goals);
customer support (current escalations, last change request, service
history); products (licenses, installations, features); markets & regions
(data on customers in this market/region; market maturity, state etc.);
location and price (where? cheap? average? luxury?).
19. Cross Selling
This is not a simple experiment, it is aggregation of multiple public and
private data sources.
The key elements being:
1. Speed of decision/suggestion
2. Availability of and access to multiple API based services (paid/free)
3. Availability of enough data for the machine to have built up the
knowledge to take correct decisions
Data:
• LinkedIn: Common connects
• Twitter: tweets, followers etc. used for personal profiling
• Jigsaw: Company data
• Yahoo Finance API: Market information
• Customer Support: Analysis of tickets
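To make the aggregation concrete, here is a hypothetical sketch of the suggestion ranking: each candidate product carries a few aggregated signal values, and suggestions come out sorted best-first. The signal names, weights and product names are invented for illustration.

```java
import java.util.*;

// Illustrative cross-sell ranker: combine per-product signals (company match,
// support-ticket sentiment, market fit; all assumed names) into one score
// and return product names sorted best-first.
public class CrossSellAggregator {
    // Weights are arbitrary illustrative choices, not the experiment's values.
    static double score(double companyMatch, double supportSentiment, double marketFit) {
        return 0.4 * companyMatch + 0.3 * supportSentiment + 0.3 * marketFit;
    }

    public static List<String> suggest(Map<String, double[]> candidates) {
        List<String> names = new ArrayList<>(candidates.keySet());
        // Sort descending by combined score.
        names.sort((a, b) -> Double.compare(
                score(candidates.get(b)[0], candidates.get(b)[1], candidates.get(b)[2]),
                score(candidates.get(a)[0], candidates.get(a)[1], candidates.get(a)[2])));
        return names;
    }

    public static void main(String[] args) {
        Map<String, double[]> candidates = new LinkedHashMap<>();
        candidates.put("loan-module", new double[]{0.9, 0.6, 0.8});        // score 0.78
        candidates.put("collections-module", new double[]{0.5, 0.9, 0.4}); // score 0.59
        System.out.println(suggest(candidates)); // loan-module first
    }
}
```

In the real system each signal value would itself be the output of a connector pipeline (LinkedIn, Twitter, Jigsaw, Yahoo Finance, customer support), which is where the speed and availability challenges come from.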
20. Cross Sell - How
The pipeline reuses the web application, Spring Social, the Twitter/
LinkedIn parsers and score engines, the dictionary and the training
data set, and adds Yahoo and Jigsaw connectors/formatters plus
customer support data, all feeding a final list of suggestions.
1. Previous Modules – refer to previous slides for a description of
repeated modules.
2. Yahoo Connector – fetches data from the Yahoo Finance API and
formats some structured/unstructured data into more structured
data which can be analyzed.
3. Jigsaw Connector – fetches Jigsaw company information over API
calls. Note: this API now looks to have moved to data.com.
4. Final Suggestions – basically a quick aggregator of data with
built-in custom logic for scoring and location analysis; that is, once
we have the final list of contacts we overlay the sales rep's location.
5. Algorithms – Text: a combination of noun & knowledge extraction
from free text using SOLR & NLP. Jigsaw: company match to
indicate closeness to the selected customer.
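The company-match closeness can be approximated, for illustration only, by a token-level Jaccard similarity over company names. The real engine matched on richer Jigsaw company fields; this sketch only shows the idea of a 0..1 closeness score.

```java
import java.util.*;

// Illustrative company-match closeness: Jaccard similarity over name tokens,
// i.e. |intersection| / |union| of the two token sets.
public class CompanyMatch {
    static Set<String> tokens(String name) {
        return new HashSet<>(Arrays.asList(name.toLowerCase().split("[^a-z0-9]+")));
    }

    public static double closeness(String a, String b) {
        Set<String> ta = tokens(a), tb = tokens(b);
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(closeness("Acme Software Ltd", "Acme Software")); // 2 of 3 tokens shared
        System.out.println(closeness("Acme Software", "Globex Bank"));       // no tokens shared
    }
}
```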
21. Learnings and Possible Uses
Learnings:
1. Data Quality: leaving aside the complexity of integration and multiple
data sources, the quality of data and its importance in decision
making, especially in the enterprise world, was the critical learning.
2. In most complicated real-world scenarios there is no single solution
which will fit.
3. Agile: breaking the problem into several smaller problems made life
much simpler.
4. Human judgment: whatever the machine may show the sales rep, in
case he/she ignores it and decides to cross-sell something else, that
has to go back to the machine as learning, else the intelligence will
slowly die away.
Uses:
Multiple; left to the imagination of the reader.
23. Big Picture – Data Quality
Enterprise/ B2B World
Public/ B2C World
Data entry is a cost center and also a cornerstone of
enterprise applications. The data that we use for
machines to learn from has mostly been captured by
humans over the past years. Data entry is not the
most rewarding career, and people tend to make
mistakes; wrong addresses, figures and names are very
common. Focusing on the quality of data entry reduces
speed, which means reduced volumes.
Imagine Amazon: when you buy a book, what data
does it capture about you? Clicks, geo-IP, browser,
products viewed/liked/bought/searched etc., some
data from cookies, your past searches and your
profile. To place the order most of us will give the
right address and phone number with payment
information. As you can see, a lot of this data is
machine generated, which makes analysis more accurate.
Conclusion
• Curing data is possible, but it is important to balance quality, quantity and the cost of data entry by designing
applications that strike the right balance among them.
• Master data management, data quality programs and data curing are all costly affairs if done late in the
enterprise.
• The aggregation of public and private data sets is a reality in today's world, and "identity", that is identifying an
individual across these data sets, is also a real challenge.
24. Big Picture – Others
Machine Learning – how much, and what, is required to solve the
problem at hand? Reusing what is already done and applying it to
the business problem is good.
Big Data – is not the same as data analysis; it can speed up the
analysis and may or may not be applicable to your problem.
Integrations – are the way to go in future; all these mountains of
data will soon integrate.
Agility – hit smaller chunks of doable work items and slowly take
down the larger beast.
Data – data & data quality are tremendously important; a few
hundred bad apples can spoil a lot more.
Data Scientists – an important position in the overall picture;
complicated scoring/analysis requires specialized skills.
26. Tools & References
› Tools:
– The normal Spring/JEE stack with many Spring modules has been
used to develop these applications
– Eclipse used as source code editor
– The other tools like Stanford NLP, Encog and Weka are listed with
links on individual slides
› References:
– There are good courses on Coursera
– The Stanford, Weka and Encog websites also have a lot of reading
material
– Presentation template & graphics provided by Microsoft