Data Science
Learning from experiments
About Me
~15 years | ~12 products | Various roles

Name: Gaurav Marwaha
Current: Associated with Nucleus Software, with complete ownership of new product
development for a loan origination product for banks/ NBFCs.
Driving the technology teams to deliver an internally reusable product
development framework.
Past: Have successfully led and contributed to multiple product teams in different
domains (GIS/ Health/ e-Governance)

Technology: Java/ big-data/ analytics/ Spring/ ESB (Camel)/ Mobile/ Social
Product: Conceptualization, Design, Development, Maintenance, EOL, Strategy &
roadmap
Soft: Team building, coaching, mentoring
LinkedIn: http://in.linkedin.com/in/gauravmarwaha/
Table of Contents
› Introduction
› Assumptions
› Experiment 1: Inferring written text
› Experiment 2: Scoring public data
› Experiment 3: Discovering cross sell opportunities
› Learnings
› Tools & References
Introduction
We generated more digital data in 2013 than in any year
before it. Everyone wants to know more about me, from my
bank to the places I shop, from Google to the mall store
owner. Everyone wants to know what I want before I know
it myself.
Quants have tried for years to predict stock movements
from the history of trades.
Businesses can leverage the abundant data available from
smartphones, desktops, etc. to make critical CAPEX/
marketing decisions. Knowing how to derive value from
this data is more important today than ever.
Assumptions
This short presentation focuses only on problems I have
worked on; it avoids the theoretical aspects of data science.
› It assumes viewers have read about:
– Language processing: stemming, tokenization, part-of-speech tagging
– Basics of machine learning clustering/ classification techniques
– Point clouds and dimensional analysis on data using them
– Java/ J2EE based web application development
› To my knowledge, none of these experiments became part of a
commercial product.
› I have purposely kept the presentation focused on learnings,
leaving out the nuts & bolts to keep it short.
Experiment 1: Inferring written text
Scenarios
Text analysis refers to inferring valuable
knowledge from a given piece of text, which
may help in further actions/ decisions.

TEXT ANALYSIS scenarios:
› Customer Support
– Auto respond bots for text
– Auto respond IVR bots
– Auto email responses to email queries
› Text Mining
– Legal text
– Medical records
› Social Analysis
– Facebook page analyzer
– Twitter stream analysis
– Other sentiment analysis
› Computer Games
– AI games
– Betting games
› Decision Support
– Military use
– Email analysis

Challenges:
1. Slang: we use a lot of phrases which deviate
from the defined grammar of a language.
2. Ambiguity: there is a lot of ambiguity in some
sentences, where the speaker may be making a
pun or a sarcastic remark.
3. Language: English and the Hispanic languages
are not the only ones spoken; some users also
mix languages, e.g. English + Spanish.
Customer Support
I will limit the discussion to this scenario: a user writes
in to customer support during off hours and, instead of
receiving a standard reply, the query first goes to a bot
which tries to answer it.
There can be numerous other use cases for this service. The
key elements are:
1. The calling application: the consumer of the service,
which passes in the user query
2. Text parser: the engine which receives and parses the text
3. Dictionary: a list of phrases/ words of interest, used to map the
query to something that the machine understands
Customer Support - How
Diagram components: Web application, Security Shell (OAuth), Text Parser, Dictionary

The user keys in a query on a simple "contact us" page. It is
first sent to the parser; if a low-score response is received,
it is discarded in favour of a pre-decided "we will get back
to you" response.

1. Web application: standard Spring based web application
2. Security Shell: OAuth provider shell to help with REST based security
3. Text Parser: Stanford NLP Parser
(http://nlp.stanford.edu/software/lex-parser.shtml) and the core-NLP
package
4. Notes: dictionary maintenance and finding nouns/ subjects are all
part of the standard documentation/ tutorials. The tool also supports
languages other than English.
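The flow described above (parse the query, score it against the dictionary, fall back to a canned reply on a low score) can be sketched roughly as follows. This is a minimal illustration in Python, not the Spring/ Stanford NLP implementation used in the experiment: the dictionary entries, the scoring rule (the fraction of a topic's phrases found in the query), the 0.5 threshold and all function names are assumptions, and a plain regex tokenizer stands in for the parser.

```python
import re

# Hypothetical dictionary: topic -> phrases of interest and a canned answer.
DICTIONARY = {
    "password_reset": {
        "phrases": {"password", "reset", "login"},
        "answer": "You can reset your password from the login page.",
    },
    "billing": {
        "phrases": {"invoice", "billing", "refund"},
        "answer": "Billing queries are handled under My Account > Invoices.",
    },
}

FALLBACK = "We will get back to you."
THRESHOLD = 0.5  # assumed cut-off below which the bot's answer is discarded


def tokenize(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def score(query, phrases):
    """Fraction of the topic's phrases that occur in the query."""
    return len(tokenize(query) & phrases) / len(phrases)


def respond(query):
    """Return the best-scoring canned answer, or the fallback on a low score."""
    best_topic, best_score = None, 0.0
    for topic, entry in DICTIONARY.items():
        s = score(query, entry["phrases"])
        if s > best_score:
            best_topic, best_score = topic, s
    if best_topic is None or best_score < THRESHOLD:
        return FALLBACK
    return DICTIONARY[best_topic]["answer"]
```

With this toy dictionary, `respond("I forgot my password and need to reset it")` returns the password answer, while a query that matches no phrases gets the fallback reply.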
Learnings and Possible Uses
Learnings:
1. The dictionary is a very critical element; a well-defined dictionary
helps identify subjects more easily and with the right scores.
2. Quality of data is the second key element: spelling mistakes,
ambiguous sentences and the emotions of the writer all play
different roles. A quick example is Porch/ Porsche: the words are
only a letter or two apart, but the meaning changes a lot.

Uses:
Other than customer support, a parser like this can also be used in
sentiment analysis or text analysis.
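The Porch/ Porsche point can be illustrated with a small fuzzy-matching sketch. It uses Python's standard `difflib` module; the subject list, the 0.8 cutoff and the rule "accept a fuzzy match only when exactly one candidate is close" are assumptions made for the example, not something taken from the experiment.

```python
import difflib

# Hypothetical dictionary of subjects the bot knows about.
SUBJECTS = ["porsche", "porch", "garage", "warranty"]


def similarity(a, b):
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(word, subjects=SUBJECTS, cutoff=0.8):
    """Return (match, candidates).

    An exact dictionary hit wins. Otherwise close candidates are collected,
    and a fuzzy match is accepted only when it is unambiguous.
    """
    word = word.lower()
    if word in subjects:
        return word, [word]
    candidates = difflib.get_close_matches(word, subjects, n=3, cutoff=cutoff)
    match = candidates[0] if len(candidates) == 1 else None
    return match, candidates
```

Here the misspelling "porshe" resolves to "porsche", but "porche" is close to both "porsche" and "porch", so `resolve` returns no match and leaves the decision to a fallback or a human. That is the data quality problem in miniature: the two words score as very similar, so blind spelling correction can silently pick the wrong subject.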
Experiment 2: Scoring public data
Scenarios
All of us generate tons of public data, and
businesses can use it to profile us both as
existing and prospective customers. A better
profiled customer is better served, which can
lead to a longer-term relationship.

PUBLIC DATA sources:
› LinkedIn
– Employment verification
– Type of connections
– Recommendations
› Facebook
– Personal nature
– Interests
› Twitter
– Following and followed by
– Tweet sentiment/ text analysis
– Location data
› Blogs
– Text analysis
– Knowledge

Challenges:
1. Privacy: the user has to authorize access to
such data
2. Authenticity: people may have fake accounts
3. Volume: the sheer volume of such data may
make it difficult to analyze in a given time
Social/ Public Scores
The experiment is simple: score an individual from LinkedIn and
Twitter data, with the score then used in employability checks.
There can be numerous other use cases for this service. The key
elements are:
1. Social networks: access to an account/ user's personal data
2. A learning database that allows the machine to create good/ bad/
neutral clusters from existing data
3. Choosing the right algorithm to identify the cluster

Data:
• LinkedIn: experience, connections, degrees used for scoring
• Twitter: tweets, followers etc. used for personal scoring
Social/ Public Scores - How
Diagram components: Web Application, Dictionary, Spring Social,
Twitter Parser, Twitter Score Engine, LinkedIn Parser, LinkedIn Score
Engine, Training data set, Final Score aggregator

1. Spring Social: a standard Spring module that makes it very easy
to get data from social networks into Java applications.
2. Parsers: once the data is in, we can write parsers/ formatters to
cleanse it or move it into application-defined standard structures.
3. Twitter Score Engine: an extension of the text analysis tool, with
the dictionary defining words that bring out substance abuse/
gambling and other socially unacceptable characteristics.
4. LinkedIn Score Engine: the machine was pre-trained on some sample
data using standard dimensions provided by LinkedIn. We used Encog
and Weka.
5. Algorithm: we experimented with some basic machine learning
algorithms, including Bayesian and K-Means, and also tried
fuzzy K-means.
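To make the clustering step concrete, here is a minimal plain K-means sketch. The experiment itself used Encog and Weka; this standalone Python version exists only to show the assign/ update loop. The two features (years of experience, connections divided by 100), the sample values and the choice of seeding centroids with the first k points are all assumptions for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignment


# Hypothetical, pre-scaled profile features: (years of experience, connections / 100).
profiles = [(1.0, 0.5), (9.0, 5.0), (1.5, 0.8), (10.0, 6.0), (2.0, 0.4), (8.5, 5.5)]
centroids, labels = kmeans(profiles, k=2)
```

On this toy data the junior-looking and senior-looking profiles end up in separate clusters. Mapping clusters to good/ bad/ neutral labels still needs labelled training data, which is the point the learnings make about training data.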
Learnings and Possible Uses
Learnings:
1. Privacy laws across countries do not allow access to such data, but
companies are circumventing this by launching mobile apps which
have access to everything on your smartphone.
2. For a machine to take sane decisions it is critical to have the right
training data; this becomes all the more critical for qualitative
attributes.
3. If you do not have a data scientist/ statistician, you can still play
with different algorithms. Genetic and neural algorithms may sound
cool, but they may not give the desired results.
4. Weka is a good tool for visualizing the execution, and can also be
used to select the right algorithm.

Uses:
This is a very generic public data profiling application; it can have uses in
banks, HR departments and many other places.
Experiment 3: Discovering cross sell opportunities
Scenarios
This is the most complicated of the three scenarios.
Large corporations have hundreds of different
products, millions of customers and thousands of
salesmen across geographies. What is it that an
existing customer will buy next, especially in an
enterprise product environment?

"Say a sales person is visiting a customer and he/ she
quickly wants to see what can be sold to this
customer."

Data dimensions shown on the slide:
› CUSTOMER CONTACT: inclination, connections, common friends,
decision authority, personal goals
› CUSTOMER SUPPORT: current escalations, last change request,
service history
› PRODUCTS: licenses, installations, features, location (where?),
price (cheap? luxury? average?)
› MARKETS & REGIONS: data on customers in this market/ region;
data related to the market (maturity, state etc.)

Challenges:
1. Aggregation: data is being aggregated from
public and private data stores
2. Time: the opportunity presentment window is
very short and a lot of data has to be crunched
3. Availability: any time that any service is down
Cross Selling
This is not a simple experiment; it is an aggregation of multiple public and
private data sources.
The key elements are:
1. Speed of decision/ suggestion
2. Availability of, and access to, multiple API based services (paid/ free)
3. Availability of enough data for the machine to have built up the knowledge
to take the correct decision

Data:
• LinkedIn: common connections
• Twitter: tweets, followers etc. used for personal profiling
• Jigsaw: company data
• Yahoo Finance API: market information
• Customer Support: analysis of tickets
Cross Sell - How
Diagram components: Web Application, Dictionary, Spring Social,
Twitter Parser, Twitter Score Engine, LinkedIn Parser, LinkedIn Score
Engine, Training data set, Yahoo Connector & Formatter, Jigsaw
Connector & Formatter, Customer support Data, Final List of suggestions

1. Previous Modules: refer to previous slides for a description of
repeated modules.
2. Yahoo Connector: fetches data from the Yahoo Finance API and formats
structured/ unstructured data into more structured data which
can be analyzed.
3. Jigsaw Connector: fetches Jigsaw company information over API calls.
Note that this API now appears to have moved to data.com.
4. Final Suggestions: basically a quick aggregator of data with inbuilt
custom logic for scoring and location analysis; that is, once we have
the final list of contacts we overlay the sales rep's location.
5. Algorithms:
– Text: a combination of noun & knowledge extraction from
free text using SOLR & NLP
– Jigsaw: company match to indicate closeness to the selected
customer
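The Final Suggestions step (aggregate the per-source scores, then overlay the sales rep's location) might look roughly like the sketch below. The slides do not describe the actual custom logic, so everything here is assumed for illustration: the weights, the 0.5 score and 50 km distance limits, the record layout and the use of great-circle (haversine) distance.

```python
import math

# Assumed weights for combining per-source scores (each score in [0, 1]).
WEIGHTS = {"linkedin": 0.4, "twitter": 0.2, "support": 0.4}


def combined_score(scores):
    """Weighted sum of per-source scores; a missing source counts as 0."""
    return sum(w * scores.get(src, 0.0) for src, w in WEIGHTS.items())


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def suggest(contacts, rep_location, max_km=50.0, min_score=0.5):
    """Keep contacts that score well and are near the sales rep, best score first."""
    shortlisted = []
    for c in contacts:
        s = combined_score(c["scores"])
        d = haversine_km(rep_location, c["location"])
        if s >= min_score and d <= max_km:
            shortlisted.append({"name": c["name"], "score": s, "km": d})
    return sorted(shortlisted, key=lambda c: c["score"], reverse=True)


# Hypothetical contacts; coordinates are rough (lat, lon) pairs.
contacts = [
    {"name": "Contact A", "location": (28.57, 77.32),
     "scores": {"linkedin": 0.9, "twitter": 0.5, "support": 0.8}},
    {"name": "Contact B", "location": (19.08, 72.88),
     "scores": {"linkedin": 0.9, "twitter": 0.9, "support": 0.9}},
    {"name": "Contact C", "location": (28.46, 77.03),
     "scores": {"linkedin": 0.2, "twitter": 0.1, "support": 0.3}},
]
suggestions = suggest(contacts, rep_location=(28.61, 77.21))
```

With these sample values only Contact A survives: Contact B scores higher but is far from the sales rep, and Contact C is nearby but scores too low. Keeping the scoring and the location filter as separate small functions also fits the short presentment window mentioned earlier, since each part can be precomputed or cached independently.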
Learnings and Possible Uses
Learnings:
1. Data quality: leaving aside the complexity of integration and multiple
data sources, the quality of data and its importance in decision
making, especially in the enterprise world, was the critical learning.
2. In most complicated real-world scenarios there is no one solution
which will fit.
3. Agile: breaking the problem into several smaller problems made life
simpler.
4. Human judgment: whatever the machine shows the sales rep, if
he/ she ignores it and decides to cross sell something else, that
decision has to go back to the machine as learning; otherwise the
intelligence will slowly die away.

Uses:
Multiple; left to the imagination of the reader.
Learnings
Big Picture – Data Quality
Enterprise/ B2B World
Data entry is a cost center and also a cornerstone of
enterprise applications. The data that machines learn
from has mostly been captured by humans over the
years. Data entry is not the most rewarding career,
and mistakes such as wrong addresses, figures and
names are very common. Focusing on the quality of
data entry reduces its speed, which means reduced
volumes.

Public/ B2C World
Imagine Amazon: when you buy a book, what data
does it capture about you? Clicks, geo-IP, browser,
products viewed/ liked/ bought/ searched etc., some
data from cookies, your past searches and your
profile. To place the order, most of us will give the
right address and phone number along with payment
information. As you can see, a lot of the data is
machine generated, which makes analysis more accurate.

Conclusion
• Curing data is possible, but it is important to balance the quality, quantity and cost of data entry by designing
applications which strike the right balance among them.
• Master data management, data quality programs and data curing are all costly affairs if done late in the
enterprise.
• The aggregation of public and private data sets is a reality in today's world, and "identity", that is, identifying an
individual across these data sets, is also a real challenge.
Big Picture – Others
› Machine Learning: work out how much, and what, is required to
solve the problem at hand. Reusing what is already done and
applying it to the business problem is good.
› Big Data: is not the same as data analysis; it can speed up the
analysis and may or may not be applicable to your problem.
› Integrations: are the way to go in future; all these mountains of
data will soon integrate.
› Agility: hit smaller chunks of doable work items and slowly take
down the larger beast.
› Data: data and data quality are tremendously important; a few
hundred bad apples can spoil a lot more.
› Data Scientists: an important position in the overall picture;
complicated scoring/ analysis requires specialized skills.
Tools & References
› Tools:
– The normal Spring JEE stack, with many Spring modules, was
used to develop these applications
– Eclipse was used as the source code editor
– Other tools such as Stanford NLP, Encog and Weka are listed with
links on individual slides
› References:
– There are good courses on Coursera
– The Stanford, Weka and Encog websites also have a lot of reading
material
– Presentation template & graphics provided by Microsoft
Thank You

The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Data Science - Experiments

  • 2. About Me
    ~15 years | ~12 products | Various roles
    Name: Gaurav Marwaha
    Current: Associated with Nucleus Software, with complete ownership of new product development for a loan origination product for banks/ NBFCs. Driving the technology teams to deliver an internally reusable product development framework.
    Past: Successfully led and contributed to multiple product teams in different domains (GIS/ Health/ e-Governance)
    Technology: Java/ big-data/ analytics/ Spring/ ESB (Camel)/ Mobile/ Social
    Product: Conceptualization, Design, Development, Maintenance, EOL, Strategy & roadmap
    Soft: Team building, coaching, mentoring
    LinkedIn: http://in.linkedin.com/in/gauravmarwaha/
  • 3. Table of Contents
    › Introduction
    › Assumptions
    › Experiment 1: Inferring written text
    › Experiment 2: Scoring public data
    › Experiment 3: Discovering cross sell opportunities
    › Learnings
    › Tools & References
  • 5. Introduction
    We generated more digital data in the year 2013 than ever before. Everyone wants to know more about me, from my bank to the places I shop, from Google to the mall store owner. Everyone wants to know what I want before I myself know that I want it.
    Quants have tried for years to predict stock movement based on the history of trades.
    Businesses can leverage the abundantly available data from smartphones, desktops, etc. to make critical CAPEX/ marketing decisions. Knowing how to derive value out of this data is more important today than ever.
  • 6. Assumptions
    This short presentation focuses only on problems which I worked on; it avoids the theoretical aspects of data science.
    › Assuming viewers of this have read about:
      – Language processing: stemming, tokenization, parts of speech tagging
      – Basics of machine learning clustering/ classification techniques
      – Point clouds and dimensional analysis on data using them
      – Java/ J2EE based web application development
    › To my knowledge, none of these experiments became part of a commercial product.
    › I have purposefully kept the presentation focused on learnings, avoiding the nuts & bolts to keep it short.
  • 7. Experiment 1: Inferring written text
  • 8. Scenarios
    Text analysis refers to inferring valuable knowledge from a given piece of text, which may help in further actions/ decisions.
    TEXT ANALYSIS (diagram):
      – Customer Support: auto respond bots for text, auto respond IVR bots, auto email responses to email queries
      – Text Mining: legal text, medical records
      – Social Analysis: Facebook page analyzer, Twitter stream analysis, other sentiment analysis
      – Computer Games: AI games, betting games
      – Decision Support: military use, email analysis
    Challenges:
      1. Slang – we use a lot of phrases which deviate from the defined grammar of a language.
      2. Ambiguity – there is a lot of ambiguity in some sentences, where the speaker may be making a pun or a sarcastic remark.
      3. Language – English and the Hispanic languages are not the only ones spoken, and some users mix languages, such as English + Spanish.
  • 9. Customer Support
    I will limit the discussion to this topic, where a user writes in to customer support during off hours and, instead of receiving a standard reply, the query first goes to a bot which tries to answer it. There can be numerous other use cases for this service; the key elements are:
      1. The calling application – the consumer of the service, which passes in the user query
      2. Text parser – the engine which receives and parses the text
      3. Dictionary – a list of phrases/ words of interest, used to map the query to something that the machine understands
  • 10. Customer Support - How
    Diagram components: Web application, Security Shell: OAuth, Text Parser, Dictionary
    The user keys in a query on a simple "contact us" page. It is first sent to the parser; if a low-score response is received, it is discarded in favor of a pre-decided "we will get back to you" response.
      1. Web application – standard Spring based web application
      2. Security Shell – OAuth provider shell to help with REST based security
      3. Text Parser – Stanford NLP Parser: http://nlp.stanford.edu/software/lex-parser.shtml and the core-NLP package
      4. Notes – dictionary maintenance and finding nouns/ subjects are all part of the standard documentation/ tutorials. The tool also supports languages other than English.
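    The "score the query, fall back on a low score" flow above can be sketched in plain Java. This is a minimal illustration only: it swaps the Stanford parser for simple word matching, and the class name, dictionary entries, answers and the 0.1 threshold are all assumptions made for the example, not details from the experiment.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the "score the query, else fall back" flow. */
class DictionaryResponder {
    // Canned reply for low-scoring queries.
    static final String FALLBACK = "We will get back to you.";
    // Assumed cut-off: share of query words that must match the dictionary.
    static final double MIN_SCORE = 0.1;

    // Hypothetical dictionary: word of interest -> prepared answer.
    static final Map<String, String> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("password", "You can reset your password from the login page.");
        DICTIONARY.put("invoice", "Invoices are listed under Account > Billing.");
    }

    /** Splits a query into lower-case words, dropping punctuation and digits. */
    static String[] words(String query) {
        return query.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    /** Share of words in the query that appear in the dictionary (0.0 to 1.0). */
    static double score(String query) {
        int total = 0;
        int hits = 0;
        for (String word : words(query)) {
            if (word.isEmpty()) continue; // split() can yield a leading empty token
            total++;
            if (DICTIONARY.containsKey(word)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** Answer for the first dictionary word found, or the fallback on a low score. */
    static String respond(String query) {
        if (score(query) < MIN_SCORE) return FALLBACK;
        for (String word : words(query)) {
            String answer = DICTIONARY.get(word);
            if (answer != null) return answer;
        }
        return FALLBACK;
    }
}
```

    Calling `DictionaryResponder.respond("How do I reset my password?")` returns the prepared password answer, while a query with no dictionary words gets the fallback reply. A real parser would match on nouns/ subjects rather than raw words, which is where the quality of the dictionary starts to matter.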
  • 11. Learnings and Possible Uses
    Learnings:
      1. The dictionary is a very critical element; a well-defined dictionary helps identify subjects more easily, with the right scores.
      2. Quality of data is the second key element; spelling mistakes, ambiguous sentences and the emotions of the writer all play different roles. A quick example is Porch/ Porsche: the words are only two letters apart, but the meaning changes completely.
    Uses: Other than customer support, a parser like this can also be used in sentiment analysis or text analysis.
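    One common way to quantify how close two spellings such as Porch and Porsche are is the Levenshtein edit distance. The sketch below is an assumption about how such a check could be done, not something the experiment is known to have used; the class and method names are hypothetical.

```java
/** Hypothetical helper: Levenshtein edit distance between two words. */
class EditDistance {
    /** Minimum number of single-character inserts, deletes or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

    `EditDistance.distance("porch", "porsche")` is 2, so a dictionary lookup that tolerates small distances would treat the two words as near matches; this is exactly the kind of case where data quality decides whether the answer is about a house or a car.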
  • 12. Experiment 2: Scoring public data
  • 13. Scenarios
    All of us generate tons of public data, and businesses can use it to profile us as both existing and prospective customers. A better profiled customer is better served, which can lead to a longer term relationship.
    PUBLIC DATA (diagram):
      – LinkedIn: employment verification, type of connections, recommendations
      – Facebook: personal nature, interests
      – Twitter: following and followed by, tweet sentiment/ text analysis, location data
      – Blogs: text analysis, knowledge
    Challenges:
      1. Privacy – the user has to authorize access to such data.
      2. Authenticity – people may have fake accounts.
      3. Volume – the sheer volume of such data may make it difficult to analyze in a given time.
  • 14. Social/ Public Scores
    The experiment is simple: score an individual from LinkedIn and Twitter data, and use that score in employability checks. There can be numerous other use cases for this service; the key elements are:
      1. Social networks – access to an account/ user's personal data
      2. A learning database that allows the machine to create good/ bad/ neutral clusters from existing data
      3. Choosing the right algorithm to identify the cluster
    Data:
      • LinkedIn: experience, connections, degrees, used for scoring
      • Twitter: tweets, followers, etc., used for personal scoring
  • 15. Social/ Public Scores - How
    Diagram components: Web Application, Spring Social, Twitter Parser, LinkedIn Parser, Dictionary, Training data set, Twitter Score Engine, LinkedIn Score Engine, Final Score aggregator
      1. Spring Social – a standard module from Spring that makes it very easy to get data from social networks into Java applications.
      2. Parsers – once the data is in, we can write parsers/ formatters to cleanse it or move it into application defined standard structures.
      3. Twitter Score Engine – an extension of the textual analysis tool, with the dictionary defining words that bring out substance abuse, gambling and other socially unacceptable characteristics.
      4. LinkedIn Score Engine – the machine was pre-trained on some sample data using standard dimensions provided by LinkedIn. We used Encog and Weka.
      5. Algorithm – we experimented with some basic machine learning algorithms, including Bayesian and K-means, and also tried fuzzy K-means.
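    To make the clustering step concrete, here is a minimal K-means sketch in plain Java. It is not the Encog or Weka setup used in the experiment: the class name is hypothetical, the starting centroids are passed in so the run is deterministic, and the choice of features is left to the caller (two made-up profile dimensions would be enough to try it).

```java
/** Hypothetical sketch of K-means with caller-supplied starting centroids. */
class KMeansSketch {
    /** Index of the centroid closest to point p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Runs a fixed number of assign/update rounds and returns the centroids. */
    static double[][] fit(double[][] points, double[][] initial, int iterations) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = initial[c].clone();

        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the running sum of its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int j = 0; j < dim; j++) sums[c][j] += p[j];
            }
            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster keeps its old centroid
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return centroids;
    }
}
```

    After fitting on labelled sample profiles, a new profile is scored by calling `nearest` against the learned centroids and reading off whether it landed in the good, bad or neutral cluster. Fixing the starting centroids keeps the example reproducible; library implementations usually pick them randomly.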
  • 16. Learnings and Possible Uses
    Learnings:
      1. Privacy laws across countries do not allow access to such data, but companies are circumventing this by launching mobile apps which have access to everything on your smartphone.
      2. To make a machine take sane decisions it is critical to have the right training data; this becomes all the more critical for qualitative attributes.
      3. If you do not have a data scientist/ statistician, you can play with different algorithms. Genetic and neural algorithms may sound cool, but they may not give the desired results.
      4. Weka is a good tool to visualize the execution, and it can also be used to select the right algorithm.
    Uses: This is a very generic public data profiling application; it can have uses in banks, HR departments and many other places.
  • 17. Experiment 3: Discovering cross sell opportunities
  • 18. Scenarios
    This is the most complicated of the three scenarios. Large corporations have hundreds of different products, millions of customers and thousands of salesmen across geographies. What is it that an existing customer will buy next, especially in an enterprise product environment?
    "Say a sales person is visiting a customer and he/ she quickly wants to see what can be sold to this customer."
    Diagram labels: CUSTOMER CONTACT (inclination, connections, common friends, decision authority, personal goals); CUSTOMER SUPPORT (current escalations, last change request, service history); LICENSES; PRODUCTS (features, market/ region installations, data on customers in this market/ region); LOCATION (where?); PRICE (cheap? luxury? average?); MARKETS & REGIONS (data related to the market: maturity, state, etc.)
    Challenges:
      1. Aggregation – data is being aggregated from public and private data stores.
      2. Time – the opportunity presentment window is very short, and a lot of data has to be crunched.
      3. Availability – anytime that any service is down
  • 19. Cross Selling
    This is not a simple experiment; it is an aggregation of multiple public and private data sources. The key elements are:
      1. Speed of decision/ suggestion
      2. Availability of and access to multiple API based services (paid/ free)
      3. Availability of enough data for the machine to have built up the knowledge to take the correct decision
    Data:
      • LinkedIn: common connections
      • Twitter: tweets, followers, etc., used for personal profiling
      • Jigsaw: company data
      • Yahoo Finance API: market information
      • Customer Support: analysis of tickets
  • 20. Cross Sell - How
    Diagram components: Web Application, Spring Social, Twitter Parser, LinkedIn Parser, Dictionary, Training data set, Twitter Score Engine, LinkedIn Score Engine, Yahoo Connector & Formatter, Jigsaw Connector & Formatter, Customer support Data, Final List of suggestions
      1. Previous Modules – refer to the previous slides for a description of the repeated modules.
      2. Yahoo Connector – fetches data from the Yahoo Finance API and formats structured/ unstructured data into more structured data which can be analyzed.
      3. Jigsaw Connector – fetches Jigsaw company information over API calls. Note that this API now appears to have moved to data.com.
      4. Final Suggestions – basically a quick aggregator of data with inbuilt custom logic for scoring and location analysis; that is, once we have the final list of contacts, we overlay the sales rep's location.
      5. Algorithms – Text: a combination of noun & knowledge extraction from free text using SOLR & NLP. Jigsaw: company match to indicate closeness to the selected customer.
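    The "overlay the sales rep's location on the final list" step can be illustrated with a small Java sketch. Everything specific here is assumed for the example: the class and field names, the use of great-circle (haversine) distance, and the idea of a maximum radius. The original custom scoring logic is not described in the slides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: rank scored contacts that are near the sales rep. */
class SuggestionAggregator {
    /** A scored contact with a location; all fields are illustrative. */
    static final class Contact {
        final String name;
        final double score; // aggregated score, higher is better
        final double lat;
        final double lon;

        Contact(String name, double score, double lat, double lon) {
            this.name = name;
            this.score = score;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in kilometres between two lat/lon points. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
    }

    /** Names of contacts within maxKm of the rep, best score first. */
    static List<String> rank(List<Contact> contacts, double repLat, double repLon, double maxKm) {
        List<Contact> nearby = new ArrayList<>();
        for (Contact c : contacts) {
            if (haversineKm(repLat, repLon, c.lat, c.lon) <= maxKm) nearby.add(c);
        }
        nearby.sort(Comparator.comparingDouble((Contact c) -> c.score).reversed());
        List<String> names = new ArrayList<>();
        for (Contact c : nearby) names.add(c.name);
        return names;
    }
}
```

    Filtering by distance before sorting keeps the list short, which matters when the suggestion has to appear within the brief window of a sales visit. In practice the score passed in would be the combined output of the Twitter, LinkedIn, Jigsaw and customer support modules.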
  • 21. Learnings and Possible Uses
    Learnings:
      1. Data quality: leaving aside the complexity of integration and multiple data sources, the quality of data and its importance in decision making, especially in the enterprise world, was the critical learning.
      2. In most complicated real-world scenarios, there is no single solution which will fit.
      3. Agile: breaking the problem into several smaller problems made life simpler.
      4. Human judgment: whatever the machine shows to the sales rep, if he/ she ignores it and decides to cross sell something else, that has to go back to the machine as learning; otherwise the intelligence will slowly die away.
    Uses: Multiple; left to the imagination of the reader.
  • 23. Big Picture – Data Quality
    Enterprise/ B2B World: Data entry is a cost center and also a cornerstone of enterprise applications. The data that machines learn from has mostly been captured by humans over the past years. Data entry is not the most rewarding career, and mistakes such as wrong addresses, figures and names are very common. Focusing on the quality of data entry reduces its speed, which means reduced volumes.
    Public/ B2C World: Imagine Amazon: when you buy a book, what data does it capture about you? Clicks, geo-IP, browser, products viewed/ liked/ bought/ searched, etc., plus some data from cookies, your past searches and your profile. To place the order, most of us will give the right address and phone number along with payment information. As you notice, a lot of this data is machine generated, which makes analysis more accurate.
    Conclusion:
      • Curing data is possible, but it is important to balance the quality, quantity and cost of data entry by designing applications which strike the right balance among these.
      • Master data management, data quality programs and data curing are all costly affairs if done late in the enterprise.
      • The aggregation of public and private data sets is a reality in today's world, and "identity", that is, identifying an individual across these data sets, is also a real challenge.
  • 24. Big Picture – Others
      – Machine Learning: decide how much and what is required to solve the problem at hand. Reusing what is already done and applying it to the business problem is good.
      – Big Data: is not the same as data analysis; it can speed up the analysis and may or may not be applicable to your problem.
      – Integrations: are the way to go in future; all these mountains of data will soon integrate.
      – Agility: hit smaller chunks of doable work items and slowly take down the larger beast.
      – Data: data & data quality are tremendously important; a few hundred bad apples can spoil a lot more.
      – Data Scientists: an important position in the overall picture; complicated scoring/ analysis requires specialized skills.
  • 26. Tools & References
    › Tools:
      – The normal Spring JEE stack, with many Spring modules, was used to develop these applications
      – Eclipse was used as the source code editor
      – Other tools such as Stanford NLP, Encog and Weka are listed with links on the individual slides
    › References:
      – There are good courses on Coursera
      – The Stanford, Weka and Encog websites also have a lot of reading material
      – Presentation template & graphics provided by Microsoft