A talk delivered at
EXTRACT. LOAD. VISUALIZE
Becoming a data ninja
• WEB CRAWLING – THE DESIGN
• DIY – WEB CRAWLING
• TEXT ANALYTICS– THE DESIGN
• CASE STUDY
• DIY – TEXT ANALYTICS
• SOME UNSOLICITED ADVICE? – BUILDING YOUR FIRM
WE WILL TALK OF….
Who are we ?
We are a data science company, founded in 2009– with special interest in making the
world an intelligent place to live in.
We identify data and bring it to light, making it visible, cohesive, comparable and easy to
understand so that it really does support YOU in making the right decisions.
Who am I ?
I am a Practice Lead at JSM for Natural Language Processing & Machine Learning. I have
architected multiple solutions in the area of text analytics for multiple industries like finance,
healthcare, food & beverages & hospitality.
AREAS WE WORK ON
Sales Pitch Analysis
Predictive + IoT
Scoping and Product
NLP, ML, Text
• HTML pages are like textbooks – content, titles, subtitles, paragraphs and so on
A crawler is a program that visits Web sites and reads their pages and other information in order to create
entries for a search engine index.
The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot."
OVERVIEW– WEB CRAWLING
• BEST of the lot
• Gives great flexibility – just click and extract
• Most of the sites are compatible
• Easy CSV/Google Docs Export
• Provides APIs for regular data updates
• Low training time
You want to monitor feedback
You want to stay ahead of competition
You want to create pricing strategy
• Works where Import.io fails
• Bit buggy, but does a god job of providing flexible choices of data extraction
• Most of the sites are compatible
• Easy CSV Export
• NO APIs for regular data updates
• Moderate learning curve
• Don’t scrape too fast or you will get banned
• Respect robots.txt
• Extract only what you need
• Don’t overload their servers
• Don’t take data what’s not yours – only the data in public domain
As a brand owner with significant investments in social media, the
usual questions you might have in mind…
• Is the brand exuding same attributes I intended it to be?
• Is my internet presence helping me ?
• Can I measure my ROI for the money I spent?
• What are the measurable metrics for effective social media
• When can I exploit emerging trends for my brand?
• How can I understand my customers better?
WHAT WOULD YOU WANT?
WHAT WOULD YOU WANT TO TARGET?
Users talk about
share cat videos
and engage in
Users engage with
the brand to air
For a brand, say X, there would be thousands of conversations on blogs,
websites and social media revolving around X.
Several hundred blog posts are published that contain the keywords “X” This
company needs to know:
- How many of these posts are relevant and actually expressing opinions
- How many relevant posts are negative, and how many are positive?
- What particular aspects and features of X are being praised or criticized?
- For all of the above, what is the trend for the past few time periods?
- How is my brand perceived among people demographics?
- How is my brand faring against my competitors?
QUESTIONS TO ANSWER
You build an algorithm, machine learns patterns, machine predicts, rinse & repeat.
Analyzing unstructured text, assign structure, load into a BI/program to visualize
HOW WE HELPED A RESTAURANT SERVE THEIR CUSTOMERS BETTER
The client had thousands of customer reviews which they wanted to analyse - to
understand customer feedback and identify improvement opportunities.
The broad questions we focused on;
What did they say about the
Keywords & topics of discussion across
What elements of the restaurant would
they want improved? – service, staff
behaviour, ambience etc.
When did the customer visit the store?
How is client’s traffic distributed over
Ticket sizes across multiple customer
dimensions – age, gender, ratings,
location, time of visit etc.
Overall customer sentiments & views
PRIMARY FOCUS AREAS SECONDARY FOCUS AREAS
TOPICS KEYWORDS SENTIMENT POINT OF SALES
Initiate ML models ,
NER , parsers &
Initiate detection rules
for topics, keywords,
gender, sentiment and
PRE - PROCESSING PARSING &
Author Value Type Sentiment
Duncan Riley Samsung Galaxy S6 Entity Positive
Duncan Riley Apple Entity Negative
Duncan Riley LoopPay Entity Neutral
Duncan Riley mobile payments Keyword Positive
Duncan Riley point of sales Keyword Positive
Structuring data from free flowing text is easy to use by existing reporting and business intelligence software.
Insights from the final reports can now be used for decision-making by the PR firm and their client
Actual blog post parsed through our
• Purse strings WILL be tightened
• Fewer ‘unicorn-level’ valuations
• No investment without revenue or clients
• MVP with traction – a must have
• Acquire or be acquired – or be a acquisition target
• Needs > Wants – solve a problem, but scope it out
•Bootstrapped startups will win, not the valuation-hungry
Don't sell what you do - disguise it
Ex. Create marketing strategy, not analytics/Tableau
A product that promotes laziness or transfers laziness from
seller to buyer will sell
Is it something that Google/FB provides or can do? Even if it's
partial, danger is real.
The product should be IP-able, scalable and monetize-able -
atleast 2/3 must be met