SlideShare a Scribd company logo
A talk delivered at
IIM Lucknow
24/01/2016
EXTRACT. LOAD. VISUALIZE
Becoming a data ninja
• INTRODUCTION
• WEB CRAWLING – THE DESIGN
• DIY – WEB CRAWLING
• TEXT ANALYTICS– THE DESIGN
• CASE STUDY
• DIY – TEXT ANALYTICS
• SOME UNSOLICITED ADVICE? – BUILDING YOUR FIRM
WE WILL TALK OF….
Who are we ?
We are a data science company, founded in 2009– with special interest in making the
world an intelligent place to live in.
We identify data and bring it to light, making it visible, cohesive, comparable and easy to
understand so that it really does support YOU in making the right decisions.
Who am I ?
I am a Practice Lead at JSM for Natural Language Processing & Machine Learning. I have
architected multiple solutions in the area of text analytics for multiple industries like finance,
healthcare, food & beverages & hospitality.
AREAS WE WORK ON
PHARMA
Sales Pitch Analysis
RETAIL
Predictive + IoT
FINANCE
Competitive
Intelligence
F&B
Customer Insights
MR
Scoping and Product
Evaluation
SaaS
NLP, ML, Text
WHAT DO CLIENTS WANT
TOOLS WE PLAY WITH
OPEN SOURCE
Inexpensive
DATABASES
Fast & Scalable
INSIGHTS
Python, R
TECHNIQUES
Latest yet tested
VISUALIZATIONS
D3, GCharts, Tableau
MANAGEMENT
Basecamp
Low Cost Data Collection
+
Comprehensive Analytics
PART 1
Scraping data from the web
HTML PAGES
• HTML pages are like textbooks – content, titles, subtitles, paragraphs and so on
•Javascript adds interactivity to the HTML pages
HTML PAGES
A crawler is a program that visits Web sites and reads their pages and other information in order to create
entries for a search engine index.
The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot."
OVERVIEW– WEB CRAWLING
BUT, YOU ARE MANAGERS.
DO YOU NEED THIS ?
YES !
You don’t need to write a code, promise.*
* T&Cs apply
Tool 1
Import.io
• BEST of the lot
• Gives great flexibility – just click and extract
• Most of the sites are compatible
• Easy CSV/Google Docs Export
• Provides APIs for regular data updates
• Low training time
INTRODUCTION
USE CASES
You want to monitor feedback
http://www.consumercomplaints.in/snapdeal-com-b100038
USE CASES
You want to stay ahead of competition
http://cashkaro.com/
USE CASES
You want to create pricing strategy
http://www.shopclues.com/mobiles/unboxed-mobiles.html
Tool 2
webscraper.io
• Works where Import.io fails
• Bit buggy, but does a god job of providing flexible choices of data extraction
• Most of the sites are compatible
• Easy CSV Export
• NO APIs for regular data updates
• Moderate learning curve
INTRODUCTION
LET’S GET OUR HANDS DIRTY
• Don’t scrape too fast or you will get banned
• Respect robots.txt
• Extract only what you need
• Don’t overload their servers
• Don’t take data what’s not yours – only the data in public domain
PRECAUTIONS
TEXT ANALYTICS
Mine gold from mountain heaps
As a brand owner with significant investments in social media, the
usual questions you might have in mind…
• Is the brand exuding same attributes I intended it to be?
• Is my internet presence helping me ?
• Can I measure my ROI for the money I spent?
• What are the measurable metrics for effective social media
management?
• When can I exploit emerging trends for my brand?
• How can I understand my customers better?
26
WHAT WOULD YOU WANT?
27
WHAT WOULD YOU WANT TO TARGET?
TRANSACTIONAL
CONVERSATIONS
Users talk about
current, events,
share cat videos
and engage in
trivial gossip
INFORMATIONAL
CONVERSTIONS
Users engage with
the brand to air
appreciation or
complaints.
28
DATA SOURCES
Social
Media
Comments
Brand
Website
Blogs
Customer
Emails
Customer
Surveys
News
Websites
For a brand, say X, there would be thousands of conversations on blogs,
websites and social media revolving around X.
Several hundred blog posts are published that contain the keywords “X” This
company needs to know:
- How many of these posts are relevant and actually expressing opinions
about X?
- How many relevant posts are negative, and how many are positive?
- What particular aspects and features of X are being praised or criticized?
- For all of the above, what is the trend for the past few time periods?
- How is my brand perceived among people demographics?
- How is my brand faring against my competitors?
29
QUESTIONS TO ANSWER
BRIEF TERMINOLOGY
You build an algorithm, machine learns patterns, machine predicts, rinse & repeat.
MACHINE LEARNING
TEXT ANALYTICS
Analyzing unstructured text, assign structure, load into a BI/program to visualize
CASE STUDY
HOW WE HELPED A RESTAURANT SERVE THEIR CUSTOMERS BETTER
PROBLEM STATEMENT
The client had thousands of customer reviews which they wanted to analyse - to
understand customer feedback and identify improvement opportunities.
The broad questions we focused on;
What did they say about the
restaurant?
Keywords & topics of discussion across
the comments
What elements of the restaurant would
they want improved? – service, staff
behaviour, ambience etc.
When did the customer visit the store?
How is client’s traffic distributed over
time?
Ticket sizes across multiple customer
dimensions – age, gender, ratings,
location, time of visit etc.
Overall customer sentiments & views
about UCH
PRIMARY FOCUS AREAS SECONDARY FOCUS AREAS
FOCUS AREAS
TOPICS KEYWORDS SENTIMENT POINT OF SALES
APPROACH
Extract data
and validate
Corpus from
social media
Tokenise and
remove stop
words
Initiate ML models ,
NER , parsers &
topic algorithms
Initiate detection rules
for topics, keywords,
gender, sentiment and
multi-word concept
detection
Final Output
PRE - PROCESSING PARSING &
ANALYSIS
OUTPUT
Part of
Speech (POS)
Tagger
Author Value Type Sentiment
Duncan Riley Samsung Galaxy S6 Entity Positive
Duncan Riley Apple Entity Negative
Duncan Riley LoopPay Entity Neutral
Duncan Riley mobile payments Keyword Positive
Duncan Riley point of sales Keyword Positive
Structuring data from free flowing text is easy to use by existing reporting and business intelligence software.
Insights from the final reports can now be used for decision-making by the PR firm and their client
Actual blog post parsed through our
SmartText Engine
LUNCHBOX
www.lunchbox.subcortext.com
RATINGS – HOW MUCH IS BETTER ?
Increasing scope of differentiating operational improvements
Decreasing scope of customer loyalty
OUR PLATFORM
7.7 Mn 92.6 K 62.6 K16
reviews restaurants user profilestopics
As on 28th October, 2015
LUNCHBOX - SCREENSHOTS & MOCKUPS
Lunchbox – Search Screen
Lunchbox – ViewMe
Lunchbox – ViewMe
Lunchbox – RankMe
Lunchbox – MarketMe
BUT HEY ! THIS NEEDS ME TO WRITE A CODE.
YOU LIAR !
SCRAPE > VOSVIEWER > GEPHI > TABLEAU
Scrape the web !
Excel
= ISNUMBER(SEARCH($N$1,H2))
VOSViewer
Gephi
Tableau
LET’S GET OUR HANDS DIRTY, AGAIN !
SOME UNSOLICITED ADVICE ?
• Purse strings WILL be tightened
• Fewer ‘unicorn-level’ valuations
• No investment without revenue or clients
• MVP with traction – a must have
• Acquire or be acquired – or be a acquisition target
• Needs > Wants – solve a problem, but scope it out
•Bootstrapped startups will win, not the valuation-hungry
PREDICTIONS
Don't sell what you do - disguise it
Ex. Create marketing strategy, not analytics/Tableau
Nugget 1
A product that promotes laziness or transfers laziness from
seller to buyer will sell
Nugget 2
Is it something that Google/FB provides or can do? Even if it's
partial, danger is real.
Nugget 3
The product should be IP-able, scalable and monetize-able -
atleast 2/3 must be met
Nugget 4
Read Read Read
techcruch, recode, crunchbase, yourstory, nextbigwhat,
vccircle
Nugget 5
QUESTIONS ?
MANAS RANJAN KAR
manas@jsm.email
+91-9971 420 188
www.unlocktext.com
#connect

More Related Content

What's hot

The Performance Content Framework
The Performance Content FrameworkThe Performance Content Framework
The Performance Content Framework
Performics EMEA
 
EIA2018Italy - Emad Saif - Business & Revenue Models
EIA2018Italy - Emad Saif - Business & Revenue ModelsEIA2018Italy - Emad Saif - Business & Revenue Models
EIA2018Italy - Emad Saif - Business & Revenue Models
European Innovation Academy
 
The New Publishing Skillset: Change Management in a Creative Industry - Tech ...
The New Publishing Skillset: Change Management in a Creative Industry - Tech ...The New Publishing Skillset: Change Management in a Creative Industry - Tech ...
The New Publishing Skillset: Change Management in a Creative Industry - Tech ...
BookNet Canada
 
UW Entrefest - 10 Startup Business Models that work and 3 that fail
UW Entrefest - 10 Startup Business Models that work and 3 that fail UW Entrefest - 10 Startup Business Models that work and 3 that fail
UW Entrefest - 10 Startup Business Models that work and 3 that fail
Dave Parker
 
#GetUngagged - Scaling SEO Services
#GetUngagged - Scaling SEO Services#GetUngagged - Scaling SEO Services
#GetUngagged - Scaling SEO Services
Hannah Thorpe
 
An open letter to marketers, AI won’t steal your job
An open letter to marketers, AI won’t steal your jobAn open letter to marketers, AI won’t steal your job
An open letter to marketers, AI won’t steal your job
Darcie Thompson-Fields
 
How to become Data Driven for startups - keboola
How to become Data Driven for startups - keboolaHow to become Data Driven for startups - keboola
How to become Data Driven for startups - keboola
Pavel Dolezal
 
Janna Bastow — The Power of Product Focus (Turing Festival 2016)
Janna Bastow — The Power of Product Focus (Turing Festival 2016)Janna Bastow — The Power of Product Focus (Turing Festival 2016)
Janna Bastow — The Power of Product Focus (Turing Festival 2016)
Turing Fest
 
Personalization Everywhere! Create a Personalization Strategy
Personalization Everywhere! Create a Personalization StrategyPersonalization Everywhere! Create a Personalization Strategy
Personalization Everywhere! Create a Personalization Strategy
Hull
 
Demand quest seo training
Demand quest seo trainingDemand quest seo training
Demand quest seo training
Nate Plaunt
 
Attract and match for icx
Attract and match for icxAttract and match for icx
Attract and match for icx
Ramita Vig
 
RankWatch Brief
RankWatch BriefRankWatch Brief
RankWatch Brief
Amitab Dev
 
How to Scale Sales Development with Two and a Half Reps
How to Scale Sales Development with Two and a Half RepsHow to Scale Sales Development with Two and a Half Reps
How to Scale Sales Development with Two and a Half Reps
Sales Hacker
 
12 Key Levers of SaaS Success
12 Key Levers of SaaS Success12 Key Levers of SaaS Success
12 Key Levers of SaaS Success
saastr
 
The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014
The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014
The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014
Craig Sullivan
 
Machine Learning is Much More Than Product Recommendations
Machine Learning is Much More Than Product RecommendationsMachine Learning is Much More Than Product Recommendations
Machine Learning is Much More Than Product Recommendations
Skava
 
UX Conversion Camp: Matalan
UX Conversion Camp: MatalanUX Conversion Camp: Matalan
UX Conversion Camp: Matalan
Lisa Duddington MSc
 
Onboard like a juggernaut - Elite camp 2015
Onboard like a juggernaut - Elite camp 2015Onboard like a juggernaut - Elite camp 2015
Onboard like a juggernaut - Elite camp 2015
Conversionista
 
Building Shopify Store Proposal PowerPoint Presentation Slides
Building Shopify Store Proposal PowerPoint Presentation SlidesBuilding Shopify Store Proposal PowerPoint Presentation Slides
Building Shopify Store Proposal PowerPoint Presentation Slides
SlideTeam
 

What's hot (19)

The Performance Content Framework
The Performance Content FrameworkThe Performance Content Framework
The Performance Content Framework
 
EIA2018Italy - Emad Saif - Business & Revenue Models
EIA2018Italy - Emad Saif - Business & Revenue ModelsEIA2018Italy - Emad Saif - Business & Revenue Models
EIA2018Italy - Emad Saif - Business & Revenue Models
 
The New Publishing Skillset: Change Management in a Creative Industry - Tech ...
The New Publishing Skillset: Change Management in a Creative Industry - Tech ...The New Publishing Skillset: Change Management in a Creative Industry - Tech ...
The New Publishing Skillset: Change Management in a Creative Industry - Tech ...
 
UW Entrefest - 10 Startup Business Models that work and 3 that fail
UW Entrefest - 10 Startup Business Models that work and 3 that fail UW Entrefest - 10 Startup Business Models that work and 3 that fail
UW Entrefest - 10 Startup Business Models that work and 3 that fail
 
#GetUngagged - Scaling SEO Services
#GetUngagged - Scaling SEO Services#GetUngagged - Scaling SEO Services
#GetUngagged - Scaling SEO Services
 
An open letter to marketers, AI won’t steal your job
An open letter to marketers, AI won’t steal your jobAn open letter to marketers, AI won’t steal your job
An open letter to marketers, AI won’t steal your job
 
How to become Data Driven for startups - keboola
How to become Data Driven for startups - keboolaHow to become Data Driven for startups - keboola
How to become Data Driven for startups - keboola
 
Janna Bastow — The Power of Product Focus (Turing Festival 2016)
Janna Bastow — The Power of Product Focus (Turing Festival 2016)Janna Bastow — The Power of Product Focus (Turing Festival 2016)
Janna Bastow — The Power of Product Focus (Turing Festival 2016)
 
Personalization Everywhere! Create a Personalization Strategy
Personalization Everywhere! Create a Personalization StrategyPersonalization Everywhere! Create a Personalization Strategy
Personalization Everywhere! Create a Personalization Strategy
 
Demand quest seo training
Demand quest seo trainingDemand quest seo training
Demand quest seo training
 
Attract and match for icx
Attract and match for icxAttract and match for icx
Attract and match for icx
 
RankWatch Brief
RankWatch BriefRankWatch Brief
RankWatch Brief
 
How to Scale Sales Development with Two and a Half Reps
How to Scale Sales Development with Two and a Half RepsHow to Scale Sales Development with Two and a Half Reps
How to Scale Sales Development with Two and a Half Reps
 
12 Key Levers of SaaS Success
12 Key Levers of SaaS Success12 Key Levers of SaaS Success
12 Key Levers of SaaS Success
 
The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014
The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014
The Neuromarketing Toolkit - Chinwag Psych - 4 Feb 2014
 
Machine Learning is Much More Than Product Recommendations
Machine Learning is Much More Than Product RecommendationsMachine Learning is Much More Than Product Recommendations
Machine Learning is Much More Than Product Recommendations
 
UX Conversion Camp: Matalan
UX Conversion Camp: MatalanUX Conversion Camp: Matalan
UX Conversion Camp: Matalan
 
Onboard like a juggernaut - Elite camp 2015
Onboard like a juggernaut - Elite camp 2015Onboard like a juggernaut - Elite camp 2015
Onboard like a juggernaut - Elite camp 2015
 
Building Shopify Store Proposal PowerPoint Presentation Slides
Building Shopify Store Proposal PowerPoint Presentation SlidesBuilding Shopify Store Proposal PowerPoint Presentation Slides
Building Shopify Store Proposal PowerPoint Presentation Slides
 

Viewers also liked

Adrian ,michael
Adrian ,michaelAdrian ,michael
Adrian ,michael
Adriancho Cevallos
 
Brangaccio Resume2
Brangaccio Resume2Brangaccio Resume2
Brangaccio Resume2
John Brangaccio
 
финансовый учет (урок 1)
финансовый учет (урок 1)финансовый учет (урок 1)
финансовый учет (урок 1)
Aidana Baimurzayeva
 
The Brave Heart - Shubhangi Satpute
The Brave Heart  - Shubhangi SatputeThe Brave Heart  - Shubhangi Satpute
The Brave Heart - Shubhangi Satpute
Amit Sikdar
 
Energy outlook 2009_el
Energy outlook 2009_el Energy outlook 2009_el
Energy outlook 2009_el
Bestman Fdsf
 
Animação de Imagens e Cinema de Animação
Animação de Imagens e Cinema de AnimaçãoAnimação de Imagens e Cinema de Animação
Animação de Imagens e Cinema de Animação
Jose Alberto Rodrigues
 
Rita karter kak_rabotaet_mozg
Rita karter kak_rabotaet_mozgRita karter kak_rabotaet_mozg
Rita karter kak_rabotaet_mozgExvet12
 
PROFILE
PROFILEPROFILE
PROFILE
farooq ahamed
 
μακροοικονομικη θεωρια
μακροοικονομικη θεωριαμακροοικονομικη θεωρια
μακροοικονομικη θεωρια
Bestman Fdsf
 
Portfolio
PortfolioPortfolio
Portfolio
Saikat Das
 
5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...
5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...
5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...
Академия интернет-маркетинга «WebPromoExperts»
 
How to fit social media in your mobile strategy
How to fit social media in your mobile strategyHow to fit social media in your mobile strategy
How to fit social media in your mobile strategy
In the Pocket
 
CERLIS 2011: Students doing popular science: Visual communication in an emerg...
CERLIS 2011: Students doing popular science: Visual communication in an emerg...CERLIS 2011: Students doing popular science: Visual communication in an emerg...
CERLIS 2011: Students doing popular science: Visual communication in an emerg...
cahafner
 
Ficha y a ti que te importa
Ficha y a ti que te importaFicha y a ti que te importa
Ficha y a ti que te importa
Esthervampire
 

Viewers also liked (14)

Adrian ,michael
Adrian ,michaelAdrian ,michael
Adrian ,michael
 
Brangaccio Resume2
Brangaccio Resume2Brangaccio Resume2
Brangaccio Resume2
 
финансовый учет (урок 1)
финансовый учет (урок 1)финансовый учет (урок 1)
финансовый учет (урок 1)
 
The Brave Heart - Shubhangi Satpute
The Brave Heart  - Shubhangi SatputeThe Brave Heart  - Shubhangi Satpute
The Brave Heart - Shubhangi Satpute
 
Energy outlook 2009_el
Energy outlook 2009_el Energy outlook 2009_el
Energy outlook 2009_el
 
Animação de Imagens e Cinema de Animação
Animação de Imagens e Cinema de AnimaçãoAnimação de Imagens e Cinema de Animação
Animação de Imagens e Cinema de Animação
 
Rita karter kak_rabotaet_mozg
Rita karter kak_rabotaet_mozgRita karter kak_rabotaet_mozg
Rita karter kak_rabotaet_mozg
 
PROFILE
PROFILEPROFILE
PROFILE
 
μακροοικονομικη θεωρια
μακροοικονομικη θεωριαμακροοικονομικη θεωρια
μακροοικονομικη θεωρια
 
Portfolio
PortfolioPortfolio
Portfolio
 
5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...
5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...
5 признаков, что Ваш email маркетинг уже перешагнул в 2016г. Вебинар WebPromo...
 
How to fit social media in your mobile strategy
How to fit social media in your mobile strategyHow to fit social media in your mobile strategy
How to fit social media in your mobile strategy
 
CERLIS 2011: Students doing popular science: Visual communication in an emerg...
CERLIS 2011: Students doing popular science: Visual communication in an emerg...CERLIS 2011: Students doing popular science: Visual communication in an emerg...
CERLIS 2011: Students doing popular science: Visual communication in an emerg...
 
Ficha y a ti que te importa
Ficha y a ti que te importaFicha y a ti que te importa
Ficha y a ti que te importa
 

Similar to EXTRACT, LOAD, VISUALIZE Become a Data Ninja

Building an Online Presence
Building an Online PresenceBuilding an Online Presence
Building an Online Presence
Renée Nesseth
 
Six Things To Make Analytics Work - Exponea
Six Things To Make Analytics Work - ExponeaSix Things To Make Analytics Work - Exponea
Six Things To Make Analytics Work - Exponea
Nexteria
 
HacktoberFestPune - DSC MESCOE x DSC PVGCOET
HacktoberFestPune - DSC MESCOE x DSC PVGCOETHacktoberFestPune - DSC MESCOE x DSC PVGCOET
HacktoberFestPune - DSC MESCOE x DSC PVGCOET
TanyaRaina3
 
Blitzscaling Session 9: Village Stage
Blitzscaling Session 9: Village StageBlitzscaling Session 9: Village Stage
Blitzscaling Session 9: Village Stage
Greylock Partners
 
Мастер-класс Стартап Академии Сколково 16/02 на Get2Know
Мастер-класс Стартап Академии Сколково 16/02 на Get2KnowМастер-класс Стартап Академии Сколково 16/02 на Get2Know
Мастер-класс Стартап Академии Сколково 16/02 на Get2Know
get-to-know
 
Transitioning to-lean-at-infochimps
Transitioning to-lean-at-infochimpsTransitioning to-lean-at-infochimps
Transitioning to-lean-at-infochimps
Ash Maurya
 
Dashlane Mission Teams
Dashlane Mission TeamsDashlane Mission Teams
Dashlane Mission Teams
Dashlane
 
Scaling SEO by Building Products - Search London Meetup Nov 17
Scaling SEO by Building Products - Search London Meetup Nov 17Scaling SEO by Building Products - Search London Meetup Nov 17
Scaling SEO by Building Products - Search London Meetup Nov 17
Fabrizio Ballarini
 
Your Roadmap, Your Product Story & Datadriven Product Management
Your Roadmap, Your Product Story & Datadriven Product ManagementYour Roadmap, Your Product Story & Datadriven Product Management
Your Roadmap, Your Product Story & Datadriven Product Management
Product School
 
ferret_company_facts_en(30.03.17)
ferret_company_facts_en(30.03.17)ferret_company_facts_en(30.03.17)
ferret_company_facts_en(30.03.17)
ferretslides
 
Conversion Optimization Framework to Build Sustainable and Repeat Growth
Conversion Optimization Framework to Build Sustainable and Repeat GrowthConversion Optimization Framework to Build Sustainable and Repeat Growth
Conversion Optimization Framework to Build Sustainable and Repeat Growth
Tushar Purohit
 
Developing a cult of analytics
Developing a cult of analyticsDeveloping a cult of analytics
Developing a cult of analytics
Steve Jackson
 
Your competition now matters: Building a SaaS Company Isn't What it Used to B...
Your competition now matters: Building a SaaS Company Isn't What it Used to B...Your competition now matters: Building a SaaS Company Isn't What it Used to B...
Your competition now matters: Building a SaaS Company Isn't What it Used to B...
Price Intelligently
 
Lean Startup Tools for Scrum Product Owners
Lean Startup Tools for Scrum Product OwnersLean Startup Tools for Scrum Product Owners
Lean Startup Tools for Scrum Product Owners
TechWell
 
Finding the one growth metric that matters
Finding the one growth metric that mattersFinding the one growth metric that matters
Finding the one growth metric that matters
Sean Ellis
 
Vertical in 60 days
Vertical in 60 daysVertical in 60 days
Vertical in 60 days
marc714376
 
World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...
World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...
World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...
Julie Severance
 
Discover Your Searchability Factor by SEO King
Discover Your Searchability Factor by SEO KingDiscover Your Searchability Factor by SEO King
Discover Your Searchability Factor by SEO King
IMSeoKing.com
 
Attribution Reporting in HubSpot
Attribution Reporting in HubSpotAttribution Reporting in HubSpot
Attribution Reporting in HubSpot
Whitehat Inbound Marketing Agency
 
Data Discovery and BI - Is there Really a Difference?
Data Discovery and BI - Is there Really a Difference?Data Discovery and BI - Is there Really a Difference?
Data Discovery and BI - Is there Really a Difference?
Inside Analysis
 

Similar to EXTRACT, LOAD, VISUALIZE Become a Data Ninja (20)

Building an Online Presence
Building an Online PresenceBuilding an Online Presence
Building an Online Presence
 
Six Things To Make Analytics Work - Exponea
Six Things To Make Analytics Work - ExponeaSix Things To Make Analytics Work - Exponea
Six Things To Make Analytics Work - Exponea
 
HacktoberFestPune - DSC MESCOE x DSC PVGCOET
HacktoberFestPune - DSC MESCOE x DSC PVGCOETHacktoberFestPune - DSC MESCOE x DSC PVGCOET
HacktoberFestPune - DSC MESCOE x DSC PVGCOET
 
Blitzscaling Session 9: Village Stage
Blitzscaling Session 9: Village StageBlitzscaling Session 9: Village Stage
Blitzscaling Session 9: Village Stage
 
Мастер-класс Стартап Академии Сколково 16/02 на Get2Know
Мастер-класс Стартап Академии Сколково 16/02 на Get2KnowМастер-класс Стартап Академии Сколково 16/02 на Get2Know
Мастер-класс Стартап Академии Сколково 16/02 на Get2Know
 
Transitioning to-lean-at-infochimps
Transitioning to-lean-at-infochimpsTransitioning to-lean-at-infochimps
Transitioning to-lean-at-infochimps
 
Dashlane Mission Teams
Dashlane Mission TeamsDashlane Mission Teams
Dashlane Mission Teams
 
Scaling SEO by Building Products - Search London Meetup Nov 17
Scaling SEO by Building Products - Search London Meetup Nov 17Scaling SEO by Building Products - Search London Meetup Nov 17
Scaling SEO by Building Products - Search London Meetup Nov 17
 
Your Roadmap, Your Product Story & Datadriven Product Management
Your Roadmap, Your Product Story & Datadriven Product ManagementYour Roadmap, Your Product Story & Datadriven Product Management
Your Roadmap, Your Product Story & Datadriven Product Management
 
ferret_company_facts_en(30.03.17)
ferret_company_facts_en(30.03.17)ferret_company_facts_en(30.03.17)
ferret_company_facts_en(30.03.17)
 
Conversion Optimization Framework to Build Sustainable and Repeat Growth
Conversion Optimization Framework to Build Sustainable and Repeat GrowthConversion Optimization Framework to Build Sustainable and Repeat Growth
Conversion Optimization Framework to Build Sustainable and Repeat Growth
 
Developing a cult of analytics
Developing a cult of analyticsDeveloping a cult of analytics
Developing a cult of analytics
 
Your competition now matters: Building a SaaS Company Isn't What it Used to B...
Your competition now matters: Building a SaaS Company Isn't What it Used to B...Your competition now matters: Building a SaaS Company Isn't What it Used to B...
Your competition now matters: Building a SaaS Company Isn't What it Used to B...
 
Lean Startup Tools for Scrum Product Owners
Lean Startup Tools for Scrum Product OwnersLean Startup Tools for Scrum Product Owners
Lean Startup Tools for Scrum Product Owners
 
Finding the one growth metric that matters
Finding the one growth metric that mattersFinding the one growth metric that matters
Finding the one growth metric that matters
 
Vertical in 60 days
Vertical in 60 daysVertical in 60 days
Vertical in 60 days
 
World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...
World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...
World of Watson 2016: Journey to Cognitive Excellence - Harness the Force of ...
 
Discover Your Searchability Factor by SEO King
Discover Your Searchability Factor by SEO KingDiscover Your Searchability Factor by SEO King
Discover Your Searchability Factor by SEO King
 
Attribution Reporting in HubSpot
Attribution Reporting in HubSpotAttribution Reporting in HubSpot
Attribution Reporting in HubSpot
 
Data Discovery and BI - Is there Really a Difference?
Data Discovery and BI - Is there Really a Difference?Data Discovery and BI - Is there Really a Difference?
Data Discovery and BI - Is there Really a Difference?
 

Recently uploaded

一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 

Recently uploaded (20)

一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 

EXTRACT, LOAD, VISUALIZE Become a Data Ninja

  • 1. A talk delivered at IIM Lucknow 24/01/2016 EXTRACT. LOAD. VISUALIZE Becoming a data ninja
  • 2.
  • 3.
  • 4. • INTRODUCTION • WEB CRAWLING – THE DESIGN • DIY – WEB CRAWLING • TEXT ANALYTICS– THE DESIGN • CASE STUDY • DIY – TEXT ANALYTICS • SOME UNSOLICITED ADVICE? – BUILDING YOUR FIRM WE WILL TALK OF….
  • 5. Who are we ? We are a data science company, founded in 2009– with special interest in making the world an intelligent place to live in. We identify data and bring it to light, making it visible, cohesive, comparable and easy to understand so that it really does support YOU in making the right decisions. Who am I ? I am a Practice Lead at JSM for Natural Language Processing & Machine Learning. I have architected multiple solutions in the area of text analytics for multiple industries like finance, healthcare, food & beverages & hospitality.
  • 6. AREAS WE WORK ON PHARMA Sales Pitch Analysis RETAIL Predictive + IoT FINANCE Competitive Intelligence F&B Customer Insights MR Scoping and Product Evaluation SaaS NLP, ML, Text
  • 8. TOOLS WE PLAY WITH OPEN SOURCE Inexpensive DATABASES Fast & Scalable INSIGHTS Python, R TECHNIQUES Latest yet tested VISUALIZATIONS D3, GCharts, Tableau MANAGEMENT Basecamp
  • 9. Low Cost Data Collection + Comprehensive Analytics
  • 10. PART 1 Scraping data from the web
  • 12. • HTML pages are like textbooks – content, titles, subtitles, paragraphs and so on •Javascript adds interactivity to the HTML pages HTML PAGES
  • 13. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot." OVERVIEW– WEB CRAWLING
  • 14. BUT, YOU ARE MANAGERS. DO YOU NEED THIS ?
  • 15. YES ! You don’t need to write a code, promise.* * T&Cs apply
  • 17. • BEST of the lot • Gives great flexibility – just click and extract • Most of the sites are compatible • Easy CSV/Google Docs Export • Provides APIs for regular data updates • Low training time INTRODUCTION
  • 18. USE CASES You want to monitor feedback http://www.consumercomplaints.in/snapdeal-com-b100038
  • 19. USE CASES You want to stay ahead of competition http://cashkaro.com/
  • 20. USE CASES You want to create pricing strategy http://www.shopclues.com/mobiles/unboxed-mobiles.html
  • 22. • Works where Import.io fails • Bit buggy, but does a god job of providing flexible choices of data extraction • Most of the sites are compatible • Easy CSV Export • NO APIs for regular data updates • Moderate learning curve INTRODUCTION
  • 23. LET’S GET OUR HANDS DIRTY
  • 24. • Don’t scrape too fast or you will get banned • Respect robots.txt • Extract only what you need • Don’t overload their servers • Don’t take data what’s not yours – only the data in public domain PRECAUTIONS
  • 25. TEXT ANALYTICS Mine gold from mountain heaps
  • 26. As a brand owner with significant investments in social media, the usual questions you might have in mind… • Is the brand exuding same attributes I intended it to be? • Is my internet presence helping me ? • Can I measure my ROI for the money I spent? • What are the measurable metrics for effective social media management? • When can I exploit emerging trends for my brand? • How can I understand my customers better? 26 WHAT WOULD YOU WANT?
  • 27. 27 WHAT WOULD YOU WANT TO TARGET? TRANSACTIONAL CONVERSATIONS Users talk about current, events, share cat videos and engage in trivial gossip INFORMATIONAL CONVERSTIONS Users engage with the brand to air appreciation or complaints.
  • 29. For a brand, say X, there would be thousands of conversations on blogs, websites and social media revolving around X. Several hundred blog posts are published that contain the keywords “X” This company needs to know: - How many of these posts are relevant and actually expressing opinions about X? - How many relevant posts are negative, and how many are positive? - What particular aspects and features of X are being praised or criticized? - For all of the above, what is the trend for the past few time periods? - How is my brand perceived among people demographics? - How is my brand faring against my competitors? 29 QUESTIONS TO ANSWER
  • 30. BRIEF TERMINOLOGY You build an algorithm, machine learns patterns, machine predicts, rinse & repeat. MACHINE LEARNING TEXT ANALYTICS Analyzing unstructured text, assign structure, load into a BI/program to visualize
  • 31. CASE STUDY HOW WE HELPED A RESTAURANT SERVE THEIR CUSTOMERS BETTER
  • 32. PROBLEM STATEMENT The client had thousands of customer reviews which they wanted to analyse - to understand customer feedback and identify improvement opportunities. The broad questions we focused on; What did they say about the restaurant? Keywords & topics of discussion across the comments What elements of the restaurant would they want improved? – service, staff behaviour, ambience etc. When did the customer visit the store? How is client’s traffic distributed over time? Ticket sizes across multiple customer dimensions – age, gender, ratings, location, time of visit etc. Overall customer sentiments & views about UCH PRIMARY FOCUS AREAS SECONDARY FOCUS AREAS
  • 33. FOCUS AREAS TOPICS KEYWORDS SENTIMENT POINT OF SALES
  • 34. APPROACH Extract data and validate Corpus from social media Tokenise and remove stop words Initiate ML models , NER , parsers & topic algorithms Initiate detection rules for topics, keywords, gender, sentiment and multi-word concept detection Final Output PRE - PROCESSING PARSING & ANALYSIS OUTPUT Part of Speech (POS) Tagger
  • 35. Author Value Type Sentiment Duncan Riley Samsung Galaxy S6 Entity Positive Duncan Riley Apple Entity Negative Duncan Riley LoopPay Entity Neutral Duncan Riley mobile payments Keyword Positive Duncan Riley point of sales Keyword Positive Structuring data from free flowing text is easy to use by existing reporting and business intelligence software. Insights from the final reports can now be used for decision-making by the PR firm and their client Actual blog post parsed through our SmartText Engine
  • 37. RATINGS – HOW MUCH IS BETTER ? Increasing scope of differentiating operational improvements Decreasing scope of customer loyalty
  • 38. OUR PLATFORM 7.7 Mn 92.6 K 62.6 K16 reviews restaurants user profilestopics As on 28th October, 2015
  • 45. BUT HEY ! THIS NEEDS ME TO WRITE A CODE. YOU LIAR !
  • 46. SCRAPE > VOSVIEWER > GEPHI > TABLEAU
  • 48. Excel
  • 51. Gephi
  • 53. LET’S GET OUR HANDS DIRTY, AGAIN !
  • 55.
  • 56. • Purse strings WILL be tightened • Fewer ‘unicorn-level’ valuations • No investment without revenue or clients • MVP with traction – a must have • Acquire or be acquired – or be a acquisition target • Needs > Wants – solve a problem, but scope it out •Bootstrapped startups will win, not the valuation-hungry PREDICTIONS
  • 57. Don't sell what you do - disguise it Ex. Create marketing strategy, not analytics/Tableau Nugget 1
  • 58. A product that promotes laziness or transfers laziness from seller to buyer will sell Nugget 2
  • 59. Is it something that Google/FB provides or can do? Even if it's partial, danger is real. Nugget 3
  • 60. The product should be IP-able, scalable and monetize-able - atleast 2/3 must be met Nugget 4
  • 61. Read Read Read techcruch, recode, crunchbase, yourstory, nextbigwhat, vccircle Nugget 5
  • 63. MANAS RANJAN KAR manas@jsm.email +91-9971 420 188 www.unlocktext.com #connect