Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

EXTRACT, LOAD, VISUALIZE Become a Data Ninja

A thorough understanding of how to become a data ninja

  • Login to see the comments

EXTRACT, LOAD, VISUALIZE Become a Data Ninja

  1. 1. A talk delivered at IIM Lucknow 24/01/2016 EXTRACT. LOAD. VISUALIZE Becoming a data ninja
  2. 2. • INTRODUCTION • WEB CRAWLING – THE DESIGN • DIY – WEB CRAWLING • TEXT ANALYTICS– THE DESIGN • CASE STUDY • DIY – TEXT ANALYTICS • SOME UNSOLICITED ADVICE? – BUILDING YOUR FIRM WE WILL TALK OF….
  3. 3. Who are we ? We are a data science company, founded in 2009– with special interest in making the world an intelligent place to live in. We identify data and bring it to light, making it visible, cohesive, comparable and easy to understand so that it really does support YOU in making the right decisions. Who am I ? I am a Practice Lead at JSM for Natural Language Processing & Machine Learning. I have architected multiple solutions in the area of text analytics for multiple industries like finance, healthcare, food & beverages & hospitality.
  4. 4. AREAS WE WORK ON PHARMA Sales Pitch Analysis RETAIL Predictive + IoT FINANCE Competitive Intelligence F&B Customer Insights MR Scoping and Product Evaluation SaaS NLP, ML, Text
  5. 5. WHAT DO CLIENTS WANT
  6. 6. TOOLS WE PLAY WITH OPEN SOURCE Inexpensive DATABASES Fast & Scalable INSIGHTS Python, R TECHNIQUES Latest yet tested VISUALIZATIONS D3, GCharts, Tableau MANAGEMENT Basecamp
  7. 7. Low Cost Data Collection + Comprehensive Analytics
  8. 8. PART 1 Scraping data from the web
  9. 9. HTML PAGES
  10. 10. • HTML pages are like textbooks – content, titles, subtitles, paragraphs and so on •Javascript adds interactivity to the HTML pages HTML PAGES
  11. 11. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot." OVERVIEW– WEB CRAWLING
  12. 12. BUT, YOU ARE MANAGERS. DO YOU NEED THIS ?
  13. 13. YES ! You don’t need to write a code, promise.* * T&Cs apply
  14. 14. Tool 1 Import.io
  15. 15. • BEST of the lot • Gives great flexibility – just click and extract • Most of the sites are compatible • Easy CSV/Google Docs Export • Provides APIs for regular data updates • Low training time INTRODUCTION
  16. 16. USE CASES You want to monitor feedback http://www.consumercomplaints.in/snapdeal-com-b100038
  17. 17. USE CASES You want to stay ahead of competition http://cashkaro.com/
  18. 18. USE CASES You want to create pricing strategy http://www.shopclues.com/mobiles/unboxed-mobiles.html
  19. 19. Tool 2 webscraper.io
  20. 20. • Works where Import.io fails • Bit buggy, but does a god job of providing flexible choices of data extraction • Most of the sites are compatible • Easy CSV Export • NO APIs for regular data updates • Moderate learning curve INTRODUCTION
  21. 21. LET’S GET OUR HANDS DIRTY
  22. 22. • Don’t scrape too fast or you will get banned • Respect robots.txt • Extract only what you need • Don’t overload their servers • Don’t take data what’s not yours – only the data in public domain PRECAUTIONS
  23. 23. TEXT ANALYTICS Mine gold from mountain heaps
  24. 24. As a brand owner with significant investments in social media, the usual questions you might have in mind… • Is the brand exuding same attributes I intended it to be? • Is my internet presence helping me ? • Can I measure my ROI for the money I spent? • What are the measurable metrics for effective social media management? • When can I exploit emerging trends for my brand? • How can I understand my customers better? 26 WHAT WOULD YOU WANT?
  25. 25. 27 WHAT WOULD YOU WANT TO TARGET? TRANSACTIONAL CONVERSATIONS Users talk about current, events, share cat videos and engage in trivial gossip INFORMATIONAL CONVERSTIONS Users engage with the brand to air appreciation or complaints.
  26. 26. 28 DATA SOURCES Social Media Comments Brand Website Blogs Customer Emails Customer Surveys News Websites
  27. 27. For a brand, say X, there would be thousands of conversations on blogs, websites and social media revolving around X. Several hundred blog posts are published that contain the keywords “X” This company needs to know: - How many of these posts are relevant and actually expressing opinions about X? - How many relevant posts are negative, and how many are positive? - What particular aspects and features of X are being praised or criticized? - For all of the above, what is the trend for the past few time periods? - How is my brand perceived among people demographics? - How is my brand faring against my competitors? 29 QUESTIONS TO ANSWER
  28. 28. BRIEF TERMINOLOGY You build an algorithm, machine learns patterns, machine predicts, rinse & repeat. MACHINE LEARNING TEXT ANALYTICS Analyzing unstructured text, assign structure, load into a BI/program to visualize
  29. 29. CASE STUDY HOW WE HELPED A RESTAURANT SERVE THEIR CUSTOMERS BETTER
  30. 30. PROBLEM STATEMENT The client had thousands of customer reviews which they wanted to analyse - to understand customer feedback and identify improvement opportunities. The broad questions we focused on; What did they say about the restaurant? Keywords & topics of discussion across the comments What elements of the restaurant would they want improved? – service, staff behaviour, ambience etc. When did the customer visit the store? How is client’s traffic distributed over time? Ticket sizes across multiple customer dimensions – age, gender, ratings, location, time of visit etc. Overall customer sentiments & views about UCH PRIMARY FOCUS AREAS SECONDARY FOCUS AREAS
  31. 31. FOCUS AREAS TOPICS KEYWORDS SENTIMENT POINT OF SALES
  32. 32. APPROACH Extract data and validate Corpus from social media Tokenise and remove stop words Initiate ML models , NER , parsers & topic algorithms Initiate detection rules for topics, keywords, gender, sentiment and multi-word concept detection Final Output PRE - PROCESSING PARSING & ANALYSIS OUTPUT Part of Speech (POS) Tagger
  33. 33. Author Value Type Sentiment Duncan Riley Samsung Galaxy S6 Entity Positive Duncan Riley Apple Entity Negative Duncan Riley LoopPay Entity Neutral Duncan Riley mobile payments Keyword Positive Duncan Riley point of sales Keyword Positive Structuring data from free flowing text is easy to use by existing reporting and business intelligence software. Insights from the final reports can now be used for decision-making by the PR firm and their client Actual blog post parsed through our SmartText Engine
  34. 34. LUNCHBOX www.lunchbox.subcortext.com
  35. 35. RATINGS – HOW MUCH IS BETTER ? Increasing scope of differentiating operational improvements Decreasing scope of customer loyalty
  36. 36. OUR PLATFORM 7.7 Mn 92.6 K 62.6 K16 reviews restaurants user profilestopics As on 28th October, 2015
  37. 37. LUNCHBOX - SCREENSHOTS & MOCKUPS
  38. 38. Lunchbox – Search Screen
  39. 39. Lunchbox – ViewMe
  40. 40. Lunchbox – ViewMe
  41. 41. Lunchbox – RankMe
  42. 42. Lunchbox – MarketMe
  43. 43. BUT HEY ! THIS NEEDS ME TO WRITE A CODE. YOU LIAR !
  44. 44. SCRAPE > VOSVIEWER > GEPHI > TABLEAU
  45. 45. Scrape the web !
  46. 46. Excel
  47. 47. = ISNUMBER(SEARCH($N$1,H2))
  48. 48. VOSViewer
  49. 49. Gephi
  50. 50. Tableau
  51. 51. LET’S GET OUR HANDS DIRTY, AGAIN !
  52. 52. SOME UNSOLICITED ADVICE ?
  53. 53. • Purse strings WILL be tightened • Fewer ‘unicorn-level’ valuations • No investment without revenue or clients • MVP with traction – a must have • Acquire or be acquired – or be a acquisition target • Needs > Wants – solve a problem, but scope it out •Bootstrapped startups will win, not the valuation-hungry PREDICTIONS
  54. 54. Don't sell what you do - disguise it Ex. Create marketing strategy, not analytics/Tableau Nugget 1
  55. 55. A product that promotes laziness or transfers laziness from seller to buyer will sell Nugget 2
  56. 56. Is it something that Google/FB provides or can do? Even if it's partial, danger is real. Nugget 3
  57. 57. The product should be IP-able, scalable and monetize-able - atleast 2/3 must be met Nugget 4
  58. 58. Read Read Read techcruch, recode, crunchbase, yourstory, nextbigwhat, vccircle Nugget 5
  59. 59. QUESTIONS ?
  60. 60. MANAS RANJAN KAR manas@jsm.email +91-9971 420 188 www.unlocktext.com #connect

×