
The Human Algorithm: Automating Startup Data Collection at Mattermark


Keynote by Sarah Catanzaro, Head of Data at Mattermark, at Data Point Live, San Francisco, 10/1/2015

Published in: Data & Analytics

  1. #datapointlive The Human Algorithm: Automating Startup Data Collection at Mattermark. Sarah Catanzaro, Head of Data at Mattermark, @sarahcat21
  2. #DPL15 | @sarahcat21 Mattermark is a deal intelligence platform and private company database used by ● investors ● business and corporate development ● sales
  3. THE CHALLENGE: Scale + Information Overload + Stealth
  4. Scale: Over 125 million private companies in the world (only about 45.5 thousand public).
  5. Information overload
  6. Stealth ● Private companies do not have strong incentives (e.g. legal obligations) to share data. Many may have competitive incentives to obfuscate information. ● Investors may request non-disclosure.
  7. Mattermark’s Solution
  8. Software-oriented approach ● A must, due to the scale of our dataset ○ 1.3 million companies ○ 16.5k investors ○ 110k funding events ● Leverage a lean data team
  9. Data collection strategy ● Web scraping ● Machine learning ● Direct submission ● Manual data entry
  10. The “Human Algorithm”
  11. Investors ask questions like: Which startups might raise capital in the next 6 months? Which startups is Stephanie Palmeri investing in?
  12. Our data analysts seek to understand: ● Why does this question matter? ● What data is required to answer this question? ● Where can this data be accessed?
  13. Next, data analysts: 1. Define repeatable processes for data collection. 2. Determine whether processes can be replicated through web scraping and/or machine learning algorithms to collect data at scale. 3. Write functional specifications, reviewed by sales and engineering team members.
  14. Next, web and/or machine learning engineers: 1. Write dev designs, reviewed by data analysts. 2. Upon implementation and marketing release, this data becomes available to customers. 3. New questions arise and the cycle starts again.
  15. Funding Automation
  16. Investors ask questions like: How much funding has a company already raised? Who were the investors in each of those rounds?
  17. Problems with existing sources: They rely on wiki-style data collection (cannot confirm the credibility of sources). News reports are better, but ● facts are harder to extract ● different sources report different figures
  18. Solution: funding automation. A new framework for collecting and synthesizing funding data. 1. News article fact extraction (machine learning) 2. Funding override system (web engineering) 3. Funding confirmation email campaign (marketing)
  19. 1. News article fact extraction: Crawl RSS feeds, extract data from stories (title, text, links, etc.) ● 750+ sources ● 5,000 - 10,000 articles
  20. 1. News article fact extraction: Classify stories about funding ● 250 articles/day
  21. 1. News article fact extraction ● Identify sentences containing information about investors, amount, and/or series
  22. 1. News article fact extraction ● Extract facts ● Match companies and investors to entities in our database ○ 30% of extracted articles are entered automatically
  23. 2. Funding override system ● Identify reports about the same funding event ● Combine information from multiple reports using the wongi rules engine
  24. 3. Funding confirmation email campaign: Use CRM and HubSpot to automatically send emails to founders after equity financing.
  25. What We Learned
  26. Where we struggled: Our initial implementation of a funding override system was inefficient. Why? Because our data analysts and developers were not aligned on functional requirements.
  27. Solution ● Analysts must work closely with developers ○ Pre-spec check-ins ○ Analysts review dev designs to ensure that the system design addresses the use case. ● Analysts must avoid being prescriptive ● Analysts must understand data mining and machine learning concepts
  28. Where we succeeded: Implementation of news article fact extraction was successful. Why? Because data analysts and developers worked as service providers to each other.
  29. How We Did It
  30. 1. Tighter Analyst + Dev Communication. Tiger teams: 1 ML developer, 1 web/infrastructure developer, 1 data analyst, 1 project lead. Define milestones & hold daily stand-ups.
  31. 3. Track II interactions reinforce the symbiotic relationship ● Devs lead Python learning group ● Data analysts hold seminars on topics like admin tooling and alternative assets
  32. Thank You!
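
The fact-extraction and override steps on slides 21 to 23 can be illustrated with a minimal Python sketch. It is not Mattermark's implementation: the talk describes a machine-learning extractor and the wongi rules engine, whereas this sketch swaps in regular expressions and a simple majority vote. The names AMOUNT_RE, SERIES_RE, extract_facts, and merge_reports, and the sample sentences, are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical patterns; a stand-in for the ML extractor described in the talk.
AMOUNT_RE = re.compile(r"\$(\d+(?:\.\d+)?)\s*(million|billion|[MB])\b", re.IGNORECASE)
SERIES_RE = re.compile(r"\b(seed|series\s+[A-F])\b", re.IGNORECASE)

MULTIPLIERS = {
    "m": 1_000_000,
    "million": 1_000_000,
    "b": 1_000_000_000,
    "billion": 1_000_000_000,
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_facts(text):
    """Return the first amount (in USD) and round name found in the text."""
    facts = {"amount_usd": None, "series": None}
    for sentence in split_sentences(text):
        amount = AMOUNT_RE.search(sentence)
        series = SERIES_RE.search(sentence)
        if amount and facts["amount_usd"] is None:
            value, unit = amount.groups()
            facts["amount_usd"] = int(round(float(value) * MULTIPLIERS[unit.lower()]))
        if series and facts["series"] is None:
            # Normalize whitespace and casing, e.g. "series  a" -> "Series A".
            facts["series"] = " ".join(series.group(1).split()).title()
    return facts


def merge_reports(reports):
    """Combine reports about one funding event.

    Assumed rule (not from the talk): for each field, keep the most
    frequently reported non-null value.
    """
    merged = {}
    for field in ("amount_usd", "series"):
        values = [r[field] for r in reports if r.get(field) is not None]
        merged[field] = Counter(values).most_common(1)[0][0] if values else None
    return merged


# Made-up articles that are assumed to describe the same funding event.
articles = [
    "Acme raised $12.5 million in a Series A round led by Example Ventures.",
    "Sources say Acme closed a $12.5M Series A.",
    "Acme has secured $12 million in new funding.",
]
print(merge_reports([extract_facts(a) for a in articles]))
```

A majority vote is only one possible override rule. A production system would more plausibly weight sources by credibility, which is the concern slide 17 raises about conflicting figures.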