Writing a Search Engine. How hard could it be?

Five of the most dangerous words you'll hear a developer say are "How hard could it be?" This talk tells the tale of what happens when you act on the thought "I'm going to write the next Google beater. How hard could it be?" It is the story of how one person, in a few hours, was able to write something resembling a search engine thanks to the platform features of Azure and the productivity of F#. We'll see how to use Azure Search from F# to power our search internals, MBrace to rapidly find the most popular web pages on the internet, and Azure Functions to tie everything together, building APIs and creating on-demand infrastructure. Add a healthy mix of queues provided by Azure Service Bus and, if you squint hard enough, you might just end up seeing something resembling a search engine.

But seriously, writing the next Google: just how hard could it be?

A recording of this talk is available via SkillsMatter at https://skillsmatter.com/skillscasts/8901-f-sharpunctional-londoners-meetup

Published in: Software

  1. WRITING A SEARCH ENGINE. HOW HARD COULD IT BE? ANTHONY BROWN @BRUINBROWN93 ANTHONY@COMPOSITIONAL-IT.COM
  2. ABOUT: ABOUT ME ▸ Consultant at Compositional IT ▸ F# dev for ~3 years now ▸ Interested in Big Data, IoT, Cloud and Distributed Systems
  3. COMPOSITIONAL IT: FUNCTIONAL FIRST. CLOUD READY. @COMPOSITIONALIT
  4. INTRODUCTION: "HOW HARD COULD IT BE?" (Every software developer ever)
  5. INTRODUCTION: "IT’S ONLY AN OPERATING SYSTEM, ALL IT DOES IS RUN PROGRAMS!" (Everybody when Windows blue screens)
  6. "IT’S ONLY A MULTIPLAYER ONLINE VIDEO GAME!" (Anybody playing a game when lag spikes hit)
  7. INTRODUCTION: "IT’S ONLY 2 LINES OF JAVASCRIPT" (Backend developer needing to make a small API change)
  8. "DUDE. HOLD MY BEER." (Drunk people 10 seconds before making a terrible mistake)
  9. SATURDAY MORNING. PLANS CANCELLED.
  10. WHAT NEXT? HIT UP GOOGLE.
  11. WHAT TO DO IN LONDON THIS WEEKEND?
  12. WRITING A SEARCH ENGINE. HOW HARD COULD IT BE?
  13. WRITING A SEARCH ENGINE WITH AZURE AND F# IN A WEEKEND.
  14. BUT FIRST.
  15. THIS WAS A WEEKEND PROJECT.
  16. YOU SHOULD EXPECT: - HACKY CODE.
  17. YOU SHOULD EXPECT: - DEMOS TO FAIL.
  18. YOU SHOULD NOT EXPECT: - A DEEP DIVE INTO SEARCH ENGINE TECH.
  19. SEARCH ENGINE BACKGROUND: CONSTRAINTS ▸ Not a priority ▸ Can’t cost more than £85 per month ▸ No operations investment ▸ Limit to the weekend
  20. BACKGROUND: EVERYTHING I KNOW ABOUT HOW SEARCH ENGINES WORK ▸ ▸ ▸ ▸ ▸ ▸
  21. THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE. SERGEY BRIN, LARRY PAGE
  22. IT’S 2016. THE WEB’S CHANGED. A LOT.
  23. WHAT’S NEW? + SCALE
  24. WHAT’S NEW? + USERS
  25. WHAT’S NEW? + GLOBALISATION
  26. WHAT’S NEW? + CLOUD
  27. WHAT’S NEW? + PLATFORM AS A SERVICE
  28. WHAT’S NEW? - INFRASTRUCTURE
  29. WHAT’S NEW? - PERSONAL HOSTING
  30. SEARCH ENGINE BACKGROUND: WHAT’S IMPORTANT? ▸ Search ▸ Scraping ▸ Page rank
  31. SEARCH IMPLEMENTATION: HOW TO FIND A NEEDLE IN A HAYSTACK ▸ Take all of your documents ▸ Record all of the words which occur within each document ▸ Invert that index ▸ The result is a list of all words and the documents they appear in ▸ For all words in the search query, find the documents which appear in the inverted-index entry of every one of those words
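
The inverted-index steps on this slide can be sketched in a few lines. This is a minimal illustration in Python, not code from the talk (whose examples were in F#); the function names `build_inverted_index` and `search`, the whitespace tokeniser, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenisation: lowercase and split on whitespace (an assumption).
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # One posting set per query word; a word that was never seen matches nothing.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by a made-up file name.
docs = {
    "a.html": "things to do in London",
    "b.html": "London weather this weekend",
    "c.html": "things to do this weekend in London",
}
index = build_inverted_index(docs)
print(sorted(search(index, "london weekend")))  # ['b.html', 'c.html']
```

A real engine would also need stemming, stop words, and ranking, which is presumably part of why the talk hands this job to a managed service.
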
  32. SOUNDS EASY, RIGHT? I DON’T CARE ABOUT IT.
  33. AZURE SEARCH: MANAGED SEARCH AS A SERVICE
  34. AZURE SEARCH: WHAT DOES AZURE SEARCH GIVE US? ▸ Hosted Search as a Service ▸ HTTP API for indexing and retrieving documents ▸ Ability to scale out (more replicas, more indexes) ▸ Free basic tier
  35. AZURE SEARCH IN THE AZURE PORTAL.
  36. BOOSTING DEMO.
  37. WE HAVE SEARCH. WHAT NEXT?
  38. INDEXING DATA: WHAT IS A CRAWLER? ▸ Autonomously find every web page on the internet ▸ Pull the content from that web page and index it ▸ Read the links on that page and index those links ▸ Recursively process until every page on the internet has been reached
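
The crawl loop described on this slide might look roughly like the sketch below. It is an assumption-laden illustration rather than the talk's implementation: an in-memory `deque` stands in for a durable queue, `fetch_links` is a hypothetical callback that returns the links found on a page, and `FAKE_WEB` is a made-up link graph used in place of real HTTP requests.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: visit each reachable URL once, up to max_pages."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(seeds)                 # URLs waiting to be processed
    seen = set(seeds)                       # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                 # a real crawler would index the page here
        for link in fetch_links(url):
            if link not in seen:            # avoid re-queueing known pages
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page web used instead of real network access.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}
pages = crawl(["http://a.example/"], lambda url: FAKE_WEB.get(url, []))
print(pages)
```

The `max_pages` cap is there because, unlike this toy graph, the real frontier never empties.
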
  39. THE PROBLEM? THE INTERNET’S PRETTY BIG.
  40. AZURE SERVICE BUS: DISTRIBUTED MESSAGE QUEUES
  41. INDEXING DATA: WHAT DOES AZURE SERVICE BUS GIVE US? ▸ Scalable durable queues and topics with guaranteed availability ▸ .NET APIs to communicate with the service bus ▸ Free basic tier
  42. WORKING WITH A SERVICE BUS QUEUE.
  43. WE NEED TO BE GOOD CITIZENS. WE DON’T WANT TO DDOS A SINGLE WEBSITE DURING CRAWLING.
  44. SERVICE BUS PROVIDES SUPPORT FOR MESSAGE DE-DUPLICATION BASED ON CONTENT.
  45. WE DON’T WANT TO SCRAPE THROUGH EVERY WEB PAGE IN THE WORLD.
  46. WE DON’T WANT TO INDEX: - GOOGLE SEARCH QUERIES
  47. WE DON’T WANT TO INDEX: - PROTECTED CONTENT
  48. WE DON’T WANT TO INDEX: - IRRELEVANT CONTENT
  49. DEALING WITH THE ROBOTS.TXT FILE: WRITING BASIC PARSERS IN F#
  50. BEING A WELL-BEHAVED SCRAPER: WHAT IS ROBOTS.TXT? ▸ Text file standard for telling web scrapers what they should scrape ▸ Opt-in: crawlers can ignore the robots.txt file ▸ Simple file stored at the root of the web server
  51. AN EXAMPLE ROBOTS.TXT FILE.
  52. SIMPLE PARSING WITH F#.
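
The example file and the F# parser from these slides did not survive in the transcript. As a stand-in, the sketch below shows what honouring a robots.txt file can look like, using Python's standard-library `urllib.robotparser` instead of a hand-written parser. The file contents, the `weekend-crawler` user agent, and the `allowed` helper are assumptions invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real crawler would download it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access


def allowed(url, user_agent="weekend-crawler"):
    """True if the rules above permit this (hypothetical) user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://site.example/index.html"))       # True
print(allowed("http://site.example/private/a.html"))   # False
print(allowed("http://site.example/search?q=london"))  # False
```

Because compliance is voluntary, the check has to live in the crawler itself, before a URL is fetched.
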
  53. HTML AND INFORMATION RETRIEVAL: QUERYING HTML DOCUMENTS WITH HTML AGILITY PACK
  54. WE HAVE AN HTML FILE. WE NEED THE CONTENT OUT OF IT.
  55. INFORMATION RETRIEVAL FROM HTML DOCUMENTS: WORKING WITH THE HTML AGILITY PACK ▸ Provides a simple query layer over HTML documents ▸ Works with well formatted and poorly formatted HTML ▸ Provides XPath support over the document ▸ Allows for querying for individual properties and elements
  56. EXTRACTING LINKS FROM AN HTML DOCUMENT
  57. EXTRACTING ALL OF THE CONTENT FROM AN HTML DOCUMENT
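
The two extraction steps named on these slides (links and text content) can be illustrated as follows. The talk uses the HTML Agility Pack from F#; this sketch swaps in Python's standard-library `html.parser` instead, so it shows the idea rather than that library's API. `PageExtractor`, `extract`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect absolute link targets and visible text from an HTML document."""

    SKIPPED = {"script", "style"}  # element contents that are not page text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (links, text) for one page."""
    extractor = PageExtractor(base_url)
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    return extractor.links, " ".join(extractor.text_parts)


SAMPLE = (
    "<html><head><style>p {color: red}</style></head><body>"
    '<p>Hello <a href="/about">about us</a></p>'
    "<script>var x = 1;</script></body></html>"
)
links, text = extract(SAMPLE, "http://site.example/index.html")
print(links)  # ['http://site.example/about']
print(text)   # Hello about us
```

The links would feed the crawl queue and the text would go to the search index.
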
  58. WE NOW HAVE A WEB SCRAPER. WE NEED TO RUN THE WEB SCRAPER.
  59. AZURE WEBJOBS: SIMPLE HOSTING OF LONG RUNNING PROCESSES
  60. AZURE WEB JOBS: WHAT ARE WEB JOBS? ▸ A means of hosting basic executables in the cloud ▸ Provides simplified deployment and monitoring ▸ Pricing per minute of usage
  61. WE NOW HAVE A SEARCH ENGINE. KIND OF.
  62. SEARCH IS A RECOMMENDATION PROBLEM.
  63. HOW DO WE RECOMMEND CONTENT TO USERS?
  64. PAGE RANK: FINDING THE MOST INFLUENTIAL SITES ON THE INTERNET
  65. PAGE RANK: WHAT IS PAGE RANK? ▸ Stanford’s patented algorithm ▸ Helps you find the most influential websites on the internet ▸ Websites with lots of links to them are more influential
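
To make the "lots of links means more influential" idea concrete, here is a small single-machine sketch of PageRank by power iteration. It is not the distributed MBrace job from the talk; the damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the three-page graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a graph given as {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}
    ranks = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: both a and b link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # c
```

Page `c` comes out on top because it has the most inbound links, and `a` outranks `b` because its single inbound link comes from the highly ranked `c`.
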
  66. THE PROBLEM? THERE’S LOTS OF WEBSITES ON THE INTERNET.
  67. THERE’S EVEN MORE LINKS BETWEEN WEBSITES.
  68. WE HAVE A HUGE LINK GRAPH. WE NEED TO PROCESS THAT GRAPH.
  69. BIG DATA PROCESSING WITH MBRACE AND CLOUDFLOWS.
  70. WE HAVE A QUERY WHICH NEEDS TO RUN DAILY. WE NEED TO ORCHESTRATE IT.
  71. AZURE FUNCTIONS + AZURE RESOURCE MANAGER: USING AZURE FUNCTIONS FOR DEVOPS
  72. DEVOPS: WHAT IS AZURE RESOURCE MANAGER? ▸ Declarative way of describing Azure infrastructure ▸ REST APIs to deploy infrastructure template files ▸ APIs to see current deployment status
  73. DEVOPS: WHAT IS AZURE FUNCTIONS? ▸ Lightweight scripting of Azure web jobs ▸ Allows for running scripts in response to certain events ▸ Billing based on number of function invocations
  74. DEVOPS: USING AZURE FUNCTIONS FOR DEVOPS ▸ Set up a timer triggered Azure Function ▸ Deploy an MBrace cluster through Azure Resource Manager ▸ Send an event when the job completes ▸ Second Azure Function for deleting the MBrace cluster
  75. AZURE FUNCTIONS AND AZURE RESOURCE MANAGER.
  76. WE NOW HAVE EVERYTHING IN PLACE FOR A SEARCH ENGINE. NOBODY CAN ACCESS IT THOUGH.
  77. AZURE FUNCTIONS: SERVERLESS WEB APIS WITH AZURE FUNCTIONS
  78. AZURE FUNCTIONS CAN OPERATE ON HTTP REQUESTS.
  79. NO LONG TERM HOSTING COSTS.
  80. AZURE FUNCTIONS HTTP API DEMO.
  81. DONE. SEARCH ENGINE COMPLETE.
  82. HTTP API AZURE SEARCH LINK DATABASE PAGERANK CLUSTER ORCHESTRATOR AZURE SERVICEBUS INDEXER PAGERANK IMPORTER PAGERANK SCORE STORE
  83. PLENTY OF ROOM FOR IMPROVEMENTS.
  84. CACHING SEARCH QUERIES.
  85. QUERY AUTO COMPLETE.
  86. SEARCH A GIVEN DOMAIN.
  87. MULTIPLE LANGUAGE SUPPORT.
  88. SUPPORT FOR OTHER DOCUMENT TYPES.
  89. BETTER INFORMATION RETRIEVAL ALGORITHMS.
  90. WHAT’S NEXT FOR IT? NOTHING.
  91. PRODUCTISING A GOOGLE COMPETITOR IS BASICALLY IMPOSSIBLE.
  92. IN SUMMARY: WRAPPING UP & KEY TAKEAWAYS
  93. AZURE + F# = <3
  94. AZURE MAKES HARD INFRASTRUCTURE PROBLEMS SIMPLE.
  95. F# MAKES HARD SOFTWARE PROBLEMS SIMPLE.
  96. TOGETHER THEY MAKE HARD PROBLEMS SIMPLE.
  97. IT’S NOT GOOGLE. BUT IT TOOK 1 DEV 2 DAYS.
  98. CLOUD IS THE EPITOME OF BUSINESS AGILITY
  99. COMPOSITIONAL IT: ANTHONY@COMPOSITIONAL-IT.COM FUNCTIONAL FIRST. CLOUD READY.
  100. Q&A.