
Multi-threaded web crawler in Ruby


author: Kamil Durski, Senior Ruby Developer at Polcode
Twitter: @kdurski GitHub: github.com/kdurski

Ruby has built-in support for threads, yet they're barely used, even in situations where they could be very handy, such as crawling the web. Crawling can be a slow process, but the majority of that time is spent waiting for IO data from the remote server, which makes it a perfect use case for threads.

You can read more about the multi-threaded web crawler in Ruby here: http://www.polcode.com/en/multi-threaded-web-crawler-in-ruby/



  1. Multi-threaded web crawler in Ruby
  2. Hi, I’m Kamil Durski, Senior Ruby Developer at Polcode. If improving your Ruby skills is what you’re after, stick around. I’ll show you how to use multiple threads to drastically increase the efficiency of your application. As I focus on threads, only the relevant code is displayed in the slideshow. Find the full source here.
  3. The (much) underestimated threads
  4. Ruby programmers have easy access to threads thanks to built-in support. Threads can be very useful, yet for some reason they don’t receive much love. Where can you see their prowess first-hand? Crawling the web is a perfect example! Threads let you save much of the time you’d otherwise spend waiting for data from the remote server.
  5. I’m going to build a simple app so you can really understand the power of threads. It will fetch info on some popular U.S. TV shows (that one with dragons, and one with an ex-chemistry teacher too!) from a bunch of websites. But before we take a look at the code, let’s start with a few slides of good old theory.
  6. What’s the difference between a thread and a process?
  7. A multi-threaded app is capable of doing a lot of things at the same time. That’s because the app can switch between threads, letting each of them use some of the process time. But it’s still a single process. The same thing goes for running many apps on a single-core processor, where it’s the operating system that does the switching.
  8. Another big difference: use threads within a single process and you can share memory and variables between all of them, making development easier. Use multiple processes and processor cores and that’s no longer the case: sharing data gets harder. Check Wikipedia to find out more on threads.
  9. Now we can go back to the TV shows. Aside from Ruby on Rails’ Active Record library for database access, all I’m going to use are three components from Ruby’s thread library: 1) Thread, the core class that runs multiple parts of the code at the same time, 2) Queue, which will let me schedule jobs to be shared by all the threads, 3) Mutex, whose role is to synchronize access to shared resources so the app won’t switch to another thread too early.
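As a minimal sketch of how these three primitives cooperate (illustrative only, not the app’s actual code):

```ruby
queue   = Queue.new # thread-safe job store
mutex   = Mutex.new # guards the shared results array
results = []

5.times { |i| queue << i } # schedule five jobs

threads = 2.times.map do
  Thread.new do
    loop do
      job = begin
        queue.pop(true) # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      mutex.synchronize { results << job * 2 }
    end
  end
end

threads.each(&:join) # wait for both workers to drain the queue
results.sort # => [0, 2, 4, 6, 8]
```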
  10. The app itself is also divided into three major components: 1) Module: I’m going to supply the app with a list of modules to run; a module creates multiple threads and tells the crawler what to do, 2) Crawler: crawler classes fetch data from the websites, 3) Model: models store and retrieve data from the database.
  11. Crawler module
  12. The Crawler module is responsible for setting up the environment and connecting to the database.
  13. The autoload calls refer to the major components inside the lib/ directory. The setup_env method connects to the database, adds app/ directories to the $LOAD_PATH variable, and requires all of the files under the app/ directory. A new Mutex instance is stored in the @mutex variable, which we can access via Crawler.mutex.
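The module itself isn’t shown in the slides; a minimal sketch of the shape described above might look like this (the paths and the database call are illustrative assumptions, shown as comments):

```ruby
module Crawler
  # Lazily load the major components from lib/ (the path is an assumption)
  autoload :Threads, "crawler/threads"

  class << self
    attr_reader :mutex # exposed as Crawler.mutex

    def setup_env
      @mutex = Mutex.new
      # Environment-specific parts, sketched as comments:
      # $LOAD_PATH.unshift(File.expand_path("app", __dir__))
      # Dir[File.join(__dir__, "app", "**", "*.rb")].sort.each { |f| require f }
      # ActiveRecord::Base.establish_connection(db_config) # DB config omitted
    end
  end
end

Crawler.setup_env
Crawler.mutex # => a Mutex instance shared by the whole app
```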
  14. Crawler::Threads class: the core feature
  15. Now I’m going to create the core feature of the app. I’m initializing a few variables: @size, to know how many threads to spawn; the @threads array, to keep track of the threads; and @queue, to store the jobs to do.
  16. I’m calling the #add method to add each job to the queue. It accepts optional arguments and a block. (Please look up blocks in Ruby if you’re not familiar with the concept.)
  17. Next, the #start method initializes the threads and calls #join on each of them. This is essential for the whole app to work: otherwise, once the main thread is done with its own job, it would instantly kill the spawned threads and exit before they finish theirs.
  18. To complete the core functionality, I’m calling the #pop method on the queue to fetch a block, then running that block with the arguments stored by the earlier #add call. The true argument makes sure #pop runs in non-blocking mode. Otherwise, I would run into a deadlock, with a thread waiting for a new job to be added even after the queue is already empty (eventually raising the error “No live threads left. Deadlock?”).
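Putting the last few slides together, a self-contained sketch of the Crawler::Threads class could look like this. The real implementation lives in the linked repository; the method names follow the slides, but the internals here are my own reconstruction:

```ruby
module Crawler
  class Threads
    def initialize(size: 4)
      @size    = size      # how many threads to spawn
      @threads = []        # keeps track of spawned threads
      @queue   = Queue.new # jobs waiting to be picked up
    end

    # Schedule a job: store the block together with its arguments
    def add(*args, &block)
      @queue << [block, args]
    end

    # Spawn the workers and wait for them to finish
    def start
      @size.times { @threads << Thread.new { work } }
      @threads.each(&:join) # without #join the main thread would exit early
    end

    private

    # Pop jobs in non-blocking mode (the `true` argument); an empty
    # queue raises ThreadError instead of deadlocking the worker
    def work
      loop do
        block, args = @queue.pop(true)
        block.call(*args)
      end
    rescue ThreadError
      # queue drained -- let this thread finish
    end
  end
end

# Usage: run several "page fetches" at once (the page names are illustrative)
pool = Crawler::Threads.new(size: 3)
results = Queue.new
%w[page1 page2 page3 page4].each do |page|
  pool.add(page) { |p| results << "fetched #{p}" }
end
pool.start
results.size # => 4
```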
  19. I can use the Crawler::Threads class to crawl multiple pages at the same time.
  20. Now I can run some code to see what all of it amounts to:
  21. 10 seconds to visit 10 pages and fetch some basic information. Alright, now I’m going to try 10 threads.
  22. All it took to do the same task was 1.51 s! The app no longer wastes time doing nothing while waiting for the remote server to deliver data. Interestingly, the output order is different too: with a single thread it matches the config file, while with multiple threads it’s unpredictable, as some threads finish their jobs faster than others.
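The speed-up is easy to reproduce without any network at all; in this sketch, sleep stands in for waiting on a remote server (the timings in the comments are approximate):

```ruby
require "benchmark"

# Stand-in for a page fetch: the thread just waits, as it would on IO
def fetch(page)
  sleep 0.1
  "data for #{page}"
end

pages = (1..10).to_a

serial = Benchmark.realtime do
  pages.each { |p| fetch(p) } # ~1.0 s: the waits add up
end

threaded = Benchmark.realtime do
  pages.map { |p| Thread.new { fetch(p) } }.each(&:join) # ~0.1 s: the waits overlap
end

serial > threaded # => true
```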
  23. Thread safety
  24. The code I used outputs information with puts. That’s not thread-safe, because puts does two separate things: it outputs the given string, then outputs the newline (NL) character. This can cause random NL characters to appear out of place when execution switches mid-call and another thread takes control before the NL character is printed. See the example below:
  25. I fixed this with a mutex, creating a custom #log method that wraps the output in it. Now the console output is always in order, as each thread waits for the puts to finish.
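A sketch of that #log helper, following the slide’s description (the constant name here is my own; the slides use the shared Crawler.mutex instead):

```ruby
LOG_MUTEX = Mutex.new

# Serialize console output so a string and its newline always stay together
def log(message)
  LOG_MUTEX.synchronize { puts message }
end

threads = 10.times.map do |i|
  Thread.new { log("thread #{i} done") }
end
threads.each(&:join)
# Every line prints intact -- no stray newlines mid-message
```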
  26. And that’s it. Now you know more about how threads work. I wrote this code as a side project; web crawling is an important part of what I do. The previous version included more features, such as the use of proxies and TOR network support. The latter improves anonymity but also slows down the code a lot. Thanks for your time and, again, feel free to tackle the entire code at: https://github.com/kdurski/crawler
