Polyglot parallelism


Two years ago Rackspace had a problem: how do we back up 20K network devices, in 8 datacenters, across 3 continents, with less than a 1% failure rate -- every single day? Many solutions were tried and found wanting: a pure Perl solution, a vendor solution, and then one in Ruby. None worked well enough: they were not fast enough, not reliable enough, or not transparent enough when things went wrong. Now, we all love Ruby, but good Rubyists know that it is not always the best tool for the job. After re-examining the problem we decided to rewrite the application in a mixture of Erlang and Ruby. By exploiting the strengths of both -- Erlang's astonishing support for parallelism and Ruby's strengths in web development -- the problem was solved. In this talk we'll get down and dirty with the details: the problems we faced and how we solved them. We'll cover the application architecture, how Ruby and Erlang work together, and the Erlang approach to asynchronous operations (hint: it does not involve callbacks). So come on by and find out how you can get these two great languages to work together.
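A rough back-of-the-envelope calculation shows why parallelism matters at this scale. The device count and the 24-hour cycle come from the talk; the 45-second average session time is purely an assumption for illustration, and the constant and method names are hypothetical.

```ruby
# Back-of-the-envelope sizing for a daily backup run.
DEVICES            = 20_000     # from the abstract
WINDOW_SECONDS     = 24 * 3600  # "every single day"
SECONDS_PER_DEVICE = 45.0       # ASSUMPTION: average time per device session

# How many device sessions must be in flight at once to finish in the window.
def required_concurrency(devices, seconds_per_device, window_seconds)
  # Total serial work divided by the time available, rounded up.
  ((devices * seconds_per_device) / window_seconds).ceil
end

SERIAL_HOURS   = DEVICES * SECONDS_PER_DEVICE / 3600.0
SESSIONS_NEEDED = required_concurrency(DEVICES, SECONDS_PER_DEVICE, WINDOW_SECONDS)

puts "Serial run: #{SERIAL_HOURS} hours; parallel sessions needed: #{SESSIONS_NEEDED}"
```

Under these assumed numbers a one-at-a-time run would take far longer than a day, so the only lever left is the number of devices handled concurrently.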

Published in: Technology, Business

Speaker notes:
  • * Not an Erlang tutorial (but you will see some code)
    * Giving our experiences with Erlang + Ruby
  • * Firewalls, load balancers
  • * North America, Europe and Hong Kong
    * Everything runs from DC in Virginia
    * High latency to LON and HKG
    * ~5K devices in LON
  • * Pass traffic like a pro
    * Individual devices won’t get faster
  • * Backup all devices every 24 hours
    * Update devices during off hours
  • * Newer devices have real management interfaces (give examples Junos, BigIP) but...
    * Majority of management happens via screen-scraping an SSH session
    * Pixen are slow and we have a lot of them
  • * Lots of devices == lots of data
  • * 260 GB of backups in old DB
    * We diff the backups to see if they have changed
    * We don’t want to be too clever
      * Err on the side of keeping too much rather than too little
  • * Impossible to predict what combination of factors will lead to a need to search
    * Certain lb vendor had a vulnerability in SSH daemon
    * Had to search all firewall configs to find affected devices with SSH access allowed
  • * Every event has different information that needs to be stored
    * Lots of events per device
  • * Serial #, OS version, chassis details
    * Information is parsed from device output
    * We want to expose information useful to the business in one place
  • * No accepted cross-vendor backup standard
    * We abstract information as “files”
  • * 17K devices in 2009, ~21K now
  • * What was in place in 2009
  • * Rails and Ruby scripts
    * Overlapping responsibilities
    * Information silos
    * Difficult to change
  • * Concurrency with Ruby 1.8 threads
      * It is an I/O bound problem
      * Threads block on I/O
    * expect.rb has performance issues
      * Matching input one character at a time
  • * Only one type of device
    * Small number of devices per manager
    * Expensive
  • * Fully Buzzword-compliant!
  • * No message queue
    * Poll MongoDB for some jobs
    * Schedule others in code
    * Very few writers to the database
  • * Backend updates generally use MongoDB atomic updates
    * Need transactions for cross-collection modifications, but in-document can use atomic updates
  • * You guys know Rails and have probably heard of MongoDB
    * Take a minute to talk about Erlang
  • * Ericsson Computer Science Laboratory
    * 1986: Joe Armstrong creates Erlang
    * 1988: Erlang escapes the lab
    * 1998: Erlang Released as Open Source
  • * Developed for BT
    * 1.5 Million Lines of Erlang Code
    * 2 year evaluation
    * 9 9s of uptime
  • * Erlang doesn’t guarantee 9 9s
    * Gives you the tools to make high uptime possible
  • * Some core concepts are different than imperative/OO languages
  • * Just like Ruby
  • * Only assign to variables once
    * Allows flexible pattern matching and runtime assertions
    * “=” is the pattern match operator, not assignment
  • * Variables are function-scoped, so single assignment is really a non-issue
  • * Changing a data structure creates a new one
    * Purely Functional Data Structures - Okasaki
  • * Concurrency built into the language, not an add-on
  • * Two areas of interest: jobs framework and ReSTful API
  • * Runner spins up multiple workers
    * Runner and workers are generic
    * Interesting work for each job is in callback module
  • Example of how this solution made it easier to solve our problems
    * App running on two VMs and one developed problems
    * Moved all functionality to other node and rebuilt problematic VM
    * Doubled the workers on second VM with no detectable performance degradation
  • * Occupies approximately the same position in your architecture as Rails controllers, but thoroughly exposes you to the details of the HTTP request/response
    * When you develop apps with Webmachine, you are basically writing a custom webserver for your API
    * And what are those details... <transition to next slide>
  • * As seen through the eyes of Webmachine
    * Express API in terms of HTTP
    * Use HTTP as your domain language
  • * A bit like Rails routing
    * First element (the “pathspec”) is a list that matches the request URL
    * Second element is the name of an Erlang module that exports the overridden callbacks
    * Third element is the args to the init function on the callback module
  • * POEM - So far as I know, I coined this by accident
    * Why is the purity of functions important? predictability, testability, repeatability
    * You’ll usually override 4 or 5
  • * Override only the parts of the request cycle that interest you, and Webmachine will do something reasonable for the other cases
  • * Request parameter data is analogous to that contained in Rack::Request
  • * Putting it all together
    * Previous callbacks stash data in the context (in this case a device record)
  • * You can get far with Ruby’s “Big 3” datatypes: String, Array and Hash
  • * Strings in Erlang are different
    * iolists are your friend
  • * Proplists/records vs hashes
  • * Erlang has internal iterators for lists (like each, etc)
    * No for loops, use recursion instead
  • * if statements must always have an else
    * case statements raise an error if no branches match
    * pattern matching can replace some conditions
  • * Erlang does not tolerate design mistakes as easily as Ruby
    * Pure vs impure functions
    * Pure is easy to test, IO is not your friend
    * Referential transparency
    * Dependency injection
    * Mocking is possible but not a panacea
  • * You must understand the concurrency primitives
    * In general you should be using the OTP behaviors
    * If you use ORM you must understand SQL
  • * Emphasis on stability over new features
    * New useful features in Erlang for years that community frowns upon
      * Undocumented with uncertain future
  • * Standard library is very rich
    * 3rd party libraries tend to be immature
  • * lists, proplists, binary and string
    * string has duplicate functionality
      * made from merging two older modules
    * lists and proplists have duplicated, overlapping functionality
  • * agner, faxien, sinan, etc.
  • * how did we get this adopted?
    * how do you find Erlang programmers?
    * why not Node.js, EventMachine or Ruby + sockets?
  • Polyglot parallelism
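
For readers more at home in Ruby, the division of labor in the jobs-framework notes (a generic runner that spins up generic workers, with the interesting work in a callback module) could be sketched roughly as below. This is only an analogy using threads and a queue, not the talk's Erlang implementation; the class names `Runner` and `UppercaseJob` are hypothetical, and unlike the Erlang version it does not replace a worker that dies.

```ruby
# Hypothetical callback object: all job-specific work lives here.
class UppercaseJob
  def process_item(item)
    item.upcase
  end
end

# Generic runner: it only knows that the callback responds to process_item.
class Runner
  def initialize(callback, worker_count)
    @callback = callback
    @worker_count = worker_count
  end

  # Processes every item and returns [item, result_or_error] pairs
  # in completion order. Items must not be nil or false.
  def run(items)
    queue = Queue.new
    items.each { |item| queue << item }
    queue.close # once drained, pop returns nil and the workers exit

    results = Queue.new
    workers = Array.new(@worker_count) do
      Thread.new do
        while (item = queue.pop)
          begin
            results << [item, @callback.process_item(item)]
          rescue StandardError => e
            # Record the failure and keep the worker alive for the next item.
            results << [item, e]
          end
        end
      end
    end
    workers.each(&:join)

    collected = []
    collected << results.pop until results.empty?
    collected
  end
end

Runner.new(UppercaseJob.new, 3).run(%w[fw1 lb2 fw3]).each do |item, result|
  puts "#{item} -> #{result}"
end
```

Swapping in a different callback object changes what the job does without touching the runner, which is the property the notes attribute to the callback module.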

    1. 1. Polyglot Parallelism: A Case Study in Using Erlang and Ruby at Rackspace
    2. 2. The Problem Part I
    3. 3. 20,000 network devices
    4. 4. 9 Datacenters, 3 Continents
    5. 5. devices not designed for high-throughput management
    6. 6. we need a high-throughput solution
    7. 7. the time spent in I/O is the primary bottleneck
    8. 8. if you want to speed things up you have to talk to more devices in parallel
    9. 9. The Problem Part II
    10. 10. huge blobs of data
    11. 11. lots of backups equals big database
    12. 12. ad-hoc searching is difficult but important
    13. 13. customer SLA means need to restore from backup quickly
    14. 14. an event must be generated for each device interaction
    15. 15. migrations are problematic with that much data
    16. 16. rigid schema made adapting to new devices difficult
    17. 17. each device type has different properties
    18. 18. “backup” means different things for each device type
    19. 19. need to grow with the business
    20. 20. Previous Solution
    21. 21. multiple Ruby apps
    22. 22. difficult to scale
    23. 23. vendor device managers
    24. 24. New Solution
    25. 25. the simplest thing that could possibly work
    26. 26. most db writes come from scheduled jobs
    27. 27. [architecture diagram] Other Clients, Rails, ReST API, Erlang, MongoDB, Network Devices
    28. 28. Joe Armstrong
    29. 29. AXD301 ATM Switch
    30. 30. 99.9999999%
    31. 31. Functional
    32. 32. Dynamically Typed
    33. 33. Single Assignment
    34. 34. A = 1. %=> 1
            A = 2. %=> badmatch
    35. 35. [B, 2, C] = [1, 2, 3].
            B = 1. %=> 1
            C = 3. %=> 3
    36. 36. Immutable Data Structures
    37. 37. D = dict:new().
            D1 = dict:store(foo, 1, D).
            D2 = dict:store(bar, 2, D1).
    38. 38. Concurrency Oriented
    39. 39. -module(fact).
            -export([fac/1]).
            fac(0) -> 1;
            fac(N) -> N * fac(N-1).
    40. 40. -module(quicksort).
            -export([quicksort/1]).
            quicksort([]) -> [];
            quicksort([Pivot|Rest]) ->
                quicksort([Front || Front <- Rest, Front < Pivot])
                ++ [Pivot] ++
                quicksort([Back || Back <- Rest, Back >= Pivot]).
    41. 41. Details
    42. 42. jobs framework
    43. 43. [diagram] Runner, Workers, Callback Module
    44. 44. [sequence diagram] Runner, Worker, Callback Module: start, ready, item, process, process, ready, . . ., stop
    45. 45. “behaviour” is interface
    46. 46. behaviour_info(callbacks) ->
                [{init, 1},
                 {process_item, 3},
                 {worker_died, 5},
                 {job_stopping, 1},
                 {job_complete, 2}].
    47. 47. running({worker_ready, WorkerPid, ok}, S) ->
                case queue:out(S#state.items) of
                    {empty, I2} ->
                        stop_worker(WorkerPid, S),
                        {next_state, complete, S#state{items = I2}};
                    {{value, Item}, I2} ->
                        job_worker:process(WorkerPid, Item, now(), S#state.job_state),
                        {next_state, running, S#state{items = I2}}
                end;
    48. 48. handle_info({'DOWN', _, process, WorkerPid, Info}, StateName, S) ->
                {Item, StartTime} = clear_worker(WorkerPid, S),
                Callback = S#state.callback,
                spawn(Callback, worker_died,
                      [Item, WorkerPid, StartTime, Info, S#state.job_state]),
                %% Start a replacement worker
                start_workers(1, Callback),
                {next_state, StateName, S};
    49. 49. handle_cast({process, Item, StartTime, JS}, S) ->
                Callback = S#state.callback,
                Continue = try
                               Callback:process_item(Item, StartTime, JS)
                           catch
                               throw:Error ->
                                   error_logger:error_report(Error),
                                   ok
                           end,
                job_runner:worker_ready(S#state.runner, self(), Continue),
                {noreply, S}.
    50. 50. story time
    51. 51. ReSTful API with Webmachine: The Convention Over Configuration Webserver
    52. 52. http://webmachine.basho.com HTTP Request Lifecycle Diagram
    53. 53. If you know HTTP, Webmachine Is Simple: As Proven by the “Number of Types of Things” Measurement of Complexity
    54. 54. The 3 Most Important Types of Things In Webmachine
            1. Dispatch Rules (pure data--barely a thing!)
            2. Resources (composed of simple functions!)
            3. Requests (simple get/set interface!)
    55. 55. Dispatch Rules
            { ["devices", server], device_resource, [] }
            GET /devices/12345
            Webmachine inspects the device_resource module for defined callbacks, and sets the Request record’s “server” value to 12345.
    56. 56. Resources
            • POEM (Plain Old Erlang Module)
            • Composed of referentially transparent functions*
            • Functions are callbacks into the request lifecycle
            • Approximately 30 possible callback functions, e.g.:
              • resource_exists → 404 Not Found
              • is_authorized → 401 Unauthorized
            * mostly
    57. 57. Resource Functions
            Perma-404
            resource_exists(Request, Context) ->
                {false, Request, Context}.
            Lucky Auth
            is_authorized(Request, Context) ->
                S = calendar:time_to_seconds(now()),
                case S rem 2 of
                    0 -> {true, Request, Context};
                    1 -> {"Basic realm=lucky", Request, Context}
                end.
    58. 58. Requests
            • The first argument to each resource function
            • Set and read request & response data
            RemoteIP = wrq:peer(Request).
            wrq:set_resp_header("X-Answer", "42", Request).
    59. 59. Retrieving a JSON Firewall Representation
            content_types_provided(Request, Context) ->
                Types = [{"application/json", to_json}],
                {Types, Request, Context}.
            to_json(Request, Context) ->
                Device = proplists:get_value(device, Context),
                UserId = get_user_id(Request),
                case fe_api_firewall:get_config(Device, UserId) of
                    {ok, Config} ->
                        success_response(Config, Request, Context);
                    {error, Reason} ->
                        error_response(502, Reason, Request, Context)
                end.
    60. 60. Gotchas
    61. 61. primitive obsession
    62. 62. string-ish
            "hi how are you"
            <<"hello there">>
            [<<"easy as ">>, [$a, $b, $c], " ☺\n"].
    63. 63. hashes vs records
    64. 64. to loop is human, to recur divine
    65. 65. Erlang conditionals always return a value
    66. 66. design for testability
    67. 67. don’t spawn, use OTP
    68. 68. Downsides
    69. 69. Erlang changes very slowly
    70. 70. 3rd party libraries
    71. 71. standard library can be inconsistent
    72. 72. package management
    73. 73. Questions
    74. 74. http://spkr8.com/t/7806
            Phil: @philtoland http://github.com/toland http://philtoland.com
            Mike: @lifeinzembla http://github.com/msassak
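
The dispatch rule on slide 55 can be read as a small pattern-matching problem: literal strings in the pathspec must equal the corresponding URL segment, and atoms (here Ruby symbols) bind whatever segment they line up with. The sketch below illustrates that idea in Ruby; it is not Webmachine's implementation, `match_pathspec`, `dispatch` and `DISPATCH` are hypothetical names, and wildcard segments are left out.

```ruby
# Match a request path against a Webmachine-style pathspec.
# Strings must match literally; symbols bind the segment in that position.
# Returns a hash of bindings, or nil when the path does not match.
def match_pathspec(pathspec, path)
  segments = path.split("/").reject(&:empty?)
  return nil unless segments.length == pathspec.length

  bindings = {}
  pathspec.zip(segments).each do |spec, segment|
    case spec
    when Symbol then bindings[spec] = segment
    when String then return nil unless spec == segment
    end
  end
  bindings
end

# Hypothetical dispatch table mirroring slide 55:
# [pathspec, resource module name, init args]
DISPATCH = [
  [["devices", :server], :device_resource, []]
].freeze

# Returns [resource, init_args, bindings] for the first matching rule, or nil.
def dispatch(path)
  DISPATCH.each do |pathspec, resource, args|
    bindings = match_pathspec(pathspec, path)
    return [resource, args, bindings] if bindings
  end
  nil
end

p dispatch("/devices/12345")
p dispatch("/firewalls/1")
```

Because the rules are plain data, adding a route in this sketch means adding one more entry to the table, which matches the slide's description of dispatch rules as "pure data".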