Your SlideShare is downloading. ×

Polyglot parallelism


Published on

Two years ago Rackspace had a problem: how do we backup 20K network devices, in 8 datacenters, across 3 continents, with less than a 1% failure rate -- every single day? Many solutions were tried and …

Two years ago Rackspace had a problem: how do we backup 20K network devices, in 8 datacenters, across 3 continents, with less than a 1% failure rate -- every single day? Many solutions were tried and found wanting: a pure Perl solution, a vendor solution and then one in Ruby, none worked well enough. They not fast enough or they were not reliable enough, or they were not transparent enough when things went wrong. Now we all love Ruby but good Rubyists know that it is not always the best tool for the job. After re-examining the problem we decided to rewrite the application in a mixture of Erlang and Ruby. By exploiting the strengths of both -- Erlang's astonishing support for parallelism and Ruby's strengths in web development -- the problem was solved. In this talk we'll get down and dirty with the details: the problems we faced and how we solved them. We'll cover the application architecture, how Ruby and Erlang work together, and the Erlang approach to asynchronous operations (hint: it does not involve callbacks). So come on by and find out how you can get these two great languages to work together.

Published in: Technology, Business
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide
  • * Not an Erlang tutorial (but you will see some code)\n* Giving our experiences with Erlang + Ruby\n
  • \n
  • * Firewalls, load balancers\n\n
  • * North America, Europe and Hong Kong\n* Everything runs from DC in Virginia\n* High latency to LON and HKG\n* ~5K devices in LON\n
  • * Pass traffic like a pro\n* Individual devices won’t get faster\n\n\n
  • * Backup all devices every 24 hours\n* Update devices during off hours\n\n
  • * Newer devices have real management interfaces (give examples Junos, BigIP) but...\n* Majority of management happens via screen-scraping an SSH session\n* Pixen are slow and we have a lot of them\n\n
  • \n
  • \n
  • * Lots of devices == lots of data\n
  • * 260 GB of backups in old DB\n* We diff the backups to see if they have changed\n* We don’t want to be too clever\n * Err on the side of keeping too much rather than too little\n
  • * Impossible to predict what combination of factors will lead to a need to search\n* Certain lb vendor had a vulnerability in SSH daemon\n* Had to search all firewall configs to find affected devices with SSH access allowed\n
  • \n
  • * Every event has different information that needs to be stored\n* Lots of events per device\n
  • \n
  • \n
  • * Serial #, OS version, chassis details\n* Information is parsed from device output\n* We want to expose information useful to the business in one place\n\n
  • * No accepted cross-vendor backup standard\n* We abstract information as “files”\n\n
  • * 17K devices in 2009, ~21K now\n
  • * What was in place in 2009\n
  • * Rails and Ruby scripts\n* Overlapping responsibilities\n* Information silos\n* Difficult to change\n
  • * concurrency with Ruby 1.8 threads\n * It is an I/O bound problem\n * Threads block on I/O\n* expect.rb has performance issues\n * Matching input one character at a time\n
  • * Only one type of device\n* Small number of devices per manager\n* Expensive\n
  • \n
  • * Fully Buzzword-compliant!\n
  • * No message queue\n* Poll MongoDB for some jobs\n* Schedule others in code\n* Very few writers to the database\n
  • * Backend updates generally use MongoDB atomic updates\n* Need transactions for cross-collection modifications, but in-document can use atomic updates\n
  • \n
  • * You guys know Rails and have probably heard of mongodb\n* Take a minute to talk about Erlang\n
  • * Ericsson Computer Science Laboratory\n* 1986: Joe Armstrong creates Erlang\n* 1988: Erlang escapes the lab\n* 1998: Erlang Released as Open Source\n\n
  • * Developed for BT\n* 1.5 Million Lines of Erlang Code\n* 2 year evaluation\n* 9 9s of uptime\n
  • * Erlang doesn’t guarantee 9 9s\n* Gives you the tools to make high uptime possible\n
  • * Some core concepts are different than imperative/OO languages\n
  • * Just like Ruby\n
  • * Only assign to variables once\n* Allows flexible pattern matching and runtime assertions\n* “=” is the pattern match operator, not assignment\n
  • * Variables are function-scoped, so single assignment is really a non-issue\n
  • \n
  • * Changing a data structure creates a new one\n* Purely Functional Data Structures - Okasaki\n
  • \n
  • * Concurrency built into the language, not an add-on\n
  • \n
  • \n
  • * Two areas of interest: jobs framework and ReSTful API\n
  • \n
  • * Runner spins up multiple workers\n* Runner and workers are generic\n* Interesting work for each job is in callback module\n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • Example of how this solution made it easier to solve our problems\n\n* App running on two VMs and one developed problems\n* Moved all functionality to other node and rebuilt problematic VM\n* Doubled the workers on second VM with no detectable performance degradation\n\n
  • * Occupies approximately the same position in your architecture as Rails controllers, but thoroughly exposes you to the details of the HTTP request/response\n* When you develop apps with Webmachine, you are basically writing a custom webserver for your API\n* And what are those details... <transition to next slide>\n\n
  • * As seen through the eyes of Webmachine\n* Express API in terms of HTTP\n* Use HTTP as your domain language\n
  • \n
  • \n
  • * A bit like Rails routing\n* First element (the “pathspec”) is a list that matches the request URL\n* Second element is the name of an Erlang module that exports the overridden callbacks\n* Third element is the args to the init function on the callback module\n
  • * POEM - So far as I know, I coined this by accident\n* Why is the purity of functions important? predictability, testability, repeatability\n* You’ll usually override 4 or 5\n
  • * Override only the parts of the request cycle that interest you, and Webmachine will do something reasonable for the other cases\n
  • * Request parameter data is analogous to that contained in Rack::Request\n
  • * Putting it all together\n* Previous callbacks stash data in the context (in this case a device record)\n\n
  • \n
  • * You can get far with Ruby’s “Big 3” datatypes: String, Array and Hash\n\n
  • * Strings in Erlang are different\n* iolists are your friend\n
  • * Proplists/records vs hashes\n\n
  • * Erlang has internal iterators for lists (like each, etc)\n* No for loops, use recursion instead\n
  • * if statements must always have an else\n* case statements raise an error if no branches match\n* pattern matching can replace some conditions\n\n
  • * Erlang does not tolerate design mistakes as easily as Ruby\n* Pure vs impure functions\n* Pure is easy to test, IO is not your friend\n* Referential transparency\n* Dependency injection\n* Mocking is possible but not a panacea\n
  • * You must understand the concurrency primitives\n* In general you should be using the OTP behaviors\n* If you use ORM you must understand SQL\n
  • \n
  • * Emphasis on stability over new features\n* New useful features in Erlang for years that community frowns upon\n * Undocumented with uncertain future\n
  • * Standard library is very rich\n* 3rd party libraries tend to be immature\n
  • * lists, proplists, binary and string\n* string has duplicate functionality\n * made from merging two older modules\n* lists and proplists have duplicated, overlapping functionality\n
  • * agner, faxien, sinan, etc.\n
  • * how did we get this adopted?\n* how do you find Erlang programmers?\n* why not Node.js, EventMachine or Ruby + sockets?\n
  • \n
  • Transcript

    • 1. Polyglot Parallelism A Case Study in Using Erlang and Ruby at Rackspace
    • 2. The Problem Part 1
    • 3. 20,000network devices
    • 4. 9 Datacenters, 3 Continents
    • 5. devices not designed for high-throughput management
    • 6. we need a highthroughput solution
    • 7. the time spent in I/O is the primary bottleneck
    • 8. if you want to speedthings up you have totalk to more devices in parallel
    • 9. The Problem Part II
    • 10. huge blobs of data
    • 11. lots of backupsequals big database
    • 12. ad-hoc searching is difficult but important
    • 13. customer SLA meansneed to restore from backup quickly
    • 14. an event must begenerated for eachdevice interaction
    • 15. migrations areproblematic with that much data
    • 16. rigid schema made adapting to new devices difficult
    • 17. each device type hasdifferent properties
    • 18. “backup” meansdifferent things for each device type
    • 19. need to grow with the business
    • 20. PreviousSolution
    • 21. multiple Ruby apps
    • 22. difficult to scale
    • 23. vendor device managers
    • 24. New Solution
    • 25. the simplest thingthat could possibly work
    • 26. most db writes come fromscheduled jobs
    • 27. Other ClientsRails ReST API Erlang MongoDB Network Devices
    • 28. Joe Armstrong
    • 29. AXD301ATM Switch
    • 30. 99.9999999%
    • 31. Functional
    • 32. Dynamically Typed
    • 33. Single Assignment
    • 34. A = 1. %=> 1A = 2. %=> badmatch
    • 35. [B, 2, C] = [1, 2, 3].B = 1. %=> 1C = 3. %=> 3
    • 36. Immutable DataStructures
    • 37. D = dict:new().D1 = dict:store(foo, 1, D).D2 = dict:store(bar, 2, D1).
    • 38. Concurrency Oriented
    • 39. -module(fact).-export([fac/1]).fac(0) -> 1;fac(N) -> N * fac(N-1).
    • 40. -module(quicksort).-export([quicksort/1]).quicksort([]) -> [];quicksort([Pivot|Rest]) -> quicksort([Front || Front <- Rest, Front < Pivot]) ++ [Pivot] ++ quicksort([Back || Back <- Rest, Back >= Pivot]).
    • 41. Details
    • 42. jobs framework
    • 43. Runner Callback Workers Module
    • 44. Callback Runner Worker Module start readyitem process process ready . . . stop
    • 45. “behaviour” is interface
    • 46. behaviour_info(callbacks) -> [ {init, 1}, {process_item, 3}, {worker_died, 5}, {job_stopping, 1}, {job_complete, 2}].
    • 47. running({worker_ready, WorkerPid, ok}, S) -> case queue:out(S#state.items) of {empty, I2} -> stop_worker(WorkerPid, S), {next_state, complete, S#state{items = I2}}; {{value, Item}, I2} -> job_worker:process(WorkerPid, Item, now(), S#state.job_state), {next_state, running, S#state{items = I2}} end;
    • 48. handle_info({DOWN, _, process, WorkerPid, Info}, StateName, S) -> {Item, StartTime} = clear_worker(WorkerPid, S), Callback = S#state.callback, spawn(Callback, worker_died, [Item, WorkerPid, StartTime, Info, S#state.job_state]), %% Start a replacement worker start_workers(1, Callback), {next_state, StateName, S};
    • 49. handle_cast({process, Item, StartTime, JS}, S) -> Callback = S#state.callback, Continue = try Callback:process_item(Item, StartTime, JS) catch throw: Error -> error_logger:error_report(Error), ok end, job_runner:worker_ready(S#state.runner, self(), Continue), {noreply, S}.
    • 50. story time
    • 51. ReSTful API with WebmachineThe Convention Over Configuration Webserver
    • 52. http://webmachine.basho.comHTTP Request Lifecycle Diagram
    • 53. If you know HTTPWebmachine Is SimpleAs Proven by the “Number of Types of Things” Measurement of Complexity
    • 54. The 3 Most Important Types of Things In Webmachine1. Dispatch Rules (pure data--barely a thing!)2. Resources (composed of simple functions!)3. Requests (simple get/set interface!)
    • 55. Dispatch Rules { ["devices", server], device_resource, [] } GET /devices/12345 Webmachine inspects the device_resource module fordefined callbacks, and sets the Request record’s “server” value to 12345.
    • 56. Resources• POEM (Plain Old Erlang Module)• Composed of referentially transparent functions*• Functions are callbacks into the request lifecycle• Approximately 30 possible callback functions, e.g.: • resource_exists → 404 Not Found • is_authorized → 401 Not Authorized * mostly
    • 57. Resource Functions Perma-404resource_exists(Request, Context) -> {false, Request, Context}. Lucky Authis_authorized(Request, Context) -> S = calendar:time_to_seconds(now()), case S rem 2 of 0 -> {true, Request, Context}; 1 -> {“Basic realm=lucky”, Request, Context} end.
    • 58. Requests• The first argument to each resource function• Set and read request & response data RemoteIP = wrq:peer(Request).wrq:set_resp_header(“X-Answer”, “42”, Request).
    • 59. Retrieving a JSON Firewall Representationcontent_types_provided(Request, Context) -> Types = [{"application/json", to_json}], {Types, Request, Context}.to_json(Request, Context) -> Device = proplists:get_value(device, Context), UserId = get_user_id(Request), case fe_api_firewall:get_config(Device, UserId) of {ok, Config} -> success_response(Config, Request, Context); {error, Reason} -> error_response(502, Reason, Request, Context) end.
    • 60. Gotchas
    • 61. primitive obsession
    • 62. string-ish “hi how are you” <<“hello there”>>[<<"easy as ">>, [$a, $b, $c], " ☺n"].
    • 63. hashes vs records
    • 64. to loop is human, to recur divine
    • 65. Erlang conditionalsalways return a value
    • 66. design for testability
    • 67. don’t spawn,use OTP
    • 68. Downsides
    • 69. Erlang changes very slowly
    • 70. 3rd party libraries
    • 71. standard librarycan be inconsistent
    • 72. packagemanagement
    • 73. Questions
    • 74. @philtoland @lifeinzembla