Monitoring a Large-Scale Infrastructure with Clojure
JavaOne 2011: Monitoring a Large-Scale Infrastructure with Clojure

  1. Monitoring a Large-Scale Infrastructure with Clojure
  2. Who am I? Dennis Rowe, Senior Software Developer, Dell MessageOne - DevOps (Oracle OpenWorld 2011)
  3. MessageOne: E-Mail Continuity; E-Mail Archive; E-Mail Search
  4. The Basics: 2646 servers; 3 countries; 3 billion e-mails; 5 million users; 12 tired people
  5. We have got to have a way to monitor all that stuff… Maybe not the people
  6. So, we came up with a solution…
  7. “Kneel before Zod” -- General Zod
  8. A Bit of History: Initially written in Python; used the Twisted framework; historical data stored in a relational database. It worked, but it did not perform
  9. Why? The Global Interpreter Lock (GIL) caused performance problems; a relational database is not efficient for time-series data
  10. So… Why switch to Clojure? It is hip; it was designed with multi-threading in mind; it is a functional language; it uses the JVM; we can use all the Java libraries lying around; it is homoiconic
  11. “And there was much rejoicing” -- Monty Python and the Holy Grail
  12. So, this is how we did it
  13. Loader: Takes XML and dumps it on a message bus (RabbitMQ). Nothing much to see here but…
  14. Data is Code: So, how do we store the configurations we want for the various datacenters? As code … data … code … ["dc1" "url1" "type1" "dc2" "url2" "type2"] The configs are just Clojure code and they make sense
  15. RabbitMQ: That is easy. We will just use the RabbitMQ Java API. We will create Clojure-centric data structures. This whole Java interoperability is kind of nice … things just kind of work
  16. Also! If code is data … then we can just send the code over RabbitMQ
  17. Wait, what? We don’t need any funky configurations? We don’t need to use XML? We don’t need to use JSON? If it is Clojure talking to Clojure we can just use data (or is it code, I am confused)
  18. Persister: Takes the data off the bus and writes it to disk. The Java ecosystem has tools for that, too: JRobin. We now have our own little time-series database and we didn’t really have to work for it.
  19. Consumer: Takes metrics and does stuff with them: Checks; Computes; Aggregates; Historical Aggregates
  20. Examples. Check: (check "mta-delay" :degraded (above (* 3600 72)) :fmt "%,.1f secs") Compute: (compute "mem-swap-used" :using [swap_total swap_free] :as (- swap_total swap_free))
  21. Aggregate: (aggregate "cfg-anomalies") Historical Aggregate: (hist-aggregate "index-percent-failed" "index-percent-failed#hist-1h" 3600 :agg-fn avg)
  22. Threading: All those metrics are Clojure Agents, so I don’t have to worry about it. All 16 of my processors get used. Life is easy
  23. Look
  24. WWW: We are not web developer types, which is fine; Clojure (plus some libraries) makes that easy, too: Compojure; Hiccup. So, no HTML. Just code: [:a {:href "/"}]
  25. Query: We need a way to query the data in real time. Clojure is homoiconic. So… We will just create our own DSL
  26. The DSL: It is just code. We can use existing Clojure functions plus new ones like: where, select, pivot, filter, sort, format, sum-by and agg-by
  27. Query Example: where :metric ["qsize" "qsize-2h-old" "rate"] | pivot | filter (> :qsize 50000) | select :host :qsize [(* 100 (- 1 (/ :qsize-2h-old :qsize))) :pct-recent] :rate | sort :pct-recent
  28. Explanation: Looks a lot like Linux pipes, which is a good way to think about it. The Clojure way of reading it is: (sort (select (filter (pivot (where)))))
  29. Output
  30. DevOps: What we needed (and what we got): Reports; Ad-hoc Queries; Corrective actions? Make the app smarter?
  31. Corrective Actions: Write little Python scripts that pull data and take actions. This was so easy that we had to do it. Simple, repetitive actions are now fully automated. Life is better
  32. App Smarter: The app now uses the monitoring to feed intelligently. Less operator interaction needed. More time spent solving real problems
  33. Q and A
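
Slides 14, 16 and 17 say the configuration is plain Clojure data and that the same data can travel over the bus without XML or JSON. The following is a minimal sketch of that idea, not the talk's actual code: `encode`, `decode` and `rows` are hypothetical names, the RabbitMQ publish step is left out, and reading the payload with `clojure.edn` (data only, no evaluation) is an assumption made here for safety.

```clojure
(require '[clojure.edn :as edn])

;; The datacenter config from slide 14, kept as a flat vector.
(def datacenters ["dc1" "url1" "type1" "dc2" "url2" "type2"])

(defn encode
  "Hypothetical helper: prints Clojure data to UTF-8 bytes, ready to be
  used as a message body."
  [data]
  (.getBytes ^String (pr-str data) "UTF-8"))

(defn decode
  "Hypothetical helper: reads the bytes back into Clojure data.
  edn/read-string only reads data; it never evaluates code."
  [^bytes payload]
  (edn/read-string (String. payload "UTF-8")))

(defn rows
  "Groups the flat vector into one map per datacenter."
  [config]
  (map (fn [[dc url kind]] {:dc dc :url url :type kind})
       (partition 3 config)))

;; Round trip: what comes off the wire equals what went on.
(= datacenters (decode (encode datacenters)))
```

The byte array returned by `encode` is the kind of value a message-bus client would accept as a body, so no separate serialization format is needed when both ends are Clojure.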
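
Slide 20 shows the `check` and `compute` forms but not how they are built. One way such forms could be implemented is sketched below, as an assumption rather than the talk's implementation: `above` returns a predicate, `check` is an ordinary function that returns a map, and `compute` has to be a macro because `swap_total` and `swap_free` are unbound symbols at the call site. `run-check` is a hypothetical helper added for illustration.

```clojure
(defn above
  "Returns a predicate that is true when a value exceeds threshold."
  [threshold]
  (fn [value] (> value threshold)))

(defn check
  "Sketch: describes a check as plain data."
  [metric severity pred & {:keys [fmt] :or {fmt "%s"}}]
  {:metric metric :severity severity :pred pred :fmt fmt})

(defn run-check
  "Hypothetical helper: returns nil when the value passes, or a map
  describing the alert when the predicate fires."
  [{:keys [metric severity pred fmt]} value]
  (when (pred value)
    {:metric metric
     :severity severity
     :text (format fmt (double value))}))

(defmacro compute
  "Sketch: turns the :using symbols into the parameters of a function
  whose body is the :as expression."
  [metric _using inputs _as expr]
  `{:metric ~metric
    :inputs '~inputs
    :f (fn ~inputs ~expr)})

;; The two forms from slide 20, unchanged.
(def mta-delay
  (check "mta-delay" :degraded (above (* 3600 72)) :fmt "%,.1f secs"))

(def swap-used
  (compute "mem-swap-used" :using [swap_total swap_free]
           :as (- swap_total swap_free)))

;; 72 hours is 259200 seconds, so a 300000 second delay is degraded.
(:severity (run-check mta-delay 300000))

;; Derived metric: total minus free.
((:f swap-used) 8192 2048)
```

Because both forms evaluate to maps, the consumer could keep them in a collection and apply them to every incoming metric.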
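
Slide 22 says every metric is a Clojure agent. A small sketch of what that could look like follows, with hypothetical metric names and a hypothetical `record!` function. Each agent applies its updates one at a time, while different agents run on a shared thread pool, which is how the work can spread across cores without explicit locking.

```clojure
;; One agent per metric; each holds a vector of samples.
(def metrics
  (into {} (for [n ["qsize" "rate" "mta-delay"]]
             [n (agent [])])))

(defn record!
  "Hypothetical helper: queues an update on the metric's agent and
  returns immediately."
  [metric-name value]
  (send (metrics metric-name) conj value))

;; Queue 100 samples for every metric.
(doseq [n (keys metrics)
        i (range 100)]
  (record! n i))

;; Block until every queued update has been applied.
(apply await (vals metrics))

;; Each agent has seen all of its samples.
(map (fn [[n a]] [n (count @a)]) metrics)

;; Stop the agent thread pool so a script can exit cleanly.
(shutdown-agents)
```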
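
Slides 27 and 28 describe the query as a pipeline that reads as nested calls. The same left-to-right shape can be written in core Clojure with the `->>` threading macro, as in the rough sketch below. The sample rows, host names and the functions `where-metric`, `pivot` and `pct-recent` are invented for illustration and are not the DSL from the talk.

```clojure
;; Invented sample data: one row per host and metric.
(def samples
  [{:host "mx1" :metric "qsize"        :value 80000}
   {:host "mx1" :metric "qsize-2h-old" :value 20000}
   {:host "mx1" :metric "rate"         :value 120}
   {:host "mx2" :metric "qsize"        :value 60000}
   {:host "mx2" :metric "qsize-2h-old" :value 54000}
   {:host "mx2" :metric "rate"         :value 95}
   {:host "mx3" :metric "qsize"        :value 1000}
   {:host "mx3" :metric "qsize-2h-old" :value 900}
   {:host "mx3" :metric "rate"         :value 10}
   {:host "mx1" :metric "load"         :value 3}])

(defn where-metric
  "Keeps only rows whose metric name is in names."
  [names rows]
  (filter (comp (set names) :metric) rows))

(defn pivot
  "Turns the rows of each host into one map keyed by metric name."
  [rows]
  (for [[host host-rows] (group-by :host rows)]
    (into {:host host}
          (map (juxt (comp keyword :metric) :value) host-rows))))

(defn pct-recent
  "Percentage of the queue that is newer than two hours."
  [{:keys [qsize qsize-2h-old]}]
  (* 100 (- 1 (/ qsize-2h-old qsize))))

;; where | pivot | filter | select | sort, written with ->>
(def result
  (->> samples
       (where-metric ["qsize" "qsize-2h-old" "rate"])
       pivot
       (filter #(> (:qsize %) 50000))
       (map #(assoc (select-keys % [:host :qsize :rate])
                    :pct-recent (pct-recent %)))
       (sort-by :pct-recent)))

(map :host result)
```

Expanding the `->>` form by hand gives the nested reading from slide 28, with `sort-by` outermost and `where-metric` innermost.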
