Monitoring a Large-Scale Infrastructure with Clojure
Who am I?

          Dennis Rowe
    Senior Software Developer
    Dell MessageOne - DevOps
2                               Oracle OpenWorld 2011
MessageOne

    E-Mail Continuity
     E-Mail Archive
      E-Mail Search

The Basics
      2646 Servers
      3 Countries
    3 Billion E-Mail
    5 Million Users
    12 Tired People
We have got to have a
    way to monitor all
       that stuff…
     Maybe not the people

So, we came up with a solution…


“Kneel before Zod”
         -- General Zod


A Bit of History
             Initially written in Python
            Utilized Twisted framework
    Historical Data stored in relational database

         It worked, but it did not perform
Why?
     Global Interpreter Lock (GIL)
     caused performance problems

    Relational Database not efficient
           for time-series data
So…
                 Why switch to Clojure?
                         It is hip
     It was designed with multi-threading in mind
               It is a functional language
                     It uses the JVM
     We can use all the Java libraries lying around
                       Homoiconic
“And there was much rejoicing”
            -- Monty Python and the Holy Grail




So, this is how we did it




Loader
     Takes XML and dumps it on a
       Message Bus (RabbitMQ)
     Nothing much to see here but…
Data is Code
     So, how do we store the configurations we want for the various datacenters?

                             As code … data … code …

                                ["dc1" "url1" "type1"
                                 "dc2" "url2" "type2"]

               The configs are just Clojure code and they make sense
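
A minimal sketch of the idea, assuming the config is kept as a flat vector of datacenter, URL and type triples as on the slide. The names `config-text`, `raw-config` and `datacenters` are hypothetical, and `clojure.edn` (part of current Clojure releases) is used here so the text is read as data without being evaluated.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical config text, as it might sit in a file: a flat vector of
;; datacenter, URL, type triples, matching the shape on the slide.
(def config-text
  "[\"dc1\" \"url1\" \"type1\"
    \"dc2\" \"url2\" \"type2\"]")

;; edn/read-string parses data only; it never evaluates code.
(def raw-config (edn/read-string config-text))

;; Group the flat vector into one map per datacenter.
(def datacenters
  (mapv (fn [[dc url kind]] {:dc dc :url url :type kind})
        (partition 3 raw-config)))
```

No separate parser or schema is needed: the reader already understands the format.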

RabbitMQ
                       That is easy
        We will just use the RabbitMQ Java API
       We will create Clojure-centric data structures

     This whole Java interoperability is kind of nice …
                 things just kind of work
Also!
     If code is data … then we can just send the code over
                           RabbitMQ


Wait – What?
                We don’t need any funky configurations?
                     We don’t need to use XML?
                     We don’t need to use JSON?

     If it is Clojure talking to Clojure we can just use data (or is it
                            code, I am confused)
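
A rough sketch of what "just use data" can look like on the wire, not the actual Zod code. The `sample` map and the `encode`/`decode` helpers are hypothetical. The byte array from `encode` is the kind of body that the RabbitMQ Java client's `basicPublish` accepts; the broker calls are left out so the example runs on its own.

```clojure
(require '[clojure.edn :as edn])

;; A hypothetical metric sample; the keys are illustrative only.
(def sample {:host "mta01" :metric "qsize" :value 51234 :ts 1317686400})

(defn encode
  "Print a Clojure value to its textual form and return UTF-8 bytes,
  suitable for use as a message body."
  [value]
  (.getBytes (pr-str value) "UTF-8"))

(defn decode
  "Turn a message body back into a Clojure value, reading it as data only."
  [^bytes body]
  (edn/read-string (String. body "UTF-8")))

;; What the consumer sees after the message has crossed the bus.
(def round-tripped (decode (encode sample)))
```

The sender prints, the receiver reads, and both sides work with the same maps and vectors.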


Persister
       Takes the data off the bus and writes it to disk
         The Java ecosystem has tools for that, too

                           JRobin

We now have our own little time-series database and we didn’t
                really have to work for it.
Consumer
     Takes metrics and does stuff with them
                    Checks
                   Computes
                   Aggregates
             Historical Aggregates
Examples
                    Check
         (check "mta-delay" :degraded
                (above (* 3600 72)) :fmt "%,.1f secs")

                    Compute
          (compute "mem-swap-used"
                   :using [swap_total swap_free]
                   :as (- swap_total swap_free))
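
The slides do not show how `check`, `above` and `compute` are implemented. The following is only a guess at the shape, written as plain functions rather than the real DSL forms; every definition here is hypothetical.

```clojure
;; Hypothetical re-creation of the slide's vocabulary as plain functions.
(defn above
  "Return a predicate that is true when a value exceeds threshold."
  [threshold]
  (fn [value] (> value threshold)))

(defn check
  "Return a map with the severity and a formatted message when pred fires
  for value, otherwise nil."
  [metric-name severity pred fmt value]
  (when (pred value)
    {:metric   metric-name
     :severity severity
     :message  (format fmt (double value))}))

(defn mem-swap-used
  "The compute example as a function: used swap = total - free."
  [{:keys [swap_total swap_free]}]
  (- swap_total swap_free))

;; 72 hours is 259200 seconds, so a 300000-second delay is degraded.
(def result
  (check "mta-delay" :degraded (above (* 3600 72)) "%,.1f secs" 300000))
```

Because `above` returns an ordinary function, other conditions can be added without touching `check`.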
Aggregate
                  (aggregate "cfg-anomalies")

                       Historical Aggregate
     (hist-aggregate "index-percent-failed"
                     "index-percent-failed#hist-1h" 3600 :agg-fn avg)

Threading
     All those metrics are Clojure Agents, so I don’t
                 have to worry about it

            All 16 of my processors get used

                       Life is easy
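
A small sketch of the agent pattern, assuming one agent per metric holding a history vector. The metric names, the `metrics` map and `record!` are illustrative and not taken from the real system.

```clojure
;; One agent per metric; names and values are illustrative only.
(def metrics
  {"qsize" (agent [])
   "rate"  (agent [])})

(defn record!
  "Queue an asynchronous update that appends value to the metric's history.
  The update runs on Clojure's agent thread pool, not on the caller."
  [metric-name value]
  (send (get metrics metric-name) conj value))

(doseq [v [10 20 30]] (record! "qsize" v))
(record! "rate" 1.5)

;; Block until every queued update has been applied, then read the state.
(apply await (vals metrics))

(def qsize-history @(get metrics "qsize"))
(def rate-history  @(get metrics "rate"))

;; Only needed in a standalone script, so the JVM can exit.
(shutdown-agents)
```

Updates to a single agent are applied one at a time, so no explicit locks are needed, while different agents can be updated in parallel.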

Look




WWW
      We are not web developer types, which is fine:
     Clojure (plus some libraries) makes that easy, too

                        Compojure
                         Hiccup

                 So, no HTML. Just code
                      [:a {:href "/"}]

Query
     We need a way to query the data in real time
                Clojure is homoiconic
                        So…
          We will just create our own DSL

The DSL
                          It is just code

     We can use existing Clojure functions plus new ones like:
                              where
                               select
                               pivot
                               filter
                                sort
                              format
                        sum-by and agg-by
Query Example
where :metric ["qsize" "qsize-2h-old" "rate"] |
pivot |
filter (> :qsize 50000) |
select :host
          :qsize
          [(* 100 (- 1 (/ :qsize-2h-old :qsize))) :pct-recent]
          :rate |
sort :pct-recent

Explanation
         Looks a lot like Linux pipes
     Which is a good way to think about it


          Clojure way of reading it is:
      (sort (select (filter (pivot (where)))))
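
The same pipeline can be approximated in plain Clojure with the `->>` threading macro, which also reads left to right like pipes. This is a sketch over made-up sample data, not the real DSL: `samples`, `where-metric`, `pivot` and `pct-recent` are hypothetical helpers, and core `filter`, `select-keys` and `sort-by` stand in for the DSL's `filter`, `select` and `sort`.

```clojure
;; Made-up samples: one row per host and metric.
(def samples
  [{:host "mta01" :metric "qsize"        :value 80000}
   {:host "mta01" :metric "qsize-2h-old" :value 20000}
   {:host "mta01" :metric "rate"         :value 12}
   {:host "mta01" :metric "load"         :value 2}
   {:host "mta02" :metric "qsize"        :value 60000}
   {:host "mta02" :metric "qsize-2h-old" :value 30000}
   {:host "mta02" :metric "rate"         :value 7}
   {:host "mta03" :metric "qsize"        :value 100}
   {:host "mta03" :metric "qsize-2h-old" :value 50}
   {:host "mta03" :metric "rate"         :value 3}])

(defn where-metric
  "Keep only the rows whose :metric is one of names."
  [names rows]
  (let [wanted (set names)]
    (filter #(wanted (:metric %)) rows)))

(defn pivot
  "Turn per-metric rows into one map per host, keyed by metric name."
  [rows]
  (for [[host host-rows] (group-by :host rows)]
    (into {:host host}
          (map (fn [{:keys [metric value]}] [(keyword metric) value])
               host-rows))))

(defn pct-recent
  "Percentage of the queue that is newer than two hours."
  [{:keys [qsize qsize-2h-old]}]
  (* 100 (- 1 (/ qsize-2h-old qsize))))

(def report
  (->> samples
       (where-metric ["qsize" "qsize-2h-old" "rate"])
       pivot
       (filter #(> (:qsize %) 50000))
       (map #(assoc (select-keys % [:host :qsize :rate])
                    :pct-recent (pct-recent %)))
       (sort-by :pct-recent)))
```

Each stage takes a sequence of maps and returns another, which is what lets the stages be chained in any order.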

Output




DevOps
     What we needed (and what we got)

                 Reports
            Ad-hoc Queries
           Corrective actions?
          Make the app smarter?

Corrective Actions
              Write little Python scripts that
               pull data and take actions
           This was so easy that we had to do it
     Simple, repetitive actions are now fully automated
                       Life is better

App Smarter
     App now uses the monitoring to feed intelligently
             Less operator interaction needed
          More time spent solving real problems

Q and A





JavaOne 2011: Monitoring a Large-Scale Infrastructure with Clojure
