Operationalizing Clojure in mature enterprises can be difficult. I'm presenting a case study from my experience deploying and maintaining a Clojure application for delivering ad-free videos to the ISS for NASA. The goal is to tease out the core principles that make an application "operational".
2. –Douglas Hofstadter (“I am a Strange Loop”)
“We don't want to focus on the trees (or their leaves)
at the expense of the forest.”
3. My Clojure Story
2009: Introduced to Clojure; didn’t have prior Lisp experience.
2011: Did my senior project on simulating mobile ad-hoc networks using Clojure at Trinity College.
2011-2013: Started working at ESPN Innovation.
Worked in a variety of other languages: Java, Ruby, Python, JavaScript, C++
Clojure was my primary interface to the JVM for experimentation
2013: Decided to use Clojure to deliver ESPN programming to the International Space Station
2015: SparX
4. Requirements
Cmdr. Chris Cassidy reached out to request regular ESPN programming.
200 MB file limit
Had to be ready every day at noon Central Time
Obvious choice:
Let’s hire people to clip and send videos every day!
5. But it’s 2013
Why not automate?
Also, let’s remove ads.
Motive: Validating the video services and interfaces we
had been working on.
Ok, so why Clojure?
6. Why Clojure?
Two weeks to deadline
Not all the pieces were clear
No guarantees from upstream services
Human errors abound
Source of data was people pressing buttons
And, systems failing would result in similar behavior
7. Why Clojure?
Immutability
I could keep the system as a “constant” in an ever-changing world
Idempotency - re-run if failed, resume at any point in pipeline.
Java Interop
Even when I had APIs that weren’t written by my group, they
were SOAP and XML based. Yay!
Inherently refactorable if designed correctly
8. Post-mortem
Still in production since September 2013
Strictly enforced the “naïve” approach that “should”
work
Learned a lot of lessons that go beyond Clojure
This talk is about these lessons
9. - Paul Graham
(“Hackers & Painters: Big Ideas from the Computer Age”)
“When you're forced to be simple, you're forced to
face the real problem.”
10. Parts of the stack
Core Assumptions
Operations
Familiar Interfaces
Overrides
State
Logging
Error Handling
Iterative Development
11. Core: Timestamps
Programs — items that have a name and “start” and
“end” times
Program Segments, Breaks — blocks within a program
that “start” and “end” at particular times.
It’s just a map and reduce operation now!!
Take only program segments and make them into a
video.
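The map and reduce idea above can be sketched as follows. This is an illustrative Python sketch, not the talk's Clojure code; the `Block` type, its field names, and the sample times are assumptions made up for the example.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Block:
    kind: str   # "segment" or "break" (hypothetical labels)
    start: int  # seconds from the start of the program
    end: int


def to_clip(block):
    """Map step: turn a block into a (start, end) clip range."""
    return (block.start, block.end)


def merge_contiguous(clips, clip):
    """Reduce step: extend the last clip when the next one starts where it ended."""
    if clips and clips[-1][1] == clip[0]:
        return clips[:-1] + [(clips[-1][0], clip[1])]
    return clips + [clip]


def clips_without_breaks(blocks):
    """Keep only program segments and fold them into the ranges to cut."""
    segments = sorted((b for b in blocks if b.kind == "segment"),
                      key=lambda b: b.start)
    return reduce(merge_contiguous, map(to_clip, segments), [])


blocks = [
    Block("segment", 0, 420),
    Block("break", 420, 600),
    Block("segment", 600, 900),
    Block("segment", 900, 1500),
]
print(clips_without_breaks(blocks))  # [(0, 420), (600, 1500)]
```

Because everything is keyed on start and end times, the break simply never makes it into the output ranges.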
12. Why was it a good idea?
A bare set of functionality binds everything together.
Everything else is a good signal that would make the system “better”, but it is not dependable.
Aligning timestamps in a UI makes it dead easy to see where things are misaligned.
TV Programs are events too.
13. Core: Dependency Graph
Your tasks are dependent on previous tasks
What’s the plan when they fail to execute?
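One possible answer to the question above, sketched in Python (3.9+, for `graphlib`) rather than Clojure: run tasks in dependency order and skip anything downstream of a failure, so the run reports exactly where it stopped. The task names and the `run_pipeline` helper are hypothetical, not taken from the talk.

```python
from graphlib import TopologicalSorter


def run_pipeline(deps, actions):
    """Run tasks in dependency order; skip anything downstream of a failure.

    deps maps task -> set of tasks it depends on.
    actions maps task -> zero-argument callable that raises on failure.
    """
    status = {}
    for task in TopologicalSorter(deps).static_order():
        # A task runs only if every task it depends on finished cleanly.
        if any(status[d] != "ok" for d in deps.get(task, ())):
            status[task] = "skipped"
            continue
        try:
            actions[task]()
            status[task] = "ok"
        except Exception:
            status[task] = "failed"
    return status


def failing_transcode():
    raise RuntimeError("transcoder unavailable")


deps = {
    "fetch": set(),
    "clip": {"fetch"},
    "transcode": {"clip"},
    "upload": {"transcode"},
}
actions = {
    "fetch": lambda: None,
    "clip": lambda: None,
    "transcode": failing_transcode,
    "upload": lambda: None,
}
status = run_pipeline(deps, actions)
```

Here `status` records `transcode` as failed and `upload` as skipped, which is enough to decide what to re-run.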
15. On Operations
Functional programs still need operational expertise
If you’re in a big enough company with an ops team:
They don’t care about your FP patterns, and they shouldn’t have to.
Make configurations declarative and readable
16. On Familiar Interfaces
Use standard configuration formats: readable, parseable by anything
I picked YAML
Familiar scheduling
Used cron strings, thanks to Quartz
Everything in UTC internally
Timezones treated as side effects
programs:
  - name: AROUND THE HORN
    short_name: ATH
    start_time: "20:00:00"
  - name: PARDON THE INTERRUPTION
    short_name: PTI
    start_time: "20:30:00"
  - name: SPORTSCENTER
    short_name: SportsCenter
    start_time: "14:00:00"
run:
  cron: 0 0 14 1/1 * ? *

final_tz: America/Anchorage
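The "UTC internally, timezones as side effects" rule can be sketched like this. It is a Python (3.9+) illustration under assumptions: the helper names are invented, and only the `final_tz` key comes from the configuration shown on the slide.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def utc_now():
    """All internal timestamps are timezone-aware UTC."""
    return datetime.now(timezone.utc)


def for_display(instant, tz):
    """Convert to a local zone only at the edge (file names, UI, reports)."""
    if instant.tzinfo is None:
        raise ValueError("naive datetime: internal times must be UTC-aware")
    return instant.astimezone(tz)


def final_tz(config):
    """Look up the IANA zone named by the config's final_tz key."""
    return ZoneInfo(config["final_tz"])


# Internal value: an unambiguous instant in UTC.
run_at = datetime(2013, 9, 1, 20, 0, tzinfo=timezone.utc)
```

A caller would convert only when presenting a time, for example `for_display(run_at, final_tz({"final_tz": "America/Anchorage"}))`, and never store the converted value.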
17. On Familiar Interfaces
Started with a solid command line interface.
Took the Config and Options abstractions and exposed them as a REST API.
 Switches                         Default        Desc
 --------                         -------        ----
 -c, --config                     nasamatic.yml  Use this config file path
 -h, --no-help, --help            false          Show Help
 -f, --no-force, --force          false          Force run now instead of using Cron
 -u, --no-upload, --upload        true           Upload or not
 -t, --no-transcode, --transcode  true           Transcode or not
 -B, --hours-before-now           0              How many hours before now to look at
 -d, --no-dry-run, --dry-run      false          Dry Run mode
Options
18. On Familiar Interfaces
Also wrote a web UI in AngularJS for the Operations team to use when runs failed
The system failed rarely enough that I had to retrain people all the time.
Just gave up and used the CLI tool most of the time
UI breakage due to JavaScript issues
Exposing the API to Slack was more popular
19. On Familiar Interfaces
One-to-one correspondence between CLI and JSON
Key                    Switch                            Type    Default  Description
upload                 -u, --[no-]upload                 flag    TRUE     Upload to the FTP server
transcode              -t, --[no-]transcode              flag    TRUE     Pass the files through transcoder
qc                     -q, --[no-]qc                     flag    FALSE    Submit file to be QC’d by Pulsar
hours-before-now       -B, --hours-before-now            int     0        Number of hours before to look
dry-run                -d, --dry-run                     flag    FALSE    Run without affecting filesystem/uploading
filter-by-program-tag  -p, --[no-]filter-by-program-tag  flag    TRUE     Select contiguous programTags from Authnet or not
short-names            -s, --short-names                 string           Programs to select as declared in the configuration file under programs. Default behavior is to run all programs declared in configuration.
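One way to keep the CLI and the JSON interface in one-to-one correspondence is to derive both from a single option table. The sketch below is Python (3.9+, for `argparse.BooleanOptionalAction`), not the talk's Clojure; `SPEC` is a hypothetical subset of the table above, and the program name `job` and helper names are invented for the example.

```python
import argparse
import json

# Hypothetical subset of the option table: key -> (short switch, type, default).
SPEC = {
    "upload": ("-u", "flag", True),
    "transcode": ("-t", "flag", True),
    "dry-run": ("-d", "flag", False),
    "hours-before-now": ("-B", "int", 0),
}


def build_parser(spec):
    """Generate the command-line parser from the option table."""
    parser = argparse.ArgumentParser(prog="job")
    for key, (short, kind, default) in spec.items():
        dest = key.replace("-", "_")
        if kind == "flag":
            # BooleanOptionalAction provides both --key and --no-key.
            parser.add_argument(short, f"--{key}", dest=dest, default=default,
                                action=argparse.BooleanOptionalAction)
        else:
            parser.add_argument(short, f"--{key}", dest=dest, default=default,
                                type=int)
    return parser


def options_from_cli(spec, argv):
    """Parse switches into a dict keyed exactly like the JSON body."""
    parsed = vars(build_parser(spec).parse_args(argv))
    return {key: parsed[key.replace("-", "_")] for key in spec}


def options_from_json(spec, body):
    """Read the same options from a JSON object, filling in defaults."""
    given = json.loads(body)
    unknown = set(given) - set(spec)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    return {key: given.get(key, default) for key, (_, _, default) in spec.items()}


from_cli = options_from_cli(SPEC, ["--no-upload", "-B", "3"])
from_json = options_from_json(SPEC, '{"upload": false, "hours-before-now": 3}')
```

Both calls produce the same dictionary, so a job started from the command line and one started over HTTP are indistinguishable downstream. The JSON path here does not check value types; a real service would likely add that.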
20. On Overrides
Core Abstractions - Config and Options
Config: A static set of parameters that defines the general behavior of the program. Doesn’t change too often.
Options: A dynamic set of parameters that can override config per run.
Every job gets defined entirely by them.
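The Config and Options split above amounts to a merge in which per-run options win. A minimal Python sketch, assuming flat key/value maps (the keys shown are borrowed from the option table; the function name is invented):

```python
from types import MappingProxyType

# Static config: a read-only view, so no run can change it by accident.
CONFIG = MappingProxyType({
    "upload": True,
    "transcode": True,
    "hours-before-now": 0,
})


def job_definition(config, options):
    """Options override config for one run; neither input is mutated."""
    return {**config, **options}


job = job_definition(CONFIG, {"upload": False})
```

The resulting `job` map is the entire definition of the run, which is what makes it easy to log, replay, or hand to another interface.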
21. On State
Keep the least amount of state possible
The system used no database at all for operations.
Relied on the intermediate files produced as effects of each step
Only the last-seen state has to be kept for live operation.
Re-running is trivial.
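Relying on intermediate files is what makes re-running trivial: a step whose output already exists can be skipped. The following is a Python sketch of that idea under assumptions; `run_step`, the `.partial` suffix, and the file name are hypothetical and not from the talk.

```python
import tempfile
from pathlib import Path


def run_step(output, produce):
    """Run produce(tmp_path) only when output is missing; return True if it ran.

    The step writes to a temporary name first and renames at the end, so a
    crash never leaves a half-written file that looks like a finished one.
    """
    if output.exists():
        return False
    tmp = output.with_name(output.name + ".partial")
    produce(tmp)
    tmp.replace(output)
    return True


calls = []


def fake_transcode(path):
    """Stand-in for an expensive step such as transcoding."""
    calls.append(path.name)
    path.write_text("video")


with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir) / "ath.mp4"
    first = run_step(out, fake_transcode)   # does the work
    second = run_step(out, fake_transcode)  # output exists, so it is skipped
```

The file system is the only state: deleting an intermediate file is how you ask for that step to be redone.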
22. On Logging
Timestamp, state, key=value
Parseable by anything! (It was Splunk’s weirdness that led to this)
Can generate metrics from ongoing operations without instrumenting further.
Wired to PagerDuty directly
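A log line in the "timestamp, state, key=value" shape can be sketched as below. This is a Python illustration with assumed details: the quoting rule, the state name, and the field names are invented for the example, and the parser handles only the simple cases shown.

```python
import json
import shlex
from datetime import datetime, timezone


def _quote(value):
    """Quote values containing spaces, quotes, or '=' so they stay one token."""
    text = str(value)
    return json.dumps(text) if any(c in text for c in ' "=') else text


def log_line(state, now=None, **fields):
    """Format: <ISO-8601 UTC timestamp> <STATE> key=value key=value ..."""
    now = now or datetime.now(timezone.utc)
    pairs = " ".join(f"{k}={_quote(v)}" for k, v in sorted(fields.items()))
    return f"{now.isoformat()} {state} {pairs}".rstrip()


def parse_line(line):
    """Split a line back into (timestamp, state, fields); values stay strings."""
    timestamp, state, *pairs = shlex.split(line)
    fields = dict(pair.split("=", 1) for pair in pairs)
    return timestamp, state, fields


line = log_line(
    "TRANSCODE_DONE",
    now=datetime(2013, 9, 1, 17, 0, tzinfo=timezone.utc),
    program="AROUND THE HORN",
    seconds=42,
)
print(line)
# 2013-09-01T17:00:00+00:00 TRANSCODE_DONE program="AROUND THE HORN" seconds=42
```

Since every line carries a state and plain key/value pairs, counting states or summing a field is enough to get metrics without extra instrumentation.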
23. On Error Handling
Find out about the error and try to fix it; if that is not possible, the system should try the whole process again with the next day/job
Parent form generates a random trace-id for a job
Passed to all children for that job
Any exceptions are passed via the chain and logged
Back off and retry; if all else fails, let humans figure it out.
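The trace-id and back-off-and-retry points above can be sketched together. This is a Python illustration, not the talk's implementation: the function names, the attempt count, the doubling delay, and the log field names are all assumptions.

```python
import logging
import time
import uuid

log = logging.getLogger("job")


def with_retry(step, trace_id, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step(trace_id); on failure log it, back off, and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return step(trace_id)
        except Exception as exc:
            log.error("state=STEP_FAILED trace_id=%s attempt=%d error=%r",
                      trace_id, attempt, exc)
            if attempt == attempts:
                raise  # out of retries: let humans figure it out
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...


def run_job(steps, **retry_kwargs):
    """The parent generates one random trace id and passes it to every child."""
    trace_id = uuid.uuid4().hex
    for step in steps:
        with_retry(step, trace_id, **retry_kwargs)
    return trace_id


# Demo: a step that fails twice before succeeding; sleeping is stubbed out.
delays, seen, counter = [], [], {"calls": 0}


def flaky_step(trace_id):
    seen.append(trace_id)
    counter["calls"] += 1
    if counter["calls"] < 3:
        raise RuntimeError("upstream not ready")


job_trace_id = run_job([flaky_step], sleep=delays.append)
```

Every log line for the job carries the same `trace_id`, so a single search reconstructs the whole chain of attempts.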
26. Operational Clojure
Builds on simple concepts
they’re the units of composition
Sparingly depends on global state, if at all
Leverages existing infrastructure and people
Adapts to changes in scope and requirements
Loosely couples data and execution
27. Future
I had a great time coming up with some of these patterns
Particularly - config and options for jobs
Thinking about open source re-implementations
More Clojure-y things at SparX coming soon. ;)