Clojure has always been good at manipulating data. With the release of spec and Onyx (“a masterless, cloud scale, fault tolerant, high performance distributed computation system”) good became best. In this talk you will learn about a streaming data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient for the end user to work with. The focus will be on the power and elegance of describing data and computation with data; the inferences and automations that can be built on top of that; and how and why Clojure is a natural choice for tasks that involve a lot of data manipulation, touching on both functional programming and Lisp specifics such as code-is-data.
We will look at how such an approach can be used to manage a data warehouse by automatically inferring materialized views from raw incoming data or from other views, based on a combination of heuristics, statistical analysis (seasonality, outlier removal, ...) and predefined ontologies. Doing so is a practical way to maintain a large number of views, increasing their availability and abstracting the complexity into declarative rules, rather than having an ETL pipeline with dozens or even hundreds of hand-crafted tasks.
The system described requires relatively little effort upfront but can easily grow with one's needs in both scale and scope. With its good introspection capabilities and strong decoupling it is, for instance, an excellent substrate for putting machine learning algorithms in production, which is the final use case we will dive into.
4. The analytics chasm
Ideal. Almost real-time, can be done during brainstorming without disrupting flow
< 2min < 20min project
squeeze in somewhere in the day
fail
roadmap ahoy!
5. My goto architecture
[Diagram labels: Kafka, DB, Events, Onyx (three times)]
Persist all events to S3
• time travel
• query with AWS Athena
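The "time travel" point can be illustrated with a minimal sketch, assuming events are immutable records that carry a timestamp: once every event is persisted, the state at any past moment can be rebuilt by replaying the log up to that moment. The `Event` shape, the `state_as_of` helper and the sample log are hypothetical names chosen for illustration; they are not part of Onyx, Kafka or S3.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape: a timestamp, an entity id, changed attributes."""
    ts: int
    entity: str
    attrs: Dict[str, Any]


def state_as_of(events: Iterable[Event], ts: int) -> Dict[str, Dict[str, Any]]:
    """Replay all events with event.ts <= ts into a per-entity state map."""
    state: Dict[str, Dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e.ts):
        if event.ts > ts:
            break  # everything after this point lies in the "future"
        state.setdefault(event.entity, {}).update(event.attrs)
    return state


# Hypothetical persisted log (in the slide's architecture this would live in S3).
LOG = [
    Event(1, "user-1", {"country": "SI"}),
    Event(2, "user-1", {"plan": "free"}),
    Event(3, "user-1", {"plan": "pro"}),
]

print(state_as_of(LOG, 2))  # state before the upgrade event at ts=3
```

Because the log is never mutated, the same replay can be rerun with new logic at any time, which is what makes reprocessing and backfilling cheap in this kind of architecture.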
6. Onyx
a masterless, cloud scale, fault tolerant, high performance distributed computation system
… written entirely in Clojure
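Onyx jobs are described as plain data: a workflow is a list of edges between named tasks, and a catalog describes each task. The sketch below imitates that idea in Python for a strictly linear pipeline. It is not the Onyx API; `WORKFLOW`, `CATALOG`, `run` and the task names are hypothetical, and the toy interpreter runs in a single process with none of the distribution or fault tolerance.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# The computation is described as data: edges between named tasks.
WORKFLOW: List[Tuple[str, str]] = [
    ("in", "parse"),
    ("parse", "enrich"),
    ("enrich", "out"),
]

# Task name -> function applied to each segment (hypothetical tasks).
CATALOG: Dict[str, Callable[[Any], Any]] = {
    "parse": lambda line: dict(zip(("user", "amount"), line.split(","))),
    "enrich": lambda seg: {**seg, "amount": float(seg["amount"])},
}


def run(workflow: List[Tuple[str, str]],
        catalog: Dict[str, Callable[[Any], Any]],
        segments: Iterable[Any],
        source: str = "in",
        sink: str = "out") -> List[Any]:
    """Interpret a linear workflow by following edges from source to sink."""
    successor = dict(workflow)  # assumes exactly one outgoing edge per task
    results = list(segments)
    task = successor[source]
    while task != sink:
        results = [catalog[task](segment) for segment in results]
        task = successor[task]
    return results


def tasks(workflow: List[Tuple[str, str]]) -> List[str]:
    """Since the workflow is data, tooling can inspect it without running it."""
    return sorted({name for edge in workflow for name in edge})


print(run(WORKFLOW, CATALOG, ["ana,12.5"]))
print(tasks(WORKFLOW))
```

The point of the design is the second function: since the job is an ordinary data structure, it can be generated, validated, visualized or rewritten by other programs.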
7. Clojure at a glance
• Lisp running on JVM
• Functional, dynamic, immutable
• Excellent concurrency and state management support
• Unparalleled data manipulation
• Good Java interoperability
8. Onyx at
• In production for almost a year
• ETL
• online machine learning
• offline (batch) machine learning
• ad-hoc analysis
20. Machine learning with Onyx
• Hyperparameter server built on top of Onyx
parameters
• Batch & streaming mode
• Monitoring: performance metrics, side channels for partial results/introspection into computation
• Everything is data so easy to build tools around
24. Queryable data descriptions
Turn spec into a graph
A fully interactive and open type system!
[Graph: nodes order, promo code, user, account age, country; edges labeled always or maybe]
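A minimal sketch of turning data descriptions into a queryable graph, assuming each description lists required and optional keys: required keys become "always" edges, optional keys become "maybe" edges. The node names come from the slide, but which edge carries which label is an assumption made here for illustration, as are the `SPECS` layout and the `to_graph` and `reachable` helpers. This is plain Python, not clojure.spec.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical descriptions; the always/maybe assignment is assumed.
SPECS: Dict[str, Dict[str, List[str]]] = {
    "order": {"required": ["user"], "optional": ["promo code"]},
    "user": {"required": ["account age", "country"], "optional": []},
}

Edge = Tuple[str, str, str]  # (from, to, "always" | "maybe")


def to_graph(specs: Dict[str, Dict[str, List[str]]]) -> List[Edge]:
    """Required keys become 'always' edges, optional keys 'maybe' edges."""
    edges: List[Edge] = []
    for name, spec in specs.items():
        edges += [(name, key, "always") for key in spec["required"]]
        edges += [(name, key, "maybe") for key in spec["optional"]]
    return edges


def reachable(edges: List[Edge], start: str, only_always: bool = False) -> Set[str]:
    """Breadth-first search; with only_always, follow only 'always' edges."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for src, dst, label in edges:
            if src != node or dst in seen:
                continue
            if only_always and label != "always":
                continue
            seen.add(dst)
            queue.append(dst)
    return seen


graph = to_graph(SPECS)
# Which attributes are guaranteed to be present for every order?
print(sorted(reachable(graph, "order", only_always=True)))
```

A query such as "what can I always rely on when I have an order?" then becomes a graph traversal, which is the kind of interactivity the slide hints at.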