Doing data science with Clojure


Having programmers do data science is terrible, if only everyone else were not even worse! The problem is tools: either a bunch of libraries and an agnostic IDE, or some point-and-click wonder which, no matter how glossy, never quite fits our needs. The dual Lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope that there is a third way. This talk is a meditation on how I do data science with Clojure, what the ideal process would look like, and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.

Clojure is fantastic for data manipulation and rapid prototyping, but falls short when it comes to communicating your insights. What is lacking are good visualization libraries and (shareable) notebook-like environments. I'll show my workflow, which weaves Clojure with R (for ggplot) and Python (for scikit-learn), and tell you why it's wrong; how the IPythons of the world have trapped us in a local maximum; and why we need a reconceptualization similar to what the REPL did for programming. All this is interspersed with my experience doing data science with Clojure (everything from ETL to on-the-spot analysis during brainstorming sessions) and how that experience shaped the design of Huri, my library for the lazy data scientist.

Published in: Data & Analytics

  1. Doing data science with Clojure @sbelak
  2. The analytics chasm [diagram labels: Ideal: almost real-time, can be done during brainstorming without disrupting flow; < 2min; < 20min; project; squeeze in somewhere in the day; fail; roadmap]
  3. Easy things should be easy, and hard things should be possible. — L. Wall
  4. Data frames considered harmful
     • Data frame (=table) conflates representation and abstraction
     • Clojure excels in structure manipulation/encoding
  5. [Untitled slide]
     • No data structures, just functions over collections
     • Composable (even DSLs — no macros!)
     • Reasonably fast (transducers <3)
     • Do-what-I-mean (auto-sort, liberal with inputs, …)
     • Minimal buy-in
     • Support reaching into nested structures everywhere
  6. Composability is key to quick iterating
     • Curried versions where possible
     • ->> and partial friendly
     • Side benefit: consistent API
  7. “This is possibly Clojure’s most important property: the syntax expresses the code’s semantic layers. An experienced reader of Clojure can skip over most of the code and have a lossless understanding of its high-level intent.” — Z. Tellman, Elements of Clojure
  8. Live programming
  9. Catching errors early: more context, easier debugging, faster iterating
  10. clojure.spec
  11. Queryable data descriptions
  12. <3 Bret Victor
  13. Think in distributions, not numbers
  14. The power of sharing runtime
  15. Notebooks as dashboards
  16. The ecosystem
  17. What about machine learning? Farm it out to sklearn
  18. Mini compilers targeting a specific library in another language
  19. huri.plot
     • DSL that compiles to ggplot2
     • Targets Gorilla REPL
     • Follows the rest of Huri’s design philosophy
     • bar chart, scatter plot, line chart, box & violin plot, heatmap, histogram
  20. Takeouts
     • Speed-of-answer matters
     • Data science is about communication
     • We don’t have to reinvent every wheel in Clojure
     • Clojure is fantastic at structure manipulation, play to its strengths
     • Blurring the line between environment and work is a powerful idea
  21. Questions @sbelak
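
The composability points on slides 5 and 6 ("functions over collections", "->> and partial friendly", "transducers", "reaching into nested structures") can be illustrated with a minimal sketch. It uses only clojure.core and is not Huri's actual API; the sample `orders` data and the helper name `where-at-least` are hypothetical and exist purely for illustration.

```clojure
;; Illustrative only: plain clojure.core, NOT Huri's API.
;; `orders` and `where-at-least` are hypothetical names made up for this sketch.
(def orders
  [{:customer {:country "SI"} :total 120}
   {:customer {:country "DE"} :total 80}
   {:customer {:country "SI"} :total 45}
   {:customer {:country "DE"} :total 200}])

;; Taking the collection as the LAST argument keeps the helper
;; usable with both ->> and partial.
(defn where-at-least
  [k threshold coll]
  (filter #(>= (get % k) threshold) coll))

;; partial yields a reusable, pre-configured step.
(def large-orders (partial where-at-least :total 50))

;; A pipeline of plain functions over a plain vector of maps.
;; get-in reaches into the nested :customer map.
(def total-by-country
  (->> orders
       large-orders
       (group-by #(get-in % [:customer :country]))
       (reduce-kv (fn [acc country rows]
                    (assoc acc country (reduce + (map :total rows))))
                  {})))

;; The same filter-then-sum logic as a transducer:
;; the steps are composed once and no intermediate sequences are built.
(def large-total
  (transduce (comp (filter #(>= (:total %) 50))
                   (map :total))
             +
             orders))
```

Because every step is an ordinary function whose last argument is the collection, steps can be reordered, partially applied, or swapped for transducers without changing the data representation, which is the property the slides argue a data-frame abstraction tends to obscure.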