Big Data Pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium, ThoughtWorks.
Big Data Pipeline
A data engineering startup, with
a vision to simplify data
engineering and empower the
next generation of data powered
Who Am I?
• Founder and CEO @ Tuplejump
• Previously worked at Pramati, Cordys and a couple of startups.
• A polyglot developer
• Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell
• Love data hacking in R and Python
• Committed to Scala
• Believe in choosing the best tool for the task
• Open Source fanatic
@milliondreams | mytechrantings.blogspot.com
The big data pipeline
The Tuplejump Platform
from push and
Spark to batch
API with added
or load data
from and to
Cassandra provides a single storage mechanism for files,
structured and unstructured data, and generic data.
Building on Spark's ML framework and
moving towards machine-assisted
insights, we are building our own
EA and ANN/DL frameworks to take
ML to the next level.
Shark + Calliope
Ad hoc querying with Shark on your
data in Dstore.
An OLAP cube engine
A modern, game-changing data frontend that
is “not just dashboards”, providing highly
interactive and reactive visualizations.
• All the advantages of Spark + All the advantages of Cassandra + Much more!
• Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions
• Shark + C* provide superfast ad hoc querying.
• UberCube empowers sub-millisecond responses on very large cubes
• MinerBot provides ready-to-use ML algorithms, plus the possibility of much more complex
algorithms and mechanisms than plain MapReduce.
• Ready to use, no integration required
• Easy to develop, deploy, monitor and scale
• Object oriented and functional
• Runs on the JVM
• 100% compatible with Java
• Modern, evolving, scalable
• Concise, flexible and high performance
• Excellent support for DSL development
• Spark and Play use Scala as their primary language
• We have used it for a long time and we love it!
The ultimate gyan!
You can flirt with other languages,
you can have short affairs with a few,
You will fall in love with Scala at first sight,
You have to marry her to know her!
Let’s call in some friends
• Akka - Actors to build concurrent and distributed applications
• Spark - The blue eyed whiz kid of the Big Data class
• Play - The web development champion
• SBT - The best builder in town
• ScalaTest - The story teller
• Shapeless and Scalaz - Masters of the Dark Arts
Concurrency With Akka
• Inspired by Erlang’s Actor Model
• Runs on the JVM
• Actors define behavior to handle typed messages
• Actors process one message at a time
• Can use Group/Pool of actors behind routers for concurrency
• Can run thousands of actors on a modern server
• Location transparency for clustering
• Supervision and state recovery for HA
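The "one message at a time" point above can be illustrated without any library. The sketch below is not Akka and does not use Akka's API. It is a toy built on a single-threaded executor from the JDK, and the names CounterActor, CounterMsg, Increment, Add and stopAndGet are hypothetical, made up for this illustration. It only shows why state inside an actor needs no locks: every message is handled on the same mailbox thread, in arrival order.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical message protocol for this sketch.
sealed trait CounterMsg
case object Increment extends CounterMsg
final case class Add(n: Int) extends CounterMsg

// A toy "actor": private state plus a mailbox drained by a single thread,
// so messages are handled strictly one at a time. This is NOT Akka.
final class CounterActor {
  private val mailbox = Executors.newSingleThreadExecutor()

  // Written only by the mailbox thread; volatile so the final read is visible.
  @volatile private var total = 0

  // Fire-and-forget send, in the spirit of an actor "tell".
  def !(msg: CounterMsg): Unit =
    mailbox.execute(new Runnable {
      def run(): Unit = msg match {
        case Increment => total += 1
        case Add(n)    => total += n
      }
    })

  // Stop accepting messages, drain what is queued, and return the final state.
  def stopAndGet(): Int = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
    total
  }
}
```

Several threads can send Increment concurrently and no update is lost, because only the mailbox thread ever mutates the counter. A real Akka actor adds what this toy lacks: routers, location transparency and supervision.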
Batch Processing with Spark
• Resilient Distributed Datasets
• Fast in-memory big data
• Map/Reduce on steroids
• Iterative and interactive
• Code in Scala, Java, Python and now R
• Streaming (DStreams - Batch processing on streams)
• MLlib, Shark, Spark SQL, GraphX and more
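To make "Map/Reduce on steroids" concrete, here is a word count written against plain Scala collections. It is a local stand-in, not Spark code: it runs in one JVM on a Seq, and the object name WordCountSketch is hypothetical. The shape of the pipeline is the point. On a Spark RDD the same steps would typically be expressed with flatMap, map and reduceByKey, with the grouping done by a distributed shuffle.

```scala
// Local, in-memory stand-in for a distributed dataset: a plain Scala Seq.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // split every line into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map(word => (word, 1))               // "map" phase: emit (word, 1)
      .groupBy(_._1)                        // group the pairs by word
      .map { case (word, pairs) =>          // "reduce" phase: sum the counts
        (word, pairs.map(_._2).sum)
      }
}
```

Calling WordCountSketch.wordCount(Seq("big data", "Big pipeline")) counts "big" twice and the other two words once each.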
Web development with Play
• Modern high velocity, highly scalable
• Built on Akka and Netty (NIO)
• Reactive in design (reactive I/O)
• Async HTTP, streaming HTTP, Comet, WebSockets, build your own protocol
• Feature rich yet flexible
Build with SBT
• I hate writing XML
• Very easy to get started
• All the power of Scala in the build
• Maven dependency management + more
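For a sense of what "no XML" looks like, here is a minimal build.sbt sketch. The project name and every version number are assumptions chosen for illustration and are not taken from the talk. Only the setting keys and the %% / % dependency syntax are standard SBT.

```scala
// build.sbt -- a minimal sketch; name and versions are illustrative assumptions.
name := "pipeline-sketch"

scalaVersion := "2.10.4"

// Maven-style coordinates; %% appends the Scala binary version to the artifact name.
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor" % "2.3.6",
  "org.apache.spark"  %% "spark-core" % "1.1.0" % "provided",
  "org.scalatest"     %% "scalatest"  % "2.2.1" % "test"
)
```

Because the build definition is Scala, the dependency list is an ordinary Seq that can be computed, filtered or shared between modules.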
Testing with ScalaTest
• Write specs, not tests; an excellent tool for BDD
• Specs DSL very close to English
• Many testing styles
• Powerful matchers (“should be”)
• Mock objects with ScalaMock
Taking functional further
• Scrap your boilerplate
• Generic programming
• Existential types
• Bringing Haskell to Scala
• Monads, Functors and all the theory!
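As a taste of the "theory", here is a hand-rolled Functor type class in plain Scala. It is a simplified sketch and not the Scalaz or Shapeless API. Scalaz ships a much richer Functor. The names FunctorSketch and incrementAll are hypothetical. It shows the generic-programming idea in the list above: write a function once against an abstraction and reuse it for any container that has an instance.

```scala
// A Functor abstracts "a container that can be mapped over".
trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

// Instances live in the companion object so implicit search finds them.
object Functor {
  implicit val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }

  implicit val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }
}

object FunctorSketch {
  // Written once, works for every F that has a Functor instance.
  def incrementAll[F[_]](fa: F[Int])(implicit F: Functor[F]): F[Int] =
    F.map(fa)(_ + 1)
}
```

The same incrementAll call accepts a List[Int] or an Option[Int]. Note that the argument must be typed as Option (for example Option(41)), since no instance is defined for Some.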