Big Data Pipeline
@tuplejump
A data engineering startup, with
a vision to simplify data
engineering and empower the
next generation of data powered
miracles!
tuplejump
Who Am I?
• Founder and CEO @ Tuplejump
• Earlier worked at Pramati, Cordys and a couple of startups.
• A polyglot developer
• Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell
• Love data hacking in R and Python
• Java and Javascript fed me for a long long time
• Committed to Scala
• Believe in choosing the best tool for the task
• Open Source fanatic
@milliondreams | mytechrantings.blogspot.com
The big data pipeline
COLLECT TRANSFORM
PREDICT
STORE
EXPLORE VISUALIZE
The Tuplejump Platform
COLLECT TRANSFORM
PREDICT
STORE
EXPLORE VISUALIZE
Hydra
The tentacled framework to gather high volume and high velocity data
from push and pull sources, powered by Akka, reacting on demand to
events and streaming to Spark for batch processing.
COLLECT
Spark +
Calliope
Using the friendly Spark API with added features to easily consume
or load data from and to Cassandra-powered storage.
TRANSFORM
Cassandra++
Cassandra provides a single storage
mechanism for files, (un)structured
data and generic data.
STORE
MinerBot
Building on Spark's ML framework and
going towards machine-assisted
insights, we are building our own
EA and ANN/DL frameworks to take
ML to the next level.
PREDICT
Shark + Calliope
Ad hoc querying with Shark on your
data in Dstore.
UberCube
An OLAP cube engine
EXPLORE
Pissaro
A modern, game-changing data frontend that
is “not just dashboards”, providing highly
interactive and reactive visualization.
VISUALIZE
Advantages
• All the advantages of Spark + All the advantages of Cassandra + Much more!
• Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions
• Shark + C* provide superfast ad hoc querying.
• UberCube enables sub-millisecond responses on very large cubes
• MinerBot provides ready-to-use ML algorithms, plus the possibility of much more
complex algorithms and mechanisms than plain MapReduce.
• Ready to use, no integration required
• Easy to develop, deploy, monitor and scale
Why Scala?
• Object oriented and functional
• Runs on the JVM
• 100% compatible with Java
• Modern, evolving, scalable
• Concise, flexible and high performance
• Excellent support for DSL development
• Spark and Play use Scala as their primary language
• We have used it for a long time and we love it!
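As a minimal sketch of the "object oriented and functional" and "DSL development" points, the snippet below uses only the Scala standard library. The `Event` type, the `kb` helper and the sample data are invented for illustration and are not part of any Tuplejump API.

```scala
// Hypothetical domain type, invented for this sketch.
final case class Event(source: String, bytes: Long)

object WhyScalaSketch {
  // Object-oriented side: an implicit class adds a DSL-style method to Int,
  // so that `2.kb` reads almost like prose.
  implicit class SizeOps(private val n: Int) extends AnyVal {
    def kb: Long = n * 1024L
  }

  // Functional side: a pure function built from higher-order
  // collection operations, with no mutable state.
  def bytesBySource(events: Seq[Event]): Map[String, Long] =
    events.groupBy(_.source).map { case (src, es) => src -> es.map(_.bytes).sum }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event("web", 2.kb), Event("mobile", 1.kb), Event("web", 3.kb))
    println(bytesBySource(events))
  }
}
```

The case class gives equality, pattern matching and a readable `toString` for free, which is a large part of the "concise" claim.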
The ultimate gyan!
You can flirt with other languages,
you can have short affairs with a few,
You will fall in love with Scala at the first sight,
You have to marry her to know her!
Let’s call in some friends
• Akka - Actors to build concurrent and distributed applications
• Spark - The blue eyed whiz kid of the Big Data class
• Play - The web development champion
• SBT - The best builder in town
• ScalaTest - The story teller
• Shapeless and Scalaz - Masters of the Dark Arts
Concurrency With Akka
• Inspired by Erlang’s Actor Model
• Runs on the JVM
• Actors define behavior to handle typed messages
• Actors process one message at a time
• Can use Group/Pool of actors behind routers for concurrency
• Can run thousands of actors on a modern server
• Location transparency for clustering
• Supervision and state recovery for HA
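The "one message at a time" point can be sketched with the classic Akka actor API. This assumes `akka-actor` 2.4 or later on the classpath (for `ActorSystem.terminate()`); the `Counter` actor and its two messages are hypothetical names made up for the example.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Messages are plain immutable values (names invented for this sketch).
case object Increment
case object GetCount

// The actor owns its state. Because it handles one message at a time,
// `count` needs no locks or atomics.
class Counter extends Actor {
  private var count = 0
  def receive: Receive = {
    case Increment => count += 1
    case GetCount  => sender() ! count
  }
}

object AkkaSketch {
  def countTo(n: Int): Int = {
    val system = ActorSystem("sketch")
    try {
      val counter = system.actorOf(Props[Counter](), "counter")
      // Fire-and-forget sends; they queue up in the actor's mailbox.
      (1 to n).foreach(_ => counter ! Increment)
      // Ask for the result and block only here, at the edge of the program.
      implicit val timeout: Timeout = Timeout(3.seconds)
      Await.result((counter ? GetCount).mapTo[Int], timeout.duration)
    } finally {
      Await.result(system.terminate(), 10.seconds)
    }
  }

  def main(args: Array[String]): Unit =
    println(s"count = ${countTo(1000)}")
}
```

Putting several such actors behind a router, as the slide suggests, would spread messages across instances while each instance still processes its own mailbox sequentially.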
Batch Processing with Spark
• Resilient Distributed Datasets
• Fast in-memory big data
• Map/Reduce on steroids
• Iterative and interactive
• Code in Scala, Java, Python and now R
• Streaming (DStreams - Batch processing on streams)
• MLLib, Shark, Spark SQL, GraphX and more
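A minimal RDD sketch of the "Map/Reduce on steroids" idea, assuming `spark-core` 1.3 or later on the classpath and a local master (older releases also need `import org.apache.spark.SparkContext._` for `reduceByKey`). The object name and the input are illustrative only; Calliope's Cassandra integration is not shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    // "local[2]" runs Spark inside this JVM with two worker threads.
    val conf = new SparkConf().setAppName("sketch").setMaster("local[2]")
    val sc   = new SparkContext(conf)
    try {
      sc.parallelize(lines)            // build an RDD from a local collection
        .flatMap(_.split("\\s+"))      // map phase: one record per word
        .filter(_.nonEmpty)
        .map(word => (word, 1))
        .reduceByKey(_ + _)            // reduce phase: sum the counts per word
        .collect()                     // bring the (small) result to the driver
        .toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(wordCounts(Seq("to be or not to be")))
}
```

The whole job is ordinary Scala collection-style code; the same chain of transformations can be typed line by line into the Spark shell, which is where the "iterative and interactive" bullet comes from.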
Web development with Play
• Modern high velocity, highly scalable
• Built on Akka and Netty (NIO)
• Reactive in design (reactive I/O)
• Async HTTP, streaming HTTP, Comet, Websockets, build your own protocol
• Feature rich yet flexible
Build with SBT
• I hate writing XML
• Very easy to get started
• All the power of Scala in the build
• Maven dependency management + more
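For illustration, a `build.sbt` along these lines would pull in the libraries named in this deck. The project name and the version numbers are assumptions chosen for the sketch, not the versions used at Tuplejump; adjust them to whatever your project targets.

```scala
// build.sbt (a sketch; name and versions are illustrative assumptions)
name := "pipeline-sketch"

scalaVersion := "2.12.17"

// The build file is Scala, so ordinary values can be shared between settings.
lazy val akkaVersion = "2.6.20"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version to the artifact name
  "com.typesafe.akka" %% "akka-actor" % akkaVersion,
  "org.apache.spark"  %% "spark-core" % "3.3.2",
  "org.scalatest"     %% "scalatest"  % "3.2.15" % Test
)
```

Dependencies are resolved from Maven-style repositories, so no XML is involved even though the artifacts are the same ones a Maven build would use.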
Testing with ScalaTest
• Write specs, not tests; an excellent tool for BDD
• Specs DSL very close to English
• Many testing styles
• Powerful matchers (“should be”)
• Fixtures
• Mock objects with ScalaMock
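A small spec in the FlatSpec style shows how close the DSL is to English. This sketch assumes ScalaTest 3.1 or later, where the style traits live in `org.scalatest.flatspec` and `org.scalatest.matchers.should` (older releases use `org.scalatest.FlatSpec` and `org.scalatest.Matchers`). `WordCount` is a hypothetical function under test.

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Hypothetical code under test, invented for this sketch.
object WordCount {
  def apply(text: String): Map[String, Int] =
    text.split("\\s+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }
}

class WordCountSpec extends AnyFlatSpec with Matchers {
  // Each clause reads as a sentence: "WordCount should count repeated words".
  "WordCount" should "count repeated words" in {
    WordCount("to be or not to be") should be (Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1))
  }

  it should "return an empty map for blank input" in {
    WordCount("   ") shouldBe empty
  }
}
```

Running `sbt test` picks up the spec automatically and reports each sentence as a separate test case.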
Taking functional further
• Shapeless
• Scrap your boilerplate
• Generic programming
• Existential types
• Scalaz
• Bringing Haskell to Scala
• Monads, Functors and all the theory!
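To make "Functors" concrete without pulling in a library, here is a hand-rolled type class using only the standard library. It is a simplified stand-in for the idea, not Scalaz's actual `Functor` definition, and the `doubleAll` helper is invented for the example.

```scala
import scala.language.higherKinds

// A minimal Functor type class: anything that can map a function
// over its contents while keeping its shape.
trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

object Functor {
  // Instances live in the companion so the compiler finds them implicitly.
  implicit val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }
  implicit val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }
}

object FunctorSketch {
  // Written once, works for every container that has a Functor instance.
  def doubleAll[F[_]](fa: F[Int])(implicit F: Functor[F]): F[Int] =
    F.map(fa)(_ * 2)

  def main(args: Array[String]): Unit = {
    println(doubleAll(List(1, 2, 3)))
    println(doubleAll(Option(21)))
  }
}
```

Libraries such as Scalaz provide this abstraction ready-made, together with instances for the standard types and richer structures (Applicative, Monad) built on top of it.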
Thank you!
• http://www.tuplejump.com/
• http://github.com/tuplejump/
• http://tuplejump.github.com/calliope/
• http://tuplejump.github.com/stargate/
• @tuplejump on twitter

Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium 2014, ThoughtWorks
