Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium 2014, ThoughtWorks


At Tuplejump we have built a big data platform powered by Scala everywhere: Akka for message/event processing, Spark for streaming and batch processing, Shark for ad hoc querying, and Play to power our web-based IDE. This talk walks through the various components of the platform and how Scala, concepts like reactive programming and event-driven architecture, and ecosystem components like the Akka actor framework, Spark RDDs and the Play web framework guided and inspired our decisions, and also provided the kickstart to attempt the huge challenge of building a complete, integrated, end-to-end Big Data application framework.

Published in: Technology


  1. 1. Big Data Pipeline @tuplejump
  2. 2. A data engineering startup, with a vision to simplify data engineering and empower the next generation of data powered miracles! tuplejump
  3. 3. Who Am I? • Founder and CEO @ Tuplejump • Earlier worked at Pramati, Cordys and a couple of startups • A polyglot developer • Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell • Love data hacking in R and Python • Java and JavaScript fed me for a long, long time • Committed to Scala • Believe in choosing the best tool for the task • Open Source fanatic • @milliondreams
  5. 5. The Tuplejump Platform: COLLECT, TRANSFORM, PREDICT, STORE, EXPLORE, VISUALIZE
       • COLLECT: Hydra. The tentacled framework to gather high volume and velocity data from push and pull sources, powered by Akka, reacting on demand to events and streaming to Spark for batch processing.
       • TRANSFORM: Spark + Calliope. Using the friendly Spark API with added features to easily consume or load data from and to Cassandra-powered storage.
       • STORE: Cassandra++. Cassandra provides a single storage mechanism for files, (un)structured data and generic data.
       • PREDICT: MinerBot. Building on Spark's ML framework and going towards machine-assisted insights, we are building our own EA and ANN/DL frameworks to take ML to the next level.
       • EXPLORE: Shark + Calliope. Ad hoc querying with Shark on your data in Dstore. UberCube. An OLAP cube engine.
       • VISUALIZE: Pissaro. A modern, game-changing data frontend, which is “not just dashboards”, providing a highly interactive and reactive visualization frontend.
  6. 6. Advantages • All the advantages of Spark + All the advantages of Cassandra + Much more! • Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions • Shark + C* provide for superfast ad hoc querying. • UberCube empowers sub-millisecond responses on very large cubes • MinerBot provides ready to use ML Algos, plus a possibility of much more complex algos and mechanisms than just map reduce. • Ready to use, no integration required • Easy to develop, deploy, monitor and scale
  7. 7. Why Scala? • Object oriented and functional • Runs on the JVM • 100% compatible with Java • Modern, evolving, scalable • Concise, flexible and high performance • Excellent support for DSL development • Spark and Play use Scala as their primary language • We have used it for a long time and we love it!
  8. 8. The ultimate gyan! You can flirt with other languages, you can have short affairs with few, You will fall in love with Scala at the first sight, You have to marry her to know her!
  9. 9. Let’s call in some friends • Akka - Actors to build concurrent and distributed applications • Spark - The blue eyed whiz kid of the Big Data class • Play - The web development champion • SBT - The best builder in town • ScalaTest - The story teller • Shapeless and Scalaz - Masters of the Dark Arts
  10. 10. Concurrency With Akka • Inspired by Erlang’s Actor Model • Runs on the JVM • Actors define behavior to handle typed messages • Actors process one message at a time • Can use a Group/Pool of actors behind routers for concurrency • Can run thousands of actors on a modern server • Location transparency for clustering • Supervision and state recovery for HA
  11. 11. Batch Processing with Spark • Resilient Distributed Datasets • Fast in-memory big data • Map/Reduce on steroids • Iterative and interactive • Code in Scala, Java, Python and now R • Streaming (DStreams - Batch processing on streams) • MLlib, Shark, Spark SQL, GraphX and more
  12. 12. Web development with Play • Modern, high velocity, highly scalable • Built on Akka and Netty (NIO) • Reactive in design (reactive I/O) • Async HTTP, streaming HTTP, Comet, WebSockets, build your own protocol • Feature rich yet flexible
  13. 13. Build with SBT • I hate writing XML • Very easy to get started • All the power of Scala in the build • Maven dependency management + more
  14. 14. Testing with ScalaTest • Write specs, not tests; an excellent tool for BDD • Specs DSL very close to English • Many testing styles • Powerful matchers (“should be”) • Fixtures • Mock objects with ScalaMock
  15. 15. Taking functional further • Shapeless: scrap your boilerplate, generic programming, existential types • Scalaz: bringing Haskell to Scala; monads, functors and all the theory!
  16. 16. Thank you! • @tuplejump on Twitter
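
To make the "Map/Reduce" style from the Spark slide concrete, here is a minimal word-count sketch that uses only Scala's standard collections. It is not code from the talk and it does not use Spark; the object name `WordCountSketch` and the sample input are made up for illustration. Spark's RDD API offers similarly named operators (for example `flatMap`, `map` and `reduceByKey`), so a real job would have a comparable shape, but that mapping is an assumption about how one might port it, not something the slides show.

```scala
// Illustrative only: word count with plain Scala collections (no Spark dependency).
// WordCountSketch is a hypothetical name, not part of the Tuplejump platform.
object WordCountSketch {

  // Split each line into lowercase words, then count occurrences per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))  // "map" phase: lines -> words
      .filter(_.nonEmpty)                     // drop empty tokens from blank lines
      .groupBy(identity)                      // group identical words together
      .map { case (word, occurrences) =>      // "reduce" phase: count each group
        word -> occurrences.size
      }

  def main(args: Array[String]): Unit = {
    val sample = Seq("spark and akka", "akka and play", "spark")
    // Print the counts in alphabetical order of the words.
    wordCount(sample).toSeq.sortBy(_._1).foreach { case (word, n) =>
      println(s"$word: $n")
    }
  }
}
```

The whole pipeline is a single expression over immutable collections, which is the kind of concise, functional style the "Why Scala?" slide points to.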