Sa introduction to big data pipelining with cassandra & spark west minster meetup - black-2015 0.11-2


Slides from my presentation at BigDataWeek London and Cassandra London Meetup 27th November 2015

Published in: Technology

  1. 1. Introduction To Big Data Pipelining with Docker, Cassandra, Spark, Spark-Notebook & Akka
  2. 2. Simon Ambridge (@stratman1958), Pre-Sales Solution Engineer, DataStax UK. Apache Cassandra and DataStax enthusiast who enjoys explaining to customers that the traditional approaches to data management just don’t cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world. Previously spent 25 years designing, building, implementing and supporting complex data management solutions with traditional RDBMS technology, including Oracle Hyperion and E-Business Suite deployments at clients such as the Financial Services Authority, Olympic Delivery Authority, BT, RBS, Virgin Entertainment, HP, Sun and Oracle. Oracle certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE, and has worked extensively with Oracle Hyperion, Oracle E-Business Suite, Oracle Virtual Machine and Oracle Exalytics.
  3. 3. Big Data Pipelining: Outline •  1-hour introduction to Big Data Pipelining and a working sandbox •  Presented at a half-day workshop at Devoxx, November 2015 •  Uses the Data Pipeline environment from Data Fellas •  Contributors from Typesafe, Mesos and DataStax •  Demonstrates how to use scalable, distributed technologies: Docker, Spark, Spark-Notebook, Cassandra •  Objective is to introduce the demo environment •  Key takeaway: understanding how to build a reactive, repeatable Big Data pipeline
  4. 4. Big Data Pipelining: Devoxx & Data Fellas. Andy Petrella: •  Co-founder of Data Fellas •  Certified Scala/Spark trainer and author of the Learning Play! Framework 2 book •  Creator of Spark-Notebook, one of the top projects on GitHub related to Apache Spark and Scala. Xavier Tordoir: •  Co-founder of Data Fellas •  Ph.D. in experimental atomic physics •  Specialist in prediction of biological molecular structures and interactions, and applied Machine Learning methodologies. Iulian Dragos: •  Key member of Martin Odersky’s Scala team at Typesafe •  For the last six years, the main contributor to many critical Scala components, including the compiler backend, its optimizer and the Eclipse build manager. Simon Ambridge: •  DataStax Solutions Engineer •  Prior to DataStax, extensive experience with traditional RDBMS technologies at Oracle, Sun, Compaq, DEC etc.
  5. 5. Big Data Pipelining: Legacy •  Sampling and analysis often run on a single machine •  CPU and memory limitations •  Data size frequently forces limited sampling •  Multiple iterations over large datasets [Diagram labels: Sampling, Data, Modeling, Tuning, Report, Interpret; repeated iterations]
  6. 6. Big Data Pipelining: Big Data Problems •  Data is getting bigger or, more accurately, the number of available data sources is exploding •  Sampling the data is becoming more difficult •  Analyses become obsolete faster •  Analysis becomes too slow to get any ROI from the data
  7. 7. Big Data Pipelining: Big Data Needs •  Scalable infrastructure + distributed technologies •  Allow data volumes to be scaled •  Faster processing •  More complex processing •  Constant data flow •  Visible, reproducible analysis •  For example, SHAR3 from Data Fellas
  8. 8. Big Data Pipelining: Pipeline Flow [Diagram; the only surviving label is ADAM]
  9. 9. Intro To Docker: Quick History What is Docker? •  Open source project started in 2013 •  Easy to build, deploy, copy containers •  Great for packaging and deploying applications •  Similar resource isolation to VMs, but different architecture •  Lightweight •  Containers share the OS kernel •  Fast start •  Layered filesystems – share underlying OS files, directories “Each virtual machine includes the application, the necessary binaries and libraries and an entire guest operating system - all of which may be tens of GBs in size.” “Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system. They’re also not tied to any specific infrastructure – Docker containers run on any computer, on any infrastructure and in any cloud.”
  10. 10. Intro To ADAM: Quick History What is ADAM? •  Started at UC Berkeley in 2012 •  Open-source library for bioinformatics analysis, written for Spark •  Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics methods •  A set of formats, APIs, and processing stage implementations for genomic data •  Fully open source under the Apache 2 license •  Implemented on top of Avro and Parquet for data storage •  Compatible with Spark up to 1.5.1
  11. 11. Intro To Spark: Quick History What is Apache Spark? •  Started at UC Berkeley in 2009 •  Open sourced in 2010; Apache project since 2013 (top-level since 2014) •  Fast: up to 10x-100x faster than Hadoop MapReduce, depending on the workload •  Distributed in-memory processing •  Rich Scala, Java and Python APIs •  2x-5x less code than R •  Batch and streaming analytics •  Interactive shell (REPL)
  12. 12. Intro To Spark-Notebook: Quick History What is Spark-Notebook? •  Drive your data analysis from the browser •  Can be deployed on a single host or large cluster e.g. Mesos, ec2, GCE etc. •  Features tight integration with Apache Spark and offers handy tools to analysts: •  Reproducible visual analysis •  Charting •  Widgets •  Dynamic forms •  SQL support •  Extensible with custom libraries
  13. 13. Intro To Parquet: Quick History What is Parquet? •  Started at Twitter and Cloudera in 2013 •  Databases traditionally store information in rows and are optimized for working with one record at a time •  Columnar storage systems are optimized to store data by column •  Netflix is a big user: 7 PB of warehoused data in Parquet format •  A compressed, efficient columnar data representation •  Allows complex data to be encoded efficiently •  Compression schemes can be specified on a per-column level •  Not as compressed as ORC (Hortonworks) but faster read/analysis
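To make the row-versus-column distinction on this slide concrete, here is a minimal sketch in plain Python. It does not use Parquet or any Parquet library: the `rows` records, the field names and the helpers `to_columns` and `run_length_encode` are all made up for illustration, and run-length encoding is only one example of a per-column scheme.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# This is NOT the Parquet format or API; all names are hypothetical.

rows = [
    {"user": "alice", "country": "UK", "clicks": 12},
    {"user": "bob", "country": "UK", "clicks": 7},
    {"user": "carol", "country": "BE", "clicks": 30},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one field touches a single column, not every record.
total_clicks = sum(columns["clicks"])

# Values within a column share a type and often repeat, so a simple
# per-column scheme such as run-length encoding can shrink them.
encoded_country = run_length_encode(columns["country"])
```

The point of the sketch is only the access pattern: summing `clicks` never reads `user` or `country`, and each column can be encoded with whatever scheme suits its values.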
  14. 14. Intro To Cassandra: Quick History What is Apache Cassandra? •  Originally started at Facebook in 2008 •  Top level Apache project since 2010 •  Open source distributed database •  Handles large amounts of data •  At high velocity •  Across multiple data centres •  No single point of failure •  Continuous Availability •  Disaster avoidance •  Enterprise Cassandra from Datastax
  15. 15. Intro To Akka: Quick History What is Akka? •  Open source toolkit first released in 2009 •  Simplifies the construction of concurrent and distributed applications on the JVM •  Primarily designed for actor-based concurrency •  Akka enforces parental supervision •  Actors are arranged hierarchically •  Each actor is created and supervised by its parent actor •  Program failures are treated as events handled by an actor's supervisor •  Message-based and asynchronous; typically no mutable data is shared •  Language bindings exist for both Java and Scala
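The supervision idea on this slide can be sketched in a few lines of plain Python. This is a toy model, not the Akka API: `Worker`, `Supervisor` and `tell` are hypothetical names, message delivery here is synchronous rather than asynchronous, and "replace the failed child" is just one possible supervision strategy chosen for the example.

```python
# Toy model of parent supervision. NOT the Akka API; all names are hypothetical.


class Worker:
    """A child 'actor' that fails on messages it cannot handle."""

    def __init__(self):
        self.processed = []

    def receive(self, message):
        if not isinstance(message, int):
            raise ValueError("cannot handle %r" % (message,))
        self.processed.append(message * 2)


class Supervisor:
    """A parent that creates its child and decides what to do when it fails."""

    def __init__(self):
        self.child = Worker()
        self.restarts = 0

    def tell(self, message):
        try:
            self.child.receive(message)
        except ValueError:
            # The failure is treated as an event for the supervisor:
            # here the strategy is to replace the child with a fresh one.
            self.child = Worker()
            self.restarts += 1


supervisor = Supervisor()
for message in [1, 2, "oops", 3]:
    supervisor.tell(message)
```

The sender never sees the exception; the parent absorbs it, restarts the child, and later messages are handled by the fresh instance.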
  16. 16. Spark: RDD What Is A Resilient Distributed Dataset? •  RDD: a distributed memory abstraction for parallel in-memory computations •  An RDD represents a dataset consisting of objects and records, such as Scala, Java or Python objects •  An RDD is distributed across nodes in the Spark cluster •  Nodes hold partitions and partitions hold records •  An RDD is read-only (immutable) •  An RDD can be transformed into a new RDD •  Operations: •  Transformations (e.g. map, filter, groupBy) •  Actions (e.g. count, collect, save)
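The split between transformations and actions can be illustrated with a small stand-in written in plain Python. This is a sketch under stated assumptions, not Spark: `ToyRDD` is a hypothetical single-machine class with no partitions or cluster, and it only mimics the two behaviours named on the slide, namely that a transformation returns a new immutable dataset without computing anything, and that an action triggers the computation.

```python
# Toy model of lazy transformations and eager actions.
# This is NOT the Spark API; ToyRDD and its methods are hypothetical.


class ToyRDD:
    """An immutable dataset that records transformations until an action runs."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)   # never modified after creation
        self._steps = tuple(steps)     # pending transformations, in order

    # Transformations: return a NEW ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _evaluate(self):
        """Apply every pending step to the source data, lazily."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result to the caller.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


numbers = ToyRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = even_squares.collect()
how_many = even_squares.count()
original_size = numbers.count()   # the parent dataset is unchanged
```

Nothing is computed when `even_squares` is defined; the work happens inside `collect` and `count`, and `numbers` still holds its original ten elements afterwards.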
  17. 17. Spark: DataFrames What Is A DataFrame? •  Inspired by data frames in R and Python •  Data is organized into named columns •  Conceptually equivalent to a table in a relational database •  Can be constructed from a wide array of sources •  structured data files - JSON, Parquet •  tables in Hive •  relational database systems via JDBC •  existing RDDs •  Can be extended to support any third-party data formats or sources •  Existing third-party extensions already include Avro, CSV, ElasticSearch, and Cassandra •  Enables applications to easily combine data from disparate sources
  18. 18. Spark & Cassandra: How? How Does Spark Access Cassandra? •  DataStax Cassandra Spark driver – open source! •  Open source: • •  Compatible with •  Spark 0.9+ •  Cassandra 2.0+ •  DataStax Enterprise 4.5+ •  Scala 2.10 and 2.11 •  Java and Python •  Expose Cassandra tables as Spark RDDs •  Execute arbitrary CQL queries in Spark applications •  Saves RDDs back to Cassandra via saveToCassandra call
  19. 19. Spark: How Do You Access RDDs? Create A ‘Spark Context’ •  To create an RDD you need a Spark Context object •  A Spark Context represents a connection to a Spark Cluster •  In the Spark shell the sc object is created automatically •  In a standalone application a Spark Context must be constructed
  20. 20. Spark: Architecture Spark Architecture •  Master-worker architecture •  One master •  Spark Workers run on all nodes •  Executors belonging to different clients/SCs are isolated •  Executors belonging to the same client/SCs can communicate •  Client jobs are divided into tasks, executed by multiple threads •  First Spark node promoted as Spark Master •  Master HA feature available in DataStax Enterprise •  Standby Master promoted on failure •  Workers are resilient by default
  21. 21. Open Source: Analytics Integration •  Apache Spark for Real-Time Analytics •  Analytics nodes separate from data nodes •  ETL required •  Loose integration •  Data separate from processing •  Millisecond response times [Diagram labels: Cassandra Cluster, ETL, Spark Cluster, Solr Cluster, ES Cluster; 10 core, 16GB minimum]
  22. 22. DataStax Enterprise: Analytics Integration •  Integrated Apache Spark for Real-Time Analytics •  Integrated Apache Solr for Enterprise Search •  Search and analytics nodes close to data •  No ETL required •  Tight integration •  Data locality •  Microsecond response times [Diagram labels: Cassandra Cluster; Spark, Solr Cluster; ETL; Spark Cluster; Solr Cluster; ES Cluster; two X marks; 12+ core, 32GB+]
  23. 23. Big Data Pipelining: Demo Build & Run Steps 1.  Provision a 64-bit Linux environment 2.  Pre-requisites (5 mins) 3.  Install Docker (5 mins) 4.  Clone the Pipeline Repo from GitHub (2 mins) 5.  Pull the Docker image from Docker Hub (20 mins) 6.  Run the image as a container (5 mins) 7.  Run the demo setup script - inside the container (2 mins) 8.  Run the demo from a browser - on the host (30 mins)
  24. 24. Big Data Pipelining: Demo Steps 1.  Provision a host Required machine spec: 3 cores, 5GB •  Linux machine •  Create a VM (e.g. Ubuntu)
  25. 25. Big Data Pipelining: Demo Steps 2.  Pre-requisites •  Updates to apt-get sources and gpg key •  Check kernel version
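The kernel check in this step can be scripted. The sketch below is one way to do it in Python, assuming a minimum of Linux 3.10, which is what Docker's installation documentation of that period asked for; the slide itself does not state a version, and the helper names are hypothetical. Comparing `(major, minor)` tuples avoids the trap of string comparison, where "3.2" would sort after "3.10".

```python
import platform
import re

# Assumed minimum: Docker's installation docs of that period asked for 3.10.
MINIMUM_KERNEL = (3, 10)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '3.13.0-68-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_is_supported(release, minimum=MINIMUM_KERNEL):
    """Return True when the release is at least the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()   # same string that `uname -r` prints
    print(release, "supported" if kernel_is_supported(release) else "too old")
```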
  26. 26. Big Data Pipelining: Demo Steps 3.  Install Docker
       $ sudo apt-get update
       $ sudo apt-get install docker
       $ sudo usermod -aG docker <myuserid>
       (log out and back in)
       $ docker run hello-world
  27. 27. Big Data Pipelining: Demo Steps 4.  Clone the Pipeline repo
       $ mkdir ~/pipeline
       $ cd ~/pipeline
       $ git clone …-freaks/pipeline.git
  28. 28. Big Data Pipelining: Demo Steps 5.  Pull the Pipeline image
       $ docker pull xtordoir/pipeline
  29. 29. Big Data Pipelining: Demo Steps 6.  Run the Pipeline image as a container
       $ docker run -it -m 8g -p 30080:80 \
           -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
           -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
           -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
           -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
           -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
           -p 37979:7979 -p 38989:8989 xtordoir/pipeline bash
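Each `-p HOST:CONTAINER` flag in the command above publishes a container port on the host, and a range on one side must be matched by a range of the same length on the other. The following Python sketch reads a few of those flags into a host-to-container lookup. It only inspects text and does not call Docker; `expand_port_spec` and `parse_publish_flag` are hypothetical helpers written for this example.

```python
# Hypothetical helper for reading `-p HOST:CONTAINER` publish flags.
# It only inspects text; it does not call Docker.


def expand_port_spec(spec):
    """Expand '8080' or '4040-4045' into a list of port numbers."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    return list(range(start, end + 1))


def parse_publish_flag(value):
    """Map each host port to its container port for one 'HOST:CONTAINER' value."""
    host_spec, container_spec = value.split(":")
    host_ports = expand_port_spec(host_spec)
    container_ports = expand_port_spec(container_spec)
    if len(host_ports) != len(container_ports):
        raise ValueError("port ranges differ in length: %s" % value)
    return dict(zip(host_ports, container_ports))


# A few of the mappings from the command above.
published = {}
for flag in ["30080:80", "34040-34045:4040-4045", "9042:9042", "39000:9000"]:
    published.update(parse_publish_flag(flag))

# Host port 39000 reaches container port 9000, which is why the demo URL
# on the host uses localhost:39000.
demo_container_port = published[39000]
```

Most host ports in the command are offset into the 3xxxx range, presumably to avoid clashing with services already running on the host; 9160 and 9042 are published unchanged.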
  30. 30. Big Data Pipelining: Demo Steps 7.  Run the demo setup script in the container
       $ cd pipeline
       $ source devoxx-…    # ignore Cassandra errors
       Run cqlsh
  31. 31. Big Data Pipelining: Demo Steps 8.  Run the demo in the host browser http://localhost:39000/tree/pipeline
  32. 32. Thank you!
  33. 33. Big Data Pipelining: Appendix RDD/Cassandra Reference
  34. 34. Spark: RDD How Do You Create An RDD? 1.  From an existing collection: ‘action’
  35. 35. Spark: RDD How Do You Create An RDD? 2.  From a text file: ‘action’
  36. 36. Spark: RDD How Do You Create An RDD? 3.  From data in a Cassandra database: ‘action’
  37. 37. Spark: RDD How Do You Create An RDD? 4.  From an existing RDD: ‘action’ ‘transformation’
  38. 38. Spark: RDDs & Cassandra Accessing Data As An RDD ‘action’ RDD method
  39. 39. Spark: Filtering Data In Cassandra Server-side Selection •  Reduce the amount of data transferred •  Selecting rows (by clustering columns and/or secondary indexes)
  40. 40. Spark: Saving Data In Cassandra Saving Data •  saveToCassandra
  41. 41. Spark: Using SparkSQL & Cassandra You Can Also Access Cassandra Via SparkSQL! •  Spark Conf object can be used to create a Cassandra-aware Spark SQL context object •  Use regular CQL syntax •  Cross table operations - joins, unions etc!
  42. 42. Spark: Streaming Data Spark Streaming •  High velocity data – IoT, sensors, Twitter etc •  Micro batching •  Each batch represented as RDD •  Fault tolerant •  Exactly-once processing •  Represents a unified stream and batch processing framework
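Micro-batching, as listed on this slide, means cutting a continuous stream into small fixed-width time windows and running ordinary batch logic on each one. The sketch below shows the idea in plain Python. It is not the Spark Streaming API: the `(timestamp, value)` event format, the `micro_batches` function and the sample readings are all hypothetical, and the sketch ignores fault tolerance and delivery guarantees entirely.

```python
# Toy illustration of micro-batching. NOT the Spark Streaming API;
# the event format and function names are hypothetical.
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Returns a list of (batch_start, [values]) pairs ordered by time.
    Intervals that received no events are omitted in this sketch.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        buckets[batch_start].append(value)
    return sorted(buckets.items())


# Sensor readings as (seconds since start, temperature).
readings = [(0.5, 20.1), (1.2, 20.3), (2.7, 21.0), (3.1, 21.4), (5.9, 22.0)]

# Each 2-second batch is then processed with ordinary batch logic,
# here a per-batch average.
averages = [
    (start, round(sum(values) / len(values), 2))
    for start, values in micro_batches(readings, batch_seconds=2)
]
```

Because every batch is just a small collection, the same per-batch code could equally be applied to a historical dataset, which is the sense in which stream and batch processing are unified.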
  43. 43. Spark: Streaming Data Into Cassandra Streaming Example