Building Scalable Data Pipelines - 2016 DataPalooza Seattle

An overview of building scalable data pipelines and specific technologies including Apache #Kafka, #Spark, Spark Streaming, #Cassandra, and #FiloDB.

  1. 1. Building Scalable Data Pipelines Evan Chan
  2. 2. Who am I Distinguished Engineer, Tuplejump @evanfchan User and contributor to Spark since 0.9 Co-creator and maintainer of Spark Job Server
  3. 3. Tuplejump - Big Data Dev Partners
  4. 4. Instant Gratification I want insights now I want to act on news right away I want stuff personalized for me (?)
  5. 5. Fast Data, not Big Data
  6. 6. How Fast do you Need to Act? Financial trading - milliseconds Dashboards - seconds to minutes BI / Reports - hours to days?
  7. 7. What’s Your App? Concurrent video viewers Anomaly detection Clickstream analysis Live geospatial maps Real-time trend detection & learning
  8. 8. Common Components Message Queue Events Stream Processing Layer State / Database Happy Users
  9. 9. Example: Real-time trend detection Events: time, OS, location, asset/product ID Analyze 1-5 second batches of new “hot” data in stream processor Combine with recent and historical top K feature vectors in database Update database recent feature vectors Serve to users
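
The trend-detection flow on this slide can be sketched in plain Scala. This is only an illustration under stated assumptions: simple per-asset counts stand in for the "feature vectors", an in-memory `Map` stands in for the database, and the names `TrendSketch`, `Event`, `countBatch`, `merge`, and `topK` are hypothetical.

```scala
object TrendSketch {
  // One incoming event; the field names are illustrative.
  final case class Event(timeSec: Long, os: String, location: String, assetId: String)

  // Count events per asset within one micro-batch of "hot" data.
  def countBatch(batch: Seq[Event]): Map[String, Long] =
    batch.groupBy(_.assetId).map { case (id, evs) => id -> evs.size.toLong }

  // Merge the batch counts into the recent counts (the Map stands in for the database).
  def merge(recent: Map[String, Long], batch: Map[String, Long]): Map[String, Long] =
    batch.foldLeft(recent) { case (acc, (id, n)) => acc.updated(id, acc.getOrElse(id, 0L) + n) }

  // Top K assets by count; ties are broken by asset ID so the result is deterministic.
  def topK(counts: Map[String, Long], k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (id, n) => (-n, id) }.take(k)
}
```

A stream processor would call `countBatch` on each 1-5 second batch, `merge` the result into the stored counts, and serve `topK` to users.
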
  10. 10. Example 2: Smart Cities
  11. 11. Smart City Streaming Data City buses - regular telemetry (position, velocity, timestamp) Street sweepers - regular telemetry Transactions from rail, subway, buses, smart cards 311 info 911 info - new emergencies
  12. 12. Citizens want to know… Where and for how long can I park my car? Are transportation options affected by 311 and 911 events? How long will it take the next bus to get here? Where is the closest bus to where I am?
  13. 13. Cities want to know… How can I maximize parking revenue? More granular updates to parking spots that don't need sweeping How does traffic affect waiting times in public transit, and revenue? Patterns in subway train times - is a breakdown coming? Population movement - where should new transit routes be placed?
  14. 14. Message Queue Stream Processing Layer Event storage Ad-Hoc 311 911 Buses Metro Short term telemetry Models Dashboard
  15. 15. The HARD Principle Highly Available, Resilient, Distributed Flexibility - do as many transformations as possible with as few components as possible Real-time: “NoETL” Community: best of breed OSS projects with huge adoption and commercial support
  16. 16. Message Queue
  17. 17. Message Queue Events Stream Processing Layer State / Database Happy Users
  18. 18. Why a message queue? Centralized publish-subscribe of events Need more processing? Add another consumer Buffer traffic spikes Replay events in cases of failure
  19. 19. Message Queues help distribute data A-F G-M N-S T-Z Input 1 Input 2 Input 3 Input 4 Processing Processing Processing Processing
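
The key ranges in the diagram (A-F, G-M, N-S, T-Z) suggest range partitioning by key. A minimal sketch of that idea follows; the object and method names are hypothetical, and sending non-alphabetic keys to the last partition is an arbitrary choice made for the example. Kafka's default partitioner hashes the key rather than using ranges, so this shows the slide's picture, not Kafka's behaviour.

```scala
object RangePartitionSketch {
  // Inclusive upper-bound letter of each partition: A-F, G-M, N-S, T-Z.
  val upperBounds: Vector[Char] = Vector('F', 'M', 'S', 'Z')

  // Route a key to a partition index by its first letter.
  // Keys that do not start with A-Z (or are empty) go to the last partition.
  def partitionFor(key: String): Int = {
    val first = key.headOption.map(_.toUpper).getOrElse('Z')
    val idx = upperBounds.indexWhere(first <= _)
    if (idx >= 0 && first >= 'A') idx else upperBounds.length - 1
  }

  // Split keyed messages so that each processor only sees its own range.
  def distribute(messages: Seq[(String, String)]): Map[Int, Seq[(String, String)]] =
    messages.groupBy { case (key, _) => partitionFor(key) }
}
```
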
  20. 20. Intro to Apache Kafka Kafka is a distributed publish-subscribe system. It uses a commit log to track changes. Kafka was originally created at LinkedIn, open sourced in 2011, and graduated to a top-level Apache project in 2012.
  21. 21. On being HARD Many Big Data projects are open source implementations of closed source products Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product The same codebase being used for years at LinkedIn answers the questions: Does it scale? Is it robust?
  22. 22. Ad Hoc ETL
  23. 23. Decoupled ETL
  24. 24. Avro Schemas And Schema Registry Keys and values in Kafka are stored as byte arrays; serializers map Strings, Avro records, and other types onto them. Avro is a serialization format used extensively with Kafka and Big Data. Schema Registry, a separate service from Confluent that is commonly run alongside Kafka, keeps track of Avro schemas and verifies that the correct schemas are being used.
  25. 25. Consumer Groups
  26. 26. Commit Logs
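
The commit log and consumer group ideas, and the earlier point about replaying events after a failure, can be illustrated with a small in-memory model. This is a sketch, not Kafka's implementation: `Log`, `Offsets`, and `poll` are hypothetical names, there is a single partition, and nothing is persisted.

```scala
import scala.collection.mutable

object CommitLogSketch {
  // An append-only log: records are only ever added at the end.
  final class Log[A] {
    private val records = mutable.ArrayBuffer.empty[A]
    // Appends a record and returns its offset.
    def append(record: A): Long = { records += record; records.length - 1L }
    def read(from: Long, max: Int): Seq[A] = records.slice(from.toInt, from.toInt + max).toSeq
  }

  // Each consumer group tracks its own committed offset, so groups read independently.
  final class Offsets {
    private val committed = mutable.Map.empty[String, Long].withDefaultValue(0L)
    def position(group: String): Long = committed(group)
    def commit(group: String, next: Long): Unit = committed(group) = next
  }

  // Read a batch for a group and commit the new offset only after processing succeeds.
  def poll[A](log: Log[A], offsets: Offsets, group: String, max: Int)(process: Seq[A] => Unit): Unit = {
    val from = offsets.position(group)
    val batch = log.read(from, max)
    process(batch) // if this throws, the offset is not advanced and the batch is replayed
    offsets.commit(group, from + batch.length)
  }
}
```

Because the log itself is never changed by reads, adding another consumer group is just another entry in `Offsets`.
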
  27. 27. Kafka Resources Official docs - https:// documentation.html The Design section is a really good read. Includes schema registry
  28. 28. Stream Processing
  29. 29. Message Queue Events Stream Processing Layer State / Database Happy Users
  30. 30. Types of Stream Processors Event by Event: Apache Storm, Apache Flink, Intel Gearpump, Akka. Micro-batch: Apache Spark. Hybrid? Google Dataflow
  31. 31. Apache Storm and Flink Transform one message at a time Very low latency State and more complex analytics difficult
  32. 32. Akka and Gearpump Actor to actor messaging. Local state. Used for extreme low latency (ad networks, etc) Dynamically reconfigurable topology Configurable fault tolerance and failure recovery Cluster or local mode - you don’t always need distribution!
  33. 33. Spark Streaming Data processed as stream of micro batches Higher latency (seconds), higher throughput, more complex analysis / ML possible Same programming model as batch
  34. 34. Why Spark?

        // Word count in Spark (Scala); `spark` is a SparkContext
        val file = spark.textFile("hdfs://...")
        file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // The same word count in Hadoop MapReduce (Java)
        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          // Emits (word, 1) for every token of every input line.
          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          // Sums the counts emitted for each word.
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          // args[0] is the input path, args[1] the output path.
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "wordcount");
            job.setJarByClass(WordCount.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }
  35. 35. Spark Production Deployments
  36. 36. Explosion of Specialized Systems
  37. 37. Spark and Berkeley AMP Lab
  38. 38. Benefits of Unified Libraries Optimizations can be shared between libraries Core Project Tungsten MLlib Shared statistics libraries Spark Streaming GC and memory management
  39. 39. Mix and match modules Easily go from DataFrames (SQL) to MLlib / statistics, for example:

        scala> import org.apache.spark.mllib.stat.Statistics
        scala> // df (name assumed here) is the DataFrame holding both columns
        scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
        numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
        scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
        numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
        scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
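
For readers unfamiliar with the `"pearson"` method, here is a plain-Scala sketch of the Pearson correlation coefficient on two local sequences. It is not how MLlib computes it over distributed RDDs; `PearsonSketch` and `pearson` are hypothetical names used only for illustration.

```scala
object PearsonSketch {
  // Pearson correlation: covariance of x and y divided by the product of
  // their standard deviations (the 1/n factors cancel out).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length, "inputs must be non-empty and of equal length")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises. If either input is constant, the denominator is zero and the result is NaN.
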
  40. 40. Spark Worker Failure Rebuild RDD Partitions on Worker from Lineage
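
Rebuilding a lost partition from lineage can be sketched without Spark. In this toy model a dataset is described by how to compute each partition, so a partition lost with its worker can be recomputed from the source. All names (`Lineage`, `Source`, `Mapped`, `Cached`) are hypothetical and only narrow, one-parent transformations are modelled.

```scala
object LineageSketch {
  // A dataset is described by how to compute each partition, not by its contents.
  sealed trait Lineage[A] { def compute(partition: Int): Seq[A] }

  // Leaf: reads a partition from durable source data.
  final case class Source[A](partitions: Vector[Seq[A]]) extends Lineage[A] {
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  // Narrow transformation: each output partition depends on one parent partition.
  final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }

  // Cached partitions that can be lost when a worker dies.
  final class Cached[A](lineage: Lineage[A]) {
    private val cache = scala.collection.mutable.Map.empty[Int, Seq[A]]
    // Returns the cached partition, recomputing it from lineage if it is missing.
    def get(partition: Int): Seq[A] = cache.getOrElseUpdate(partition, lineage.compute(partition))
    def loseWorker(partition: Int): Unit = { cache.remove(partition); () }
    def isCached(partition: Int): Boolean = cache.contains(partition)
  }
}
```

Only the lost partition is recomputed; partitions cached on healthy workers are left alone.
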
  41. 41. Spark SQL & DataFrames
  42. 42. DataFrames & Catalyst Optimizer
  43. 43. Catalyst Optimizations Column and partition pruning (Column filters) Predicate pushdowns (Row filters)
  44. 44. Spark SQL Data Sources API Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst Production Impls: spark-csv (Databricks), spark-avro (Databricks), spark-cassandra-connector (DataStax), elasticsearch-hadoop (Elastic)
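
Column pruning and predicate pushdown both mean that the data source is told which columns and rows the query needs, so it can avoid reading the rest. The toy columnar table below illustrates the effect by counting cell reads. It is a sketch under simplifying assumptions (one filter on one column, in-memory data, hypothetical names `Table` and `scan`), not Catalyst or a real Data Sources API implementation.

```scala
object PushdownSketch {
  type Row = Map[String, Any]

  // A toy columnar table: each column is stored separately.
  final class Table(columns: Map[String, Vector[Any]]) {
    private val rowCount = columns.values.headOption.map(_.length).getOrElse(0)
    var cellsRead = 0 // counts cell reads to make the savings visible

    private def cell(col: String, i: Int): Any = { cellsRead += 1; columns(col)(i) }

    // The source receives the required columns and a row filter on one column,
    // so it can skip unneeded columns and non-matching rows while scanning.
    def scan(required: Seq[String], filterCol: String, keep: Any => Boolean): Seq[Row] =
      (0 until rowCount)
        .filter(i => keep(cell(filterCol, i)))              // predicate pushdown (row filter)
        .map(i => required.map(c => c -> cell(c, i)).toMap) // column pruning (column filter)
  }
}
```

In Spark itself, a custom source can opt into this by implementing the `PrunedFilteredScan` trait, whose `buildScan` method receives the required columns and the filters.
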
  45. 45. Spark Streaming
  46. 46. Streaming Sources Basic: Files, Akka actors, queues of RDDs, Socket Advanced: Kafka, Kinesis, Flume, Twitter firehose
  47. 47. DStreams = micro-batches
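
The micro-batch idea can be sketched as grouping events by the fixed interval in which they arrive. The example uses plain Scala collections in place of a DStream, and `MicroBatchSketch`, `Event`, `arrivalMs`, and `toBatches` are hypothetical names.

```scala
object MicroBatchSketch {
  // An event tagged with the time it arrived at the stream processor.
  final case class Event(arrivalMs: Long, payload: String)

  // Assign each event to the batch interval containing its arrival time,
  // then return the batches in time order, keyed by interval start.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.arrivalMs / batchMs * batchMs) // key = start of the interval
      .toSeq
      .sortBy { case (start, _) => start }
}
```

Each resulting batch is then processed with the same operations a batch job would use, which is why the streaming and batch programming models match.
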
  48. 48. Streaming Fault Tolerance Incoming data is replicated to 1 other node Write Ahead Log for sources that support ACKs Checkpointing for recovery if Driver fails
  49. 49. Direct Kafka Streaming: KafkaRDD No single Receiver Parallelizable No Write Ahead Log Kafka *is* the Write Ahead Log! KafkaRDD stores Kafka offsets KafkaRDD partitions recover from offsets

  50. 50. Spark MLlib & GraphX
  51. 51. Spark MLlib Common Algos Classifiers DecisionTree, RandomForest Clustering K-Means, Streaming K-Means Collaborative Filtering Alternating Least Squares (ALS)
  52. 52. Spark Text Processing Algos TF/IDF LDA Word2Vec *Pro-Tip: Use Stanford CoreNLP!
  53. 53. Spark ML Pipelines Modeled after scikit-learn
  54. 54. Spark GraphX PageRank Top Influencers Connected Components Measure of clusters Triangle Counting Measure of cluster density
  55. 55. Handling State
  56. 56. Message Queue Events Stream Processing Layer State / Database Happy Users
  57. 57. What Kind of State? Non-persistent / in-memory: concurrent viewers Short term: latest trends Longer term: raw event & aggregate storage ML Models, predictions, scored data
  58. 58. Spark RDDs Immutable, cache in memory and/or on disk Spark Streaming: updateStateByKey IndexedRDD - can update bits of data Snapshotting for recovery
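
The shape of a keyed state update in the style of `updateStateByKey` can be sketched with immutable maps: for every key, an update function receives the new values from the current batch plus the previous state, and returns the new state (or `None` to drop the key). The helper `updateState` below is a hypothetical local stand-in, not the Spark method.

```scala
object StateSketch {
  // Apply one micro-batch to keyed state. The update function has the shape
  // (new values for the key, previous state) => new state; None removes the key.
  def updateState[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Every known key is visited, including keys with no new values in this batch.
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }
}
```

For example, a running count of viewers per video could be kept like this:

```scala
val before = Map("video-1" -> 2)
val after = StateSketch.updateState(before, Seq("video-1" -> 1, "video-2" -> 1)) {
  (joins, old) => Some(old.getOrElse(0) + joins.sum)
}
// after == Map("video-1" -> 3, "video-2" -> 1)
```
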
  59. 59. •Massively Scalable • High Performance • Always On • Masterless
  60. 60. Scale Apache Cassandra • Scales Linearly to as many nodes as you need • Scales whenever you need
  61. 61. Performance Apache Cassandra • It’s Fast • Built to sustain massive data insertion rates in irregular pattern spikes
  62. 62. Fault Tolerance & Availability Apache Cassandra • Automatic Replication • Multi Datacenter • Decentralized - no single point of failure • Survive regional outages • New nodes automatically add themselves to the cluster • DataStax drivers automatically discover new nodes
  63. 63. Architecture Apache Cassandra • Distributed, Masterless Ring Architecture • Network Topology Aware • Flexible, Schemaless - your data structure can evolve seamlessly over time
  64. 64. To download: download/ ^ Highly recommended for local testing/cluster setup
  65. 65. Cassandra Data Modeling Primary key = (partition keys, clustering keys) Fast queries = fetch single partition Range scans by clustering key Must model for query patterns Clustering 1 Clustering 2 Clustering 3 Partition 1 Partition 2 Partition 3
  66. 66. City Bus Data Modeling Example Primary key = (Bus UUID, timestamp) Easy queries: location and speed of single bus for a range of time Can also query most recent location + speed of all buses (slower) 1020 s 1010 s 1000 s Bus A speed, GPS Bus B Bus C
  67. 67. Using Cassandra for Short Term Storage Idea is store and read small values Idempotent writes + huge write capacity = ideal for streaming ingestion For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
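
The bus example and the idempotent-write point can be modelled in memory to show why the primary key layout matters. In this sketch a `Map` of sorted maps stands in for a Cassandra table with bus ID as partition key and timestamp as clustering key; the names `BusModelSketch`, `Reading`, `write`, `range`, and `latestAll` are hypothetical.

```scala
import scala.collection.immutable.TreeMap

object BusModelSketch {
  final case class Reading(speed: Double, lat: Double, lon: Double)

  // Partition key (bus ID) -> rows sorted by clustering key (timestamp in seconds).
  type Table = Map[String, TreeMap[Long, Reading]]

  // Writing the same (busId, ts) twice overwrites the row, so retried writes are idempotent.
  def write(t: Table, busId: String, ts: Long, r: Reading): Table =
    t.updated(busId, t.getOrElse(busId, TreeMap.empty[Long, Reading]).updated(ts, r))

  // Fast path: one partition, a contiguous range of the clustering key (both ends inclusive).
  def range(t: Table, busId: String, fromTs: Long, toTs: Long): Seq[(Long, Reading)] =
    t.get(busId).map(_.range(fromTs, toTs + 1).toSeq).getOrElse(Seq.empty)

  // Slower path: touches every partition to find each bus's most recent row.
  def latestAll(t: Table): Map[String, (Long, Reading)] =
    t.collect { case (id, rows) if rows.nonEmpty => id -> rows.last }
}
```

A query for one bus over a time range reads a single sorted partition, while "latest position of all buses" has to visit every partition, which matches the slower query noted on the slide.
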
  68. 68. But Mommy! What about longer term data?
  69. 69. I need to read lots of data, fast!! - Ad hoc analytics of events - More specialized / geospatial - Building ML models from large quantities of data - Storing scored/classified data from models - OLAP / Data Warehousing
  70. 70. Can Cassandra Handle Batch? Cassandra tables are much better at lots of small reads than big data scans You CAN store data efficiently in C* Files seem easier for long term storage and analysis But are files compatible with streaming?
  71. 71. Lambda Architecture
  72. 72. Lambda is Hard and Expensive Very high TCO - Many moving parts - KV store, real time, batch Lots of monitoring, operations, headache Running similar code in two places Lower performance - lots of shuffling data, network hops, translating domain objects Reconcile queries against two different places
  73. 73. NoLambda A unified system Real-time processing and reprocessing No ETLs Fault tolerance Everything is a stream
  74. 74. Can Cassandra do batch and ad-hoc? Yes, it can be competitive with Hadoop actually…. If you know how to be creative with storing your data! Tuplejump/SnackFS - HDFS for Cassandra - analytics database Store your data using Protobuf / Avro / etc.
  75. 75. Introduction to FiloDB Efficient columnar storage - 5-10x better Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables Very fine grained filtering for sub-second concurrent queries Easy BI and ad-hoc analysis via Spark SQL/ Dataframes (JDBC etc.) Uses Cassandra for robust, proven storage
  76. 76. Combining FiloDB + Cassandra Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards) FiloDB + C* + Spark for efficient long term event storage Ad hoc / SQL / BI Data source for MLlib / building models Data storage for classified / predicted / scored data
  77. 77. Message Queue Events Spark Streaming Short term storage, K-V Adhoc, SQL, ML Cassandra FiloDB: Events, ad-hoc, batch Spark Dashboards, maps
  78. 78. Message Queue Events Spark Streaming Models Cassandra FiloDB: Long term event storage Spark Learned Data
  79. 79. FiloDB + Cassandra Robust, peer to peer, proven storage platform Use for short term snapshots, dashboards Use for efficient long term event storage & ad hoc querying Use as a source to build detailed models
  80. 80. Thank you! @evanfchan