Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Hadoop application architectures - using Customer 360 as an example

5,129 views

Published on

Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.

Published in: Engineering

Hadoop application architectures - using Customer 360 as an example

  1. 1. Hadoop Application Architectures: Architecting a Next Generation Data Platform Strata + Hadoop World, New York 2016 tiny.cloudera.com/app-arch-ny tiny.cloudera.com/app-arch-questions Mark Grover | @mark_grover Ted Malaska | @ted_malaska Jonathan Seidman | @jseidman
  2. 2. Questions? tiny.cloudera.com/app-arch- questions Logistics ▪ Break at 3:00 – 3:30 PM ▪ Questions at the end of each section ▪ Slides at tiny.cloudera.com/app-arch-ny ▪ Code at https://github.com/hadooparchitecturebook/Taxi360
  3. 3. Questions? tiny.cloudera.com/app-arch- questions About the book ▪ @hadooparchbook ▪ hadooparchitecturebook.com ▪ github.com/hadooparchitecturebook ▪ slideshare.com/hadooparchbook
  4. 4. Questions? tiny.cloudera.com/app-arch- questions About the presenters ▪ Principal Solutions Architect at Cloudera ▪ Previously, lead architect at FINRA ▪ Contributor to Apache Hadoop, HBase, Flume, Avro, Pig, Spark, YARN, Sqoop, Kudu, Kafka Ted Malaska
  5. 5. Questions? tiny.cloudera.com/app-arch- questions About the presenters ▪ Partner Software Engineer at Cloudera ▪ Contributor to Apache Sqoop. ▪ Previously, Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data Jonathan Seidman
  6. 6. Questions? tiny.cloudera.com/app-arch- questions About the presenters ▪ Software Engineer on Spark at Cloudera ▪ Committer on Apache Bigtop, PMC member on Apache Sentry(incubating) ▪ Contributor to Apache Spark, Hadoop, Hive, Sqoop, Pig, Flume Mark Grover
  7. 7. Case Study Overview Internet of Things and Entity 360
  8. 8. Questions? tiny.cloudera.com/app-arch- questions Customer 360
  9. 9. Questions? tiny.cloudera.com/app-arch- questions Connected Cars
  10. 10. Questions? tiny.cloudera.com/app-arch- questions Entity (Taxi) 360 View Geo-location/ Traffic Data Customer Data Maintenance Data Other Data Sources Streaming Vehicle Data
  11. 11. Questions? tiny.cloudera.com/app-arch- questions What Makes Hadoop a Fit? Data Sources Extract Transform Load The early days…
  12. 12. Questions? tiny.cloudera.com/app-arch- questions What Makes Hadoop a Fit? SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE ERP, CRM, RDBMS, MACHINES FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS EXTERNAL DATA SOURCES Today…
  13. 13. Questions? tiny.cloudera.com/app-arch- questions Enabling a Range of New Use Cases… Fraud Detection Market transactions Internet of Things Network Security
  14. 14. Questions? tiny.cloudera.com/app-arch- questions Hadoop Challenges Kafka StreamsKafka Connect Kafka
  15. 15. Questions? tiny.cloudera.com/app-arch- questions Challenges - Architectural Considerations ▪ Reliable and scalable ingress of multiple data types and sources: - High volume event data? Batch data? ▪ Reliable and scalable storage to support multiple workloads and access patterns - Historical data? Real-time search? Analytics ▪ Processing engines (for background processing): - Stream processing? Batch processing? ▪ Data Modeling - Modeling data for real-time random access? Analytic access? Batch access?
  16. 16. Case Study Requirements Overview
  17. 17. Questions? tiny.cloudera.com/app-arch- questions Requirements ▪ Allow users (technical and non-technical) to analyze and visualize data…
  18. 18. Questions? tiny.cloudera.com/app-arch- questions Requirements ▪ Provide analysts with query capabilities via a standard interface…
  19. 19. Questions? tiny.cloudera.com/app-arch- questions Requirements ▪ Provide developers the ability to perform batch processing on historical data…
  20. 20. Questions? tiny.cloudera.com/app-arch- questions Requirements ▪ To support all this, we need: - Reliable ingestion of streaming and batch data. - Ability to perform transformations on streaming data in flight. - Ability to perform sophisticated processing of historical data.
  21. 21. High level architecture Walkthrough
  22. 22. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Data Producer Pub-Sub Processing & Ingestion Engine Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  23. 23. Data Buffering Considerations
  24. 24. Questions? tiny.cloudera.com/app-arch- questions But wait! What about batch data?
  25. 25. Event Data Buffering Overview
  26. 26. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Data Producer Pub-Sub Processing & Ingestion Engine Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  27. 27. Questions? tiny.cloudera.com/app-arch- questions Buffering Data ▪ What do we mean by “buffering” and why do we need it? event,event,event,event,event,event… This is bad!
  28. 28. Questions? tiny.cloudera.com/app-arch- questions Buffering Data – Message Brokers Publisher Publisher Publisher Message Queue Subscriber Subscriber Subscriber
  29. 29. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Processing & Ingestion Engine Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT Rest NRT Dashboard
  30. 30. Questions? tiny.cloudera.com/app-arch- questions Buffering Data – Flume vs. Kafka ▪ Flume – well integrated with Hadoop. - Great choice when ingesting data into HDFS. - Can support simple transformations. - Less coding. ▪ But… - Interface between Kafka and the streaming layer is already well defined. - Transformations are done in the stream processing layer. - We need a more general purpose system at this layer.
  31. 31. Questions? tiny.cloudera.com/app-arch- questions Kafka Connect Kafka Connect (Source) Kafka Connect (Sink)
  32. 32. Questions? tiny.cloudera.com/app-arch- questions What is Kafka? ▪ It’s like a message queue, right? - Actually, it’s a “distributed commit log”. 0 1 2 3 4 5 6 7 8 Data Source Data Consumer A Data Consumer B
  33. 33. Questions? tiny.cloudera.com/app-arch- questions Topics and Partitions ▪ Messages are organized into topics, and each topic is split into partitions. - Each partition is an immutable, time-sequenced log of messages. 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 Partition 0 Partition 1 Partition 2 Data Source Topi c
  34. 34. Questions? tiny.cloudera.com/app-arch- questions Kafka Background (Physical) ProducersProducersProducers BrockerBrockerBroker ConsumersConsumersConsumers ZooKeeperZooKeeperZooKeeper
  35. 35. Questions? tiny.cloudera.com/app-arch- questions Consumers Kafka In Our Architecture Taxi Trip Data Producer Kafka taxi-trip-input Topic Stream Processing (Analytic) Stream Processing (Lookup) Stream Processing (Search) Stream Processing (Long Term)
  36. 36. Questions? tiny.cloudera.com/app-arch- questions Kafka Considerations – Reliability ▪ Different reliability levels for topics: Taxi Trip Data Kafka taxi-trip-input Twitter customer-sentiment 100% – dups are ok (“At least once”) <=100% (“At most once”)
  37. 37. Questions? tiny.cloudera.com/app-arch- questions Kafka Considerations – Reliability ▪ But remember there are tradeoffs…
  38. 38. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Replication Producer Broker Partition1 Partition2 Partition3 Leader
  39. 39. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Replication Producer Broker Partition1 Partition2 Partition3
  40. 40. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Replication Producer Broker Partition1 Partition2 Partition3 Broker Partition1 Partition2 Partition3 Leader
  41. 41. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability– Replication Producer Broker Partition1 Partition2 Partition3 Broker Partition1 Partition2 Partition3 Leader Leader
  42. 42. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Replication Producer Broker Partition1 Partition2 Partition3 Broker Partition1 Partition2 Partition3 Broker Partition1 Partition2 Partition3 Leader
  43. 43. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Replication ▪ So how does this relate to our application? kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 3 --create --topic taxi-trip-input kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 1 --create –topic customer-sentiment
  44. 44. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Producers Taxi Trip Data Kafka taxi_trip_input Partition 1 Partition 2 Partition 3 Topic B Partition 1 Partition 2 Partition 3 Message failure? Producer Resend message acks=all
  45. 45. Questions? tiny.cloudera.com/app-arch- questions Kafka Reliability – Producers ▪ What about duplicates? Taxi Trip Data Kafka taxi_trip_input Partition 1 Partition 2 Partition 3 Topic B Partition 1 Partition 2 Partition 3 Producer ID Message 1000 2009-01-04 03:02:00,1,2.629,... 1001 2009-01-04 03:38:00,3,4.549… 1001 2009-01-04 03:38:00,3,4.549…
  46. 46. Questions? tiny.cloudera.com/app-arch- questions Kafka Scaling – Partitions Producer Kafka taxi-trip-input Partition 1 Partition 2 Partition 3 Consumer Group Consumer Consumer Consumer
  47. 47. Questions? tiny.cloudera.com/app-arch- questions Kafka Scaling – Partitions Producer Kafka taxi-trip-input Partition 1 Partition 2 Partition 3 Consumer Group Consumer Consumer Consumer Partition 4 Partition 5 Consumer Consumer Higher throughput Higher throughput More resources (memory) More resources (file handles) Producer
  48. 48. Questions? tiny.cloudera.com/app-arch- questions Kafka Scaling – Partitions
  49. 49. Questions? tiny.cloudera.com/app-arch- questions Kafka Scaling – Producers Producer Kafka taxi-trip-input Partition 1 Partition 2 Partition 3 Consumer Group Consumer Consumer Consumer Partition 4 Partition 5 Consumer Consumer Producer
  50. 50. Questions? tiny.cloudera.com/app-arch- questions Kafka Message Loss ▪Kafka -Failure >= then acknowledgements -Use a higher ack level ▪Producer -If disconnected from Kafka and buffer overflows -Consider a producer side buffer (e.g. Flume) ▪Source -If data retention is not enough before Producer consumes messages
  51. 51. Stream Processing Considerations
  52. 52. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Processing & Ingestion Engine Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  53. 53. Questions? tiny.cloudera.com/app-arch- questions Streaming agenda ▪ What do we mean by streaming? ▪ Streaming use-cases ▪ Streaming semantics ▪ Which streaming engine to choose? ▪ Streaming in our use-case
  54. 54. What do we mean by streaming?
  55. 55. Questions? tiny.cloudera.com/app-arch- questions What do we mean by streaming? Constant low milliseconds & under Low milliseconds to seconds, delay in case of failures 10s of seconds or more, re-run in case of failures Real-time Near real-time Batch
  56. 56. Questions? tiny.cloudera.com/app-arch- questions What do we mean by streaming? Constant low milliseconds & under Low milliseconds to seconds, delay in case of failures 10s of seconds or more, re-run in case of failures Real-time Near real-time Batch
  57. 57. Questions? tiny.cloudera.com/app-arch- questions But, there’s no free lunch Constant low milliseconds & under Low milliseconds to seconds, delay in case of failures 10s of seconds or more, re-run in case of failures Real-time Near real-time Batch “Difficult” architectures, lower latency “Easier” architectures, higher latency
  58. 58. Streaming use-cases
  59. 59. Questions? tiny.cloudera.com/app-arch- questions Streaming Use-cases ▪ Ingestion (most relevant in our use-case) ▪ Simple transformations - Decision (e.g. anomaly detection) - Enrichment (e.g. add a state based on zipcode) ▪ Advanced usage - Machine Learning - Windowing
  60. 60. Questions? tiny.cloudera.com/app-arch- questions #1 - Simple ingestion Buffer Event e Stream Processing Long term storage Event e
  61. 61. Questions? tiny.cloudera.com/app-arch- questions #2 - Enrichment Buffer Event e Stream Processing Storage Event e’ e’ = enriched event e Context store
  62. 62. Questions? tiny.cloudera.com/app-arch- questions #2 - Decision Buffer Event e Stream Processing Storage Event e’ e’ = e + decision Rules
  63. 63. Questions? tiny.cloudera.com/app-arch- questions #3 – Advanced usage Buffer Event e Stream Processing Storage Event e’ e’ = aggregation or windowed aggregation Model
  64. 64. Questions? tiny.cloudera.com/app-arch- questions #1 – Simple Ingestion 1. Zero transformation - No transformation, plain ingest - Keep the original format – SequenceFile, Text, etc. - Allows to store data that may have errors in the schema 2. Format transformation - Simply change the format of the field - To a structured format, say, Avro, for example - Can do schema validation 3. Atomic transformation - Mask a credit card number
  65. 65. Questions? tiny.cloudera.com/app-arch- questions #2 - Enrichment Buffer Event e Stream Processing Storage Event e’ e’ = enriched event e Context store Need to store the context somewhere
  66. 66. Questions? tiny.cloudera.com/app-arch- questions Where to store the context? 1. Locally Broadcast Cached Dim Data - Local to Process (On Heap, Off Heap) - Local to Node (Off Process) 2. Partitioned Cache - Shuffle to move new data to partitioned cache 3. External Fetch Data (e.g. HBase, Memcached)
  67. 67. Questions? tiny.cloudera.com/app-arch- questions #1a - Locally broadcast cached data Could be On heap or Off heap
  68. 68. Questions? tiny.cloudera.com/app-arch- questions #1b - Off process cached data Data is cached on the node, outside of process. Potentially in an external system like Rocks DB
  69. 69. Questions? tiny.cloudera.com/app-arch- questions #2 - Partitioned cache data Data is partitioned based on field(s) and then cached
  70. 70. Questions? tiny.cloudera.com/app-arch- questions #3 - External fetch Data fetched from external system
  71. 71. Questions? tiny.cloudera.com/app-arch- questions Partitioned cache + external
  72. 72. Streaming semantics
  73. 73. Questions? tiny.cloudera.com/app-arch- questions Delivery Types ▪ At most once - Not good for many cases - Only where performance/SLA is more important than accuracy ▪ Exactly once - Expensive to achieve but desirable ▪ At least once - Easiest to achieve
  74. 74. Questions? tiny.cloudera.com/app-arch- questions Semantics of our architecture Source System 1 Destination systemSource System 2 Source System 3 Ingest Extract Streaming engine Push Message broker
  75. 75. Questions? tiny.cloudera.com/app-arch- questions Classification of storage systems ▪ File based - S3 - HDFS ▪ NoSQL - HBase - Cassandra ▪ Document based - Search ▪ NoSQL-SQL - Kudu
  76. 76. Questions? tiny.cloudera.com/app-arch- questions Classification of storage systems ▪ File based - S3 - HDFS ▪ NoSQL - HBase - Cassandra ▪ Document based - Search ▪ NoSQL-SQL - Kudu De-duplication at file level Semantics at key/record level
  77. 77. Which streaming engine to choose?
  78. 78. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Processing & Ingestion Engine Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard Apache Beam Kafka Streams
  79. 79. Questions? tiny.cloudera.com/app-arch- questions Spark Streaming ▪ Micro batch based architecture ▪ Allows stateful transformations ▪ Feature rich - Windowing - Sessionization - ML - SQL (Structured Streaming)
  80. 80. Questions? tiny.cloudera.com/app-arch- questions Spark Streaming DStream DStream DStream Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Print First Batc h Secon d Batch
  81. 81. Questions? tiny.cloudera.com/app-arch- questions DStream DStream DStream Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD partitions RDD Parition RDD Single Pass Filter Count Pre-first Batch First Batc h Secon d Batch Stateful RDD 1 Print Stateful RDD 2 Stateful RDD 1 Spark Streaming
  82. 82. Questions? tiny.cloudera.com/app-arch- questions Flink ▪ True “streaming” system, but not as feature rich as Spark ▪ Much better event time handling ▪ Good built-in backpressure support ▪ Allows stateful transformations ▪ Lower Latency - No Micro Batching - Asynchronous Barrier Snapshotting (ABS)
  83. 83. Questions? tiny.cloudera.com/app-arch- questions Flink - ABS Operator Buffer
  84. 84. Questions? tiny.cloudera.com/app-arch- questions Operator Buffer Operator Buffer Flink - ABS Barrier 1A Hit Barrier 1B Still Behind
  85. 85. Questions? tiny.cloudera.com/app-arch- questions Operator Buffer Flink - ABS Both Barriers Hit Operator Buffer Barrier 1A Hit Barrier 1B Still Behind
  86. 86. Questions? tiny.cloudera.com/app-arch- questions Operator Buffer Flink - ABS Both Barriers Hit Operator Buffer Barrier is combined and can move on Buffer can be flushed out
  87. 87. Questions? tiny.cloudera.com/app-arch- questions Storm ▪ Old school ▪ Didn’t manage state – had to use Trident ▪ No good support for batch processing
  88. 88. Questions? tiny.cloudera.com/app-arch- questions Samza ▪ Good integration with Kafka ▪ Doesn’t support batch ▪ Forked by Kafka Streams
  89. 89. Questions? tiny.cloudera.com/app-arch- questions Flume ▪ Well integrated with the Hadoop ecosystem ▪ Allowed interceptors (for simple transformations) ▪ Supports buffering - Memory - File - Kafka ▪ But no real fault-tolerance ▪ No state management
  90. 90. Questions? tiny.cloudera.com/app-arch- questions Others ▪ Apache Apex ▪ Kafka Streams ▪ Heron
  91. 91. Questions? tiny.cloudera.com/app-arch- questions Apache Beam ▪ Abstraction on top of Streaming Engines ▪ Best support for Google Dataflow
  92. 92. Streaming in our use- case
  93. 93. Questions? tiny.cloudera.com/app-arch- questions Spark Streaming ▪ We chose Spark Streaming because: - Same execution engine for batch and streaming - Similar code for batch and streaming - Support for security, kafka integration - Thriving community - We don’t have low millisecond requirements
  94. 94. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  95. 95. Storage Layer Considerations
  96. 96. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  97. 97. Data Modeling
  98. 98. Questions? tiny.cloudera.com/app-arch- questions Structured Landing Zones Hive Relational Model Kudu/HDFS Hive Nested Model HDFS Aggregations Kudu HBase Entity Time Series Solr Traditional SQL Optimized for nested Structures like JSON Optimized Storing and mutating aggregates Optimized Entity 360 and time base access Optimized faceted charts and reverse index look ups
  99. 99. Questions? tiny.cloudera.com/app-arch- questions Relational ▪ Everyone knows it ▪ Simple ▪ Very painful to do large Join ▪ May lead to customers making bad queries ▪ Easier to mutate
  100. 100. Questions? tiny.cloudera.com/app-arch- questions Kudu Data Models ▪ Entity Summary Tables - Quick update and access of aggregate of Entity Stats ▪ Event Tables - Number of Partitioning strategies - Partition by Entity - Partition by Hash on time
  101. 101. Questions? tiny.cloudera.com/app-arch- questions Kudu: Table Creation Example CREATE EXTERNALTABLEny_taxi_trip( vender_id STRING, tpep_pickup_datetime TIMESTAMP, tpep_dropoff_datetimeTIMESTAMP, passenger_count INT, trip_distanceDOUBLE, pickup_longitudeDOUBLE, pickup_latitudeDOUBLE, rate_code_idSTRING, store_and_fwd_flag STRING, dropoff_longitudeDOUBLE, dropoff_latitudeDOUBLE, payment_typeSTRING, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, improvement_surcharge DOUBLE, tip_amountDOUBLE, tolls_amountDOUBLE, total_amountDOUBLE ) STORED AS PARQUET LOCATION 'usr/root/hive/ny_taxi_trip';
  102. 102. Questions? tiny.cloudera.com/app-arch- questions Kudu: Data Population SparkStreamingTaxiTripToKudu.scala
  103. 103. Questions? tiny.cloudera.com/app-arch- questions Kudu: REST API KuduServiceLayer.scala
  104. 104. Questions? tiny.cloudera.com/app-arch- questions View Strategies Hive Relational Model Hive Nested Model Models Hive Normal Views Hive Materialized Table Views Use in the cases where the view requires a join that is done through a shuffle Use only for tables that filter records/columns or use for marking fields
  105. 105. Questions? tiny.cloudera.com/app-arch- questions Nested ▪ Less Space than Denormalization ▪ Still have tables but the cost of joins is all but gone ▪ Also great for cartesian joins - N x M vs N + M ▪ Not really supported yet with Kudu or HBase with SQL
  106. 106. Questions? tiny.cloudera.com/app-arch- questions Nested Writing Example in Spark { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate"}, { "id": "1003", "type": "Blueberry" }, { "id": "1004", "type": "Devil's Food" } ] }, "topping": [ { "id": "5001","type": "None"}, { "id": "5002","type": "Glazed"}, { "id": "5005","type": "Sugar"}, { "id": "5007","type": "PowderedSugar"}, { "id": "5006","type": "Chocolatewith Sprinkles"} ]
  107. 107. Questions? tiny.cloudera.com/app-arch- questions Nested Writing Example in Spark val jsonDF = hiveContext.read.json(jsonRDD) jsonDF.write.parquet("./parquet") hiveContext.createExternalTable("jsonNestedTable", "./parquet")
  108. 108. Questions? tiny.cloudera.com/app-arch- questions Nested: Taxi Example KuduToNestedHDFS.scala
  109. 109. Questions? tiny.cloudera.com/app-arch- questions Entity Centric Time Series ▪ Partition by Entity ID ▪ Order by Time ▪ Allows for free windowing ▪ Allows for fetching of single time window of single entity at web scale
  110. 110. Questions? tiny.cloudera.com/app-arch- questions HBase Entity Time Series Cust-A, 10 Cust-A, 20 Cust-A, 40 Cust-C, 10 Cust-C, 20 Cust-C, 30 Cust-C, 40 Cust-B, 10 Cust-B, 20 Cust-B, 30 Cust-B, 40 Cust-F, 20 Cust-F, 30 Cust-F, 40 Cust-D, 10 Cust-D, 20 Cust-D, 40 Cust-G, 10 Cust-G, 20 Cust-G, 30 Cust-G, 40
  111. 111. Questions? tiny.cloudera.com/app-arch- questions HBase Entity Time Series Cust-A, 10 Cust-A, 20 Cust-A, 40 Cust-C, 10 Cust-C, 20 Cust-C, 30 Cust-C, 40 Cust-B, 10 Cust-B, 20 Cust-B, 30 Cust-B, 40 Cust-F, 20 Cust-F, 30 Cust-F, 40 Cust-D, 10 Cust-D, 20 Cust-D, 40 Cust-G, 10 Cust-G, 20 Cust-G, 30 Cust-G, 40 Rest Call Short Scan
  112. 112. Questions? tiny.cloudera.com/app-arch- questions HBase Entity Time Series Cust-A, 10 Cust-A, 20 Cust-A, 40 Cust-C, 10 Cust-C, 20 Cust-C, 30 Cust-C, 40 Cust-B, 10 Cust-B, 20 Cust-B, 30 Cust-B, 40 Cust-F, 20 Cust-F, 30 Cust-F, 40 Cust-D, 10 Cust-D, 20 Cust-D, 40 Cust-G, 10 Cust-G, 20 Cust-G, 30 Cust-G, 40 Mapper Mapper Mapper
  113. 113. Questions? tiny.cloudera.com/app-arch- questions HBase Entity Time Series Cust-A, 10 Cust-A, 20 Cust-A, 40 Cust-C, 10 Cust-C, 20 Cust-C, 30 Cust-C, 40 Cust-B, 10 Cust-B, 20 Cust-B, 30 Cust-B, 40 Cust-F, 20 Cust-F, 30 Cust-F, 40 Cust-D, 10 Cust-D, 20 Cust-D, 40 Cust-G, 10 Cust-G, 20 Cust-G, 30 Cust-G, 40 Mapper Mapper Mapper
  114. 114. Questions? tiny.cloudera.com/app-arch- questions HBase: Row Key Example def generateRowKey(customerTrans: CustomerTran, numOfSalts:Int): Array[Byte] = { val salt = StringUtils.leftPad( Math.abs(customerTrans.customerId.hashCode % numOfSalts).toString, 4, "0") Bytes.toBytes(salt + ":" + customerTrans.customerId + ":" + StringUtils.leftPad(customerTrans.eventTimeStamp.toString, 11, "0") + ":trans:" + customerTrans.transId) }
  115. 115. Questions? tiny.cloudera.com/app-arch- questions HBase: Population Example SparkStreamingTaxiTripToHBase.scala
  116. 116. Questions? tiny.cloudera.com/app-arch- questions HBase: REST Example HBaseServiceLayer.scala
  117. 117. Questions? tiny.cloudera.com/app-arch- questions Solr: Data Model ▪ Think of it like a cube on a object type - In our case a taxi trip - Allows for rollups and aggregations from object’s point of view - Think of objects as immutable - Try to find time based events - May design more than one object type
  118. 118. Questions? tiny.cloudera.com/app-arch- questions Solr Details 1 Trip:101 1 2 Trip:102 1 3 Trip:103 1 ID Document Live 4 Trip:104 1 5 Trip:105 1 ID Field Value Documents 1 Cash 1,3 2 Credit 2 3 Debit 4,5
  119. 119. Questions? tiny.cloudera.com/app-arch- questions Single Value Aggregations ▪Get Array Lengths ID Field Value Documents 1 Cash 1,3 2 Credit 2 3 Debit 4,5
  120. 120. Questions? tiny.cloudera.com/app-arch- questions Multi Value Aggregations ▪Ordered Merge Join -Think like a zipper -Scans -No Lookups ▪Top N from both sides -Leaving the rest to other ▪Indexes distributed ▪No need to read document data 1 4 5 7 8 9 10 14 16 2 3 6 11 12 13 15 17 18 1 2 3 6 7 8 10 15 18 Cash Credit Vender A 4 5 9 11 12 13 14 16 17Vender B
  121. 121. Questions? tiny.cloudera.com/app-arch- questions Solr: Population Example SparkStreamingTaxiTripToSolR.scala
  122. 122. Questions? tiny.cloudera.com/app-arch- questions Storage High level architecture Source Transport Stream Processing Access Custom Producer or
  123. 123. Batch Processing Considerations
  124. 124. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  125. 125. Questions? tiny.cloudera.com/app-arch- questions Why have batch processing? ▪ When you need a larger context - Say, to train a model ▪ Complex periodic job that does something - Convert data to a nested structure for reduced number of shuffles ▪ In our use-case, - Kudu -> HDFS Nested is batch processing - KMeans calculation is also in bash
  126. 126. Questions? tiny.cloudera.com/app-arch- questions Batch processing options ▪ Spark (+ MLlib) ▪ MapReduce (+ Mahout) ▪ Flink (+ Flink ML)
  127. 127. Questions? tiny.cloudera.com/app-arch- questions Spark ▪ Pretty popular ▪ Much faster than MapReduce ▪ Thriving community
  128. 128. Questions? tiny.cloudera.com/app-arch- questions MapReduce ▪ Sloooooow
  129. 129. Questions? tiny.cloudera.com/app-arch- questions Flink ▪ Pretty popular ▪ Batch is a special case of Streaming ▪ Developing community
  130. 130. Questions? tiny.cloudera.com/app-arch- questions In our use-case ▪ We chose Spark - We were using Spark Streaming anyways - Similar code between Spark and Spark Streaming - Thriving community
  131. 131. Interactive Data Access Considerations
  132. 132. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  133. 133. Questions? tiny.cloudera.com/app-arch- questions Types of data access ▪ REST server/APIs for querying entities and aggregates ▪ UI for displaying search facets ▪ SQL engine
  134. 134. REST servers Considerations
  135. 135. Questions? tiny.cloudera.com/app-arch- questions Why have REST server? ▪ Tired of business people telling us how to access data ▪ Serves as an interface between the data engineers and business folks ▪ Lets business folks decide access patterns ▪ Engineers to optimize those patterns ▪ Brownie points from your boss ▪ And, it’s not that difficult to write!
  136. 136. Questions? tiny.cloudera.com/app-arch- questions Don’t believe me? import org.mortbay.jetty.Server import org.mortbay.jetty.servlet.{Context, ServletHolder} … val server = new Server(port) val sh = new ServletHolder(classOf[ServletContainer]) sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass", "com.sun.jersey.api.core.PackagesResourceConfig") sh.setInitParameter("com.sun.jersey.config.property.packages", "com.hadooparchitecturebook.taxi360.server.hbase") sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true”) val context = new Context(server, "/", Context.SESSIONS) context.addServlet(sh, "/*”) server.start() server.join()
  137. 137. Questions? tiny.cloudera.com/app-arch- questions Then, write a ServiceLayer @GET @Path("vender/{venderId}/timeline") @Produces(Array(MediaType.APPLICATION_JSON)) def getTripTimeLine (@PathParam("venderId") venderId:String, @QueryParam("startTime") startTime:String = Long.MinValue.toString, @QueryParam("endTime") endTime:String = Long.MaxValue.toString): Array[NyTaxiYellowTrip] = {
  138. 138. Questions? tiny.cloudera.com/app-arch- questions Use REST! Say no to business people! ▪ Access data like so: http://<serverURL>:8080/vender/{venderId}/timeline
  139. 139. UI Considerations
  140. 140. Questions? tiny.cloudera.com/app-arch- questions UI requirements Something that can ▪ Represent search results really well ▪ Integrates with Apache Solr on Hadoop
  141. 141. Questions? tiny.cloudera.com/app-arch- questions UI options ▪ Hue ▪ Banana ▪ Kibana
  142. 142. Questions? tiny.cloudera.com/app-arch- questions We choose Hue ▪ Because it’s included ▪ Please look at the others
  143. 143. SQL engines Considerations
  144. 144. Questions? tiny.cloudera.com/app-arch- questions SQL engine criteria ▪ Low latency SQL access ▪ Allows for high concurrency ▪ JDBC/ODBC integration ▪ Capable of large scale aggregation ▪ Optionally integrates with Kudu for real-time updates to SQL tables
  145. 145. Questions? tiny.cloudera.com/app-arch- questions Apache Hive ▪ Good JDBC integration ▪ Not really low latency, even when using Tez ▪ Doesn’t integrate with Kudu ▪ Can run at MR, Spark, or Tez
  146. 146. Questions? tiny.cloudera.com/app-arch- questions Presto ▪ Low latency SQL engine from Facebook ▪ Provides JDBC/ODBC access ▪ Is only in-memory, large aggregations can lead to OOM errors ▪ Doesn’t integrate with Kudu
  147. 147. Questions? tiny.cloudera.com/app-arch- questions Apache Impala ▪ Low latency SQL access ▪ Provides JDBC/ODBC access ▪ Excellent concurrency support ▪ Integrates with Kudu for real-time SQL
  148. 148. Questions? tiny.cloudera.com/app-arch- questions Apache Drill ▪ Similar in architecture to Impala ▪ Provides JDBC/ODBC access ▪ Doesn’t integrate with Kudu
  149. 149. Questions? tiny.cloudera.com/app-arch- questions Spark SQL ▪ Builds on top of Spark ▪ JDBC/ODBC access only via Spark Thrift Server - Doesn’t scale well with larger number of concurrent users - Doesn’t fully provide secure access.
  150. 150. Questions? tiny.cloudera.com/app-arch- questions We choose ▪ Spark SQL ▪ Impala
  151. 151. Overall Architecture Review
  152. 152. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer Processing & Ingestion Engine Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT Rest NRT Dashboard
  153. 153. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or Nested Tables Indexed Cube Relational Tables Entity Time Series Lookup Batch Processing SQL NRT REST NRT Dashboard
  154. 154. Questions? tiny.cloudera.com/app-arch- questions Storage High level architecture Source Transport Stream Processing Custom Producer or Access Batch Processing SQL NRT REST NRT Dashboard
  155. 155. Questions? tiny.cloudera.com/app-arch- questions Access High level architecture Source Transport Stream Processing Storage Custom Producer or
  156. 156. Questions? tiny.cloudera.com/app-arch- questions High level architecture Source Transport Stream Processing Storage Access Custom Producer or
  157. 157. Demo!
  158. 158. Questions? tiny.cloudera.com/app-arch- questions High Level of the Demo Design Producer Kafka Topic Foo Partition 1 Partition 2 Partition 3 Spark Streaming Kudu Spark Streaming HBase Spark Streaming Solr Spark Streaming HDFS Kudu HBase Solr HDFS SQL REST REST Hue SQL
  159. 159. Where else to find us?
  160. 160. Questions? tiny.cloudera.com/app-arch- questions Book Signing! ▪ 6:50 pm on Wednesday (tomorrow) at Cloudera booth
  161. 161. Questions? tiny.cloudera.com/app-arch- questions Other Sessions ▪ Ask Us Anything session (all) – Thursday, 1:15 pm ▪ Top Five Mistakes When Writing Spark Applications – Wednesday, 1:15 pm
  162. 162. Thank you! @hadooparchbook tiny.cloudera.com/app-arch-slides Jonathan Seidman | @jseidman Ted Malaska | @ted_malaska Mark Grover | @mark_grover

×