Architecting a Next Generation Data Platform

Hadoop Application Architectures: Architecting a Next Generation Data Platform
Strata + Hadoop World, San Jose 2017
tiny.cloudera.com/app-arch-sanjose
tiny.cloudera.com/app-arch-questions
Mark Grover | @mark_grover
Ted Malaska | @ted_malaska
Jonathan Seidman | @jseidman
Gwen Shapira | @gwenshap

Logistics
▪ Break at 3:00 – 3:30 PM
▪ Questions at the end of each section
▪ Slides at tiny.cloudera.com/app-arch-sanjose
▪ Code at https://github.com/hadooparchitecturebook/Taxi360

About the book
▪ @hadooparchbook
▪ hadooparchitecturebook.com
▪ github.com/hadooparchitecturebook
▪ slideshare.com/hadooparchbook

About the presenters
Ted Malaska
▪ Technical Group Architect at Blizzard Entertainment
▪ Principal Solutions Architect at Cloudera
▪ Big Data Architect at FINRA
▪ Contributor to Apache HDFS, HBase, Flume, Avro, Pig, Spark, YARN, Sqoop, Kudu, Kafka

Mark Grover
▪ Software Engineer on Spark at Cloudera
▪ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
▪ Contributor to Apache Spark, Hadoop, Hive, Sqoop, Pig, Flume

Jonathan Seidman
▪ Software Engineer at Cloudera
▪ Contributor to Apache Sqoop
▪ Previously Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data

Gwen Shapira
▪ Product Manager at Confluent
▪ PMC member for Apache Kafka
▪ Previously software engineer at Cloudera
▪ @gwenshap on Twitter

Case Study Overview
Internet of Things and Entity 360

Customer 360

Connected Cars

Entity (Taxi) 360 View
▪ Geo-location/Traffic Data
▪ Customer Data
▪ Maintenance Data
▪ Other Data Sources
▪ Streaming Vehicle Data

What Makes Hadoop a Fit?
Data Sources Extract Transform Load
The early days…

What Makes Hadoop a Fit?
SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE
ERP, CRM, RDBMS, MACHINES FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS EXTERNAL DATA SOURCES
Today…

Enabling a Range of New Use Cases…
Fraud Detection Market
Transactions
Internet of
Things
Network Security

Hadoop Challenges
Kafka Streams
Kafka Connect
Kafka

Challenges - Architectural Considerations
▪ Reliable and scalable ingress of multiple data types and sources:
- High volume event data? Batch data?
▪ Reliable and scalable storage to support multiple workloads and access patterns
- Historical data? Real-time search? Analytics?
▪ Processing engines (for background processing):
- Stream processing? Batch processing?
▪ Data Modeling
- Modeling data for real-time random access? Analytic access? Batch access?

Case Study
Requirements
Overview

Requirements
▪ Allow users (technical and non-technical) to analyze and visualize data…

Requirements
▪ Provide analysts with query capabilities via a standard interface…

Requirements
▪ Provide developers the ability to perform batch processing on historical data…

Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.

High Level Architecture Walkthrough

[Architecture diagram, by stage]
▪ Source: Data Producer
▪ Transport: Pub-Sub
▪ Stream Processing: Processing & Ingestion Engine
▪ Storage: Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup
▪ Access: Batch Processing, SQL, NRT REST, NRT Dashboard

[Architecture diagram: Source (Data Producers) -> Transport (Pub Sub) -> Stream Processing (Processing & Ingestion Engine) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]
Pub Sub

Key to Customer 360 Success
Your project is only as good as the quality and variety of data sources
▪ Geo-location/Traffic Data
▪ Customer Data
▪ Maintenance Data
▪ Other Data Sources
▪ Streaming Vehicle Data
▪ Files: CSV? XML? JSON?
▪ Twitter?
▪ Mainframe?
▪ Database
▪ Salesforce?
▪ MQTT

Get the Data: Flume vs. Kafka
▪ Flume – well integrated with Hadoop.
- Part of Hadoop ecosystem
- Great choice when ingesting data into HDFS.
- Can support simple transformations.
▪ Kafka – flexible, get-everything pipe
- Producers in ~20 languages
- REST API
- Huge connector ecosystem

Kafka Clients
Apache Kafka Clients Ecosystem Clients

REST Proxy
Talking to Non-native Kafka Apps and Outside the Firewall
▪ Provides a RESTful interface to a Kafka cluster
▪ Simplifies message creation and consumption
▪ Simplifies administrative actions
[Diagram labels: Non-Java Applications, REST / HTTP, REST Proxy, Native Kafka Java Applications]

Kafka Connect
Streaming Data Capture
[Diagram: sources and sinks (JDBC, Logs, MQTT, RDBMS, Key/Value, HDFS), each attached to Kafka by a Connector built on the Kafka Connect API]
▪ Fault tolerant
▪ Manage hundreds of data sources and sinks
▪ Preserves data schema
▪ Part of Apache Kafka project
▪ Includes simple transformations

Ecosystem of Connectors
Databases Datastore/File Store
Analytics Applications / Other

How Connect Works?
[Diagram labels: Logs and MQTT sources; Log Connector and MQTT Connector; two Log Tasks and two MQTT Tasks; REST API]

Schema Registry
[Diagram: Source 1 and Source 2 each write through a Serializer to a Kafka Topic, next to the Schema Registry; example consumers: Elastic, Cassandra, HDFS]
▪ Define the expected fields for each Kafka topic
▪ Automatically handle schema changes (e.g. new fields)
▪ Prevent backwards incompatible changes
▪ Get different data sources to talk the same language

[Architecture diagram: Source -> Transport (Pub Sub) -> Stream Processing (Processing & Ingestion Engine) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

But wait!
What about batch data?

[Architecture diagram: Source (Data Producers) -> Buffer (Pub-Sub) -> Stream Processing (Processing & Ingestion Engine) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

Buffering Data
▪ What do we mean by “buffering” and why do we need it?
event, event, event, event, event, event…
This is bad!
▪ Network partitions happen
▪ Producers and consumers work at different rates
▪ Reliable storage is hard. Stream processing is hard. Let's do one at a time.

Buffering Data – Message Brokers
[Diagram: three Publishers -> Message Queue -> three Subscribers]

[Architecture diagram: Source -> Buffer -> Stream Processing (Processing & Ingestion Engine) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or “streaming data platform”
[Diagram: a log with offsets 0–8; the Data Source appends to it, while Data Consumer A and Data Consumer B each read from their own position]

Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
[Diagram: a Data Source writes to a Topic split into Partition 0, Partition 1, and Partition 2, each its own log with offsets 0–8]
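The per-partition ordering described above can be sketched in a few lines. This is a minimal illustration, not Kafka's code: `partitionFor` uses `String.hashCode` as a simplified stand-in (Kafka's default partitioner hashes the serialized key with murmur2), and `KeyPartitioning`, `append`, and the `taxi-1`/`taxi-2` keys are hypothetical names.

```scala
object KeyPartitioning {
  // Simplified stand-in for a partitioner. Kafka's default partitioner hashes the
  // serialized key with murmur2; String.hashCode is used here only for illustration.
  def partitionFor(key: String, numPartitions: Int): Int =
    (key.hashCode & 0x7fffffff) % numPartitions

  // Appends each (key, value) to the log of the partition its key maps to.
  // Returns partition -> values in append order.
  def append(events: Seq[(String, String)], numPartitions: Int): Map[Int, Vector[String]] =
    events.foldLeft(Map.empty[Int, Vector[String]]) { case (logs, (key, value)) =>
      val p = partitionFor(key, numPartitions)
      logs.updated(p, logs.getOrElse(p, Vector.empty) :+ value)
    }
}
```

Because every event with the same key lands in the same partition, events for one taxi stay in order relative to each other, while nothing is promised about ordering between taxis that hash to different partitions.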

Consumers
In Our Architecture
[Diagram: Taxi Trip Data -> Producer -> Kafka topic taxi-trip-input -> four Stream Processing consumers: Analytic, Lookup, Search, Long Term]

Input Events
CMT,2009-01-05 08:31:55,2009-01-05 8:37:50,1,0.90000000000000002,-73.977936999999997,40.745919000000001,,,-73.983609000000001,40.755051000000002,Credit,5.2999999999999998,0,,0.79000000000000004,0,6.0899999999999999
vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,Start_Lon,Start_Lat,Rate_Code,store_and_forward,End_Lon,End_Lat,Payment_Type,Fare_Amt,surcharge,mta_tax,Tip_Amt,Tolls_Amt,Total_Amt
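A minimal sketch of turning one such input event into a typed record, assuming the 18-column layout listed in the header above. `TaxiTripParser` and `TaxiTrip` are hypothetical names, and only a handful of the columns are kept to keep the example short.

```scala
object TaxiTripParser {
  // A subset of the 18 input columns, chosen for illustration.
  final case class TaxiTrip(
      vendorName: String,
      pickupTime: String,
      dropoffTime: String,
      passengerCount: Int,
      tripDistance: Double,
      paymentType: String,
      totalAmount: Double)

  // Splits with limit -1 so that empty fields (",,") are kept in position.
  // Returns None for rows with the wrong column count or unparsable numbers.
  def parse(line: String): Option[TaxiTrip] = {
    val f = line.split(",", -1)
    if (f.length != 18) None
    else scala.util.Try(
      TaxiTrip(f(0), f(1), f(2), f(3).toInt, f(4).toDouble, f(11), f(17).toDouble)
    ).toOption
  }
}
```

Returning `Option` rather than throwing lets a streaming job route malformed rows to a separate place instead of failing the whole batch.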

Kafka Considerations – Reliability
▪ But remember there are tradeoffs…

Kafka Reliability – Replication
[Diagrams: a Producer writing to a single Broker that holds Partition1, Partition2, and Partition3 and is marked Leader]
Kafka Considerations – Reliability
▪ Different reliability levels for topics:
- Taxi Trip Data -> Kafka topic taxi-trip-input: 100%, dups are ok (“At least once”)
- Twitter -> Kafka topic customer-sentiment: <=100% (“At most once”)
News Flash: Kafka’s Exactly Once Producer is on the way

[Diagram: a Producer and two Brokers, each holding Partition1, Partition2, and Partition3, with one Leader label]

Kafka Reliability – Replication
[Diagrams: a Producer with two Brokers (two Leader labels), then with three Brokers, each holding Partition1, Partition2, and Partition3]

▪ So how does this relate to our application?
kafka-topics --zookeeper ZKHOST:ZKPORT --partitions 2 --replication-factor 3 \
  --create --topic taxi-trip-input
kafka-topics --zookeeper ZKHOST:ZKPORT --partitions 2 --replication-factor 1 \
  --create --topic customer-sentiment

Kafka Reliability – Producers
[Diagram: Taxi Trip Data -> Producer -> Kafka, with topic taxi-trip-input (Partitions 1–3) and Topic B (Partitions 1–3)]
▪ Producer is configured with acks=all
▪ Message failure? The producer resends the message

Kafka Reliability – Producers
▪ What about duplicates?
[Diagram: Taxi Trip Data -> Producer -> Kafka, with topic taxi-trip-input (Partitions 1–3) and Topic B (Partitions 1–3)]
ID    Message
1000  2009-01-04 03:02:00,1,2.629,...
1001  2009-01-04 03:38:00,3,4.549…
1001  2009-01-04 03:38:00,3,4.549…
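One common way to live with at-least-once delivery is to drop repeated IDs on the consuming side. The sketch below is an illustration under that assumption, not code from the talk; `Dedup`, `Message`, and `dedupe` are hypothetical names, and the payloads mirror the table above.

```scala
object Dedup {
  final case class Message(id: Long, payload: String)

  // Keeps the first message seen for each ID and drops later copies.
  // The set of seen IDs grows without bound here; a real consumer would
  // bound it (for example by time window) or rely on idempotent writes.
  def dedupe(messages: Seq[Message]): Vector[Message] = {
    val (_, kept) = messages.foldLeft((Set.empty[Long], Vector.empty[Message])) {
      case ((seen, out), m) =>
        if (seen.contains(m.id)) (seen, out) else (seen + m.id, out :+ m)
    }
    kept
  }
}
```

When the destination supports upserts by key, writing the same ID twice is harmless and no explicit de-duplication step is needed.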

Kafka Scaling – Partitions
[Diagram: Producer -> Kafka topic taxi-trip-input (Partitions 1–3) -> Consumer Group of three Consumers]

Kafka Scaling – Partitions
[Diagram: two Producers -> Kafka topic taxi-trip-input (Partitions 1–5) -> Consumer Group of five Consumers]
▪ Higher throughput
▪ More resources (memory)
▪ More resources (file handles)

How many partitions?
▪ Adding partitions late in the game is painful
▪ Basic formula:
total desired throughput / throughput of slowest consumer or producer
▪ Or ~25GB disk space
▪ Not too many because:
- Each partition takes broker heap memory and file handles
- Each partition slows down node shutdown / recovery
- 1000 – 4000 partitions per broker max
- Producers will produce smaller batches – lower throughput
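The basic formula above, worked as code. This is a minimal sketch: `PartitionSizing` and `partitionsFor` are hypothetical names, throughputs are assumed to be measured per partition in MB/s, and the result is rounded up because a fraction of a partition cannot be created.

```scala
object PartitionSizing {
  // partitions = total desired throughput / throughput of the slowest side,
  // where the slowest side is whichever of producer or consumer moves less data
  // per partition.
  def partitionsFor(targetMBps: Double, producerMBps: Double, consumerMBps: Double): Int = {
    require(targetMBps > 0 && producerMBps > 0 && consumerMBps > 0,
      "throughputs must be positive")
    val slowest = math.min(producerMBps, consumerMBps)
    math.ceil(targetMBps / slowest).toInt
  }
}
```

For example, a 100 MB/s target with producers at 20 MB/s and consumers at 10 MB/s per partition gives 10 partitions, which is still far below the per-broker limits listed above.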

Kafka Scaling – Producers
[Diagram: two Producers -> Kafka topic taxi-trip-input (Partitions 1–5) -> Consumer Group of five Consumers]

Guarding Against Message Loss
▪ Producer – What happens if the producer loses connection to Kafka and the buffer
overflows?
- You get an exception. You can choose to… block? Write to file?
▪ Source – What happens if events are lost before getting sent to producer?
- Once again use some kind of buffer to provide sufficient retention of data.
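The "write to file" option can be sketched as a bounded in-memory buffer that spills on overflow. This is an illustration only and does not use the Kafka client: `OverflowBuffer` and `SpillingBuffer` are hypothetical names, and the `spilled` collection stands in for a local file or other durable fallback.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object OverflowBuffer {
  // capacity: how many events fit in memory while the broker is unreachable.
  final class SpillingBuffer(capacity: Int) {
    private val inMemory = new ArrayBlockingQueue[String](capacity)
    // Stand-in for a local file or other durable fallback.
    val spilled: ArrayBuffer[String] = ArrayBuffer.empty[String]

    // offer() returns false instead of blocking when the queue is full,
    // so the caller decides what happens to the overflow.
    def add(event: String): Unit =
      if (!inMemory.offer(event)) spilled += event

    def buffered: Int = inMemory.size
  }
}
```

Swapping `offer` for the queue's blocking `put` gives the other option from the slide: the source is held back until space frees up, at the cost of stalling whatever feeds it.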

Stream Processing Considerations

[Architecture diagram: Processing (Custom Producer or Processing & Ingestion Engine) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

Streaming agenda
▪ What do we mean by streaming?
▪ Streaming use-cases
▪ Streaming semantics
▪ Which streaming engine to choose?
▪ Streaming in our use-case

What do we mean by streaming?
▪ Real-time: constant low milliseconds and under
▪ Near real-time: low milliseconds to seconds, delay in case of failures
▪ Batch: 10s of seconds or more, re-run in case of failures

But, there’s no free lunch
▪ Real-time: constant low milliseconds and under
▪ Near real-time: low milliseconds to seconds, delay in case of failures
▪ Batch: 10s of seconds or more, re-run in case of failures
▪ Toward real-time: “difficult” architectures, lower latency
▪ Toward batch: “easier” architectures, higher latency

Streaming Use-cases
▪ Ingestion (most relevant in our use-case)
▪ Simple transformations
- Decision (e.g. anomaly detection)
- Enrichment (e.g. add a state based on zipcode)
▪ Advanced usage
- Machine Learning
- Windowing
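Windowing, listed above under advanced usage, can be illustrated without any streaming engine. The sketch below counts events per fixed (tumbling) window by event time; `TumblingWindow`, `Event`, and `countPerWindow` are hypothetical names, and timestamps are assumed to be non-negative milliseconds.

```scala
object TumblingWindow {
  final case class Event(timestampMs: Long, key: String)

  // Assigns each event to the window [start, start + windowMs) that contains its
  // event time, then counts events per (window start, key).
  def countPerWindow(events: Seq[Event], windowMs: Long): Map[(Long, String), Int] =
    events
      .groupBy(e => (e.timestampMs - e.timestampMs % windowMs, e.key))
      .map { case (windowAndKey, es) => windowAndKey -> es.size }
}
```

A streaming engine does the same bookkeeping incrementally and also has to decide when a window is complete, which is where event time versus processing time starts to matter.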

#1 - Simple ingestion
[Diagram: Buffer -> Event e -> Stream Processing -> Event e -> Long term storage]

#2 - Enrichment
[Diagram: Buffer -> Event e -> Stream Processing (reads a Context store) -> Event e’ -> Storage]
e’ = enriched event e

#2 - Decision
[Diagram: Buffer -> Event e -> Stream Processing (applies Rules) -> Event e’ -> Storage]
e’ = e + decision

#3 – Advanced usage
[Diagram: Buffer -> Event e -> Stream Processing (uses a Model) -> Event e’ -> Storage]
e’ = aggregation or windowed aggregation

#1 – Simple Ingestion
1. Zero transformation
- No transformation, plain ingest
- Keep the original format – SequenceFile, Text, etc.
- Allows storing data that may have errors in the schema
2. Format transformation
- Simply change the format of the field
- To a structured format, say, Avro, for example
- Can do schema validation
3. Atomic transformation
- Mask a credit card number
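An atomic transformation touches one event at a time and needs no outside context. A minimal sketch of the masking example, assuming the usual convention of keeping the last four digits visible; `CardMasking` and `mask` are hypothetical names.

```scala
object CardMasking {
  // Replaces every digit except the last four with '*', leaving separators
  // such as '-' or ' ' untouched.
  def mask(cardNumber: String): String = {
    val totalDigits = cardNumber.count(_.isDigit)
    var seen = 0
    cardNumber.map { c =>
      if (c.isDigit) {
        seen += 1
        if (seen <= totalDigits - 4) '*' else c
      } else c
    }
  }
}
```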

#2 - Enrichment
[Diagram: Buffer -> Event e -> Stream Processing (reads a Context store) -> Event e’ -> Storage]
e’ = enriched event e
Need to store the context somewhere

Where to store the context?
1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)
- Local to Node (Off Process)
2. Partitioned Cache
- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
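Option 1, context held locally in the process, is the simplest to sketch. The example below enriches a trip with a state looked up by zip code, the enrichment mentioned earlier; `Enrichment`, `Trip`, `EnrichedTrip`, and the two-entry `zipToState` table are hypothetical and purely illustrative.

```scala
object Enrichment {
  final case class Trip(id: Long, pickupZip: String)
  final case class EnrichedTrip(id: Long, pickupZip: String, state: Option[String])

  // In-process (on-heap) copy of a small dimension table; contents are illustrative.
  val zipToState: Map[String, String] = Map("10001" -> "NY", "07030" -> "NJ")

  // e' = e plus whatever the context store knows about it. A missing entry is
  // kept as None rather than dropping the event.
  def enrich(trip: Trip, context: Map[String, String]): EnrichedTrip =
    EnrichedTrip(trip.id, trip.pickupZip, context.get(trip.pickupZip))
}
```

The same `enrich` function works unchanged if `context` is later backed by a partitioned cache or an external fetch; what changes is lookup latency and how the context is kept up to date.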

#1a - Locally broadcast cached data
Could be on heap or off heap

#1b - Off process cached data
Data is cached on the node, outside of the process, potentially in an external system like RocksDB

#2 - Partitioned cache data
Data is partitioned based on field(s) and then cached

#3 - External fetch
Data fetched from external system

Partitioned cache + external

Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve

Semantics of our architecture
[Diagram: Source Systems 1, 2, and 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system]

Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu

Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu
File based systems: de-duplication at file level
NoSQL, document based, and NoSQL-SQL systems: semantics at key/record level

Which streaming engine to choose?

[Architecture diagram: Processing (Processing & Ingestion Engine; engine labels include Apache Beam and Kafka Streams) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

Requirements
▪ Fault-tolerant and distributed
▪ Effectively once semantics
▪ Handle processing time vs. event time
▪ Allow stateful transformations

Spark Streaming
▪ Micro batch based architecture
▪ Allows stateful transformations
▪ Feature rich
- Windowing
- Sessionization
- ML
- SQL (Structured Streaming)

Spark Streaming
[Diagram: in each micro-batch (First Batch, Second Batch), Source -> Receiver -> RDD, followed by Filter -> Count -> Print in a single pass; the RDDs produced batch after batch form a DStream]
[Diagram: the same flow with state: a Pre-first Batch, then the First Batch and Second Batch, with Stateful RDD 1 and Stateful RDD 2 carried from one batch to the next; each RDD is made up of partitions]

Spark Streaming - Gaps
▪ Latency is not as low
- Efforts towards reducing latency, e.g. RISELab’s Drizzle
▪ Global consistent execution state
- Stop overall execution of distributed computation
- Eagerly persist records in transit meaning larger snapshots

Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)

Flink - ABS
[Diagram sequence: Operators, each with a Buffer]
1. Barrier 1A hit; Barrier 1B still behind
2. Both barriers hit
3. Barrier is combined and can move on; buffer can be flushed out
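The barrier sequence above can be simulated for a single operator with two inputs. This is a simplified sketch of the alignment idea, not Flink's implementation: `BarrierAlignment`, `Record`, `Barrier`, and `process` are hypothetical names, and taking the state snapshot is reduced to emitting a `BARRIER` marker downstream.

```scala
object BarrierAlignment {
  sealed trait Msg
  final case class Record(value: String) extends Msg
  case object Barrier extends Msg

  // arrivals: messages in arrival order, tagged with their input channel.
  // Returns what the operator sends downstream, in order.
  def process(arrivals: List[(String, Msg)], channels: Set[String]): List[String] = {
    val out = scala.collection.mutable.ListBuffer.empty[String]
    val buffered = scala.collection.mutable.ListBuffer.empty[String]
    var blocked = Set.empty[String] // channels whose barrier has already arrived
    arrivals.foreach {
      case (ch, Record(v)) =>
        // Records behind a barrier wait in the buffer until all barriers are in.
        if (blocked.contains(ch)) buffered += v else out += v
      case (ch, Barrier) =>
        blocked += ch
        if (blocked == channels) {
          out += "BARRIER"  // barriers combined: snapshot point, barrier moves on
          out ++= buffered  // buffer can be flushed out
          buffered.clear()
          blocked = Set.empty
        }
    }
    out.toList
  }
}
```

With arrivals `a1, barrier(A), a2, b1, barrier(B), b2`, record `a2` is held back until the barrier from B arrives, so everything before the combined barrier belongs to the snapshot and everything after it does not.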

Storm
▪ Old school
▪ Didn’t manage state – had to use Trident
▪ No good support for batch processing

Samza
▪ Good integration with Kafka
▪ Doesn’t support batch
▪ Forked by Kafka Streams

Flume
▪ Well integrated with the Hadoop ecosystem
▪ Allowed interceptors (for simple transformations)
▪ Supports buffering
- Memory
- File
- Kafka
▪ But no real fault-tolerance
▪ No state management

Kafka Streams
▪ Good integration with Kafka
▪ Light-weight library (not a framework)
▪ No micro-batching, uses Kafka as internal messaging layer
▪ Maintains local state per node (in RocksDB, or in an in-memory hash map)
▪ Handles late events
▪ Stream-to-stream joins

Kafka Streams architecture
[Diagram: a Topic (Partitions 1–2) read by Task 1 and Task 2, and a Re-partition topic (Partitions 1–2) read by Task 3 and Task 4]

Apache Beam
▪ Abstraction on top of Streaming Engines
▪ Best support for Google Dataflow

Others
▪ Apache Apex
▪ Heron

Spark Streaming
▪ We chose Spark Streaming because:
- Same execution engine for batch and streaming
- Similar code for batch and streaming
- Support for security, Kafka integration
- Thriving community

[Architecture diagram: Processing -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

Structured Landing Zones
▪ Hive Relational Model (Kudu/HDFS): traditional SQL
▪ Hive Nested Model (HDFS): optimized for nested structures like JSON
▪ Aggregations (Kudu): optimized for storing and mutating aggregates
▪ HBase Entity Time Series: optimized for Entity 360 and time-based access
▪ Solr: optimized for faceted charts and reverse index lookups

Relational
▪ Everyone knows it
▪ Simple
▪ Large joins are very painful
▪ May lead to customers making bad queries
▪ Easier to mutate

Kudu Data Models
▪ Entity Summary Tables
- Quick update and access of aggregate of Entity Stats
▪ Event Tables
- Number of Partitioning strategies
- Partition by Entity
- Partition by Hash on time

Kudu: Table Creation Example
CREATE EXTERNAL TABLE ny_taxi_trip (
  vender_id STRING,
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  passenger_count INT,
  trip_distance DOUBLE,
  pickup_longitude DOUBLE,
  pickup_latitude DOUBLE,
  rate_code_id STRING,
  store_and_fwd_flag STRING,
  dropoff_longitude DOUBLE,
  dropoff_latitude DOUBLE,
  payment_type STRING,
  fare_amount DOUBLE,
  extra DOUBLE,
  mta_tax DOUBLE,
  improvement_surcharge DOUBLE,
  tip_amount DOUBLE,
  tolls_amount DOUBLE,
  total_amount DOUBLE
)
STORED AS PARQUET
LOCATION 'usr/root/hive/ny_taxi_trip';

Kudu: Data Population
SparkStreamingTaxiTripToKudu.scala

Kudu: REST API
KuduServiceLayer.scala

View Strategies
Models
▪ Hive Relational Model
▪ Hive Nested Model
Views
▪ Hive Normal Views: use only for tables that filter records/columns or use for marking fields
▪ Hive Materialized Table Views: use in the cases where the view requires a join that is done through a shuffle

Nested
▪ Less Space than Denormalization
▪ Still have tables but the cost of joins is all but gone
▪ Also great for cartesian joins
- N x M vs N + M
▪ Not really supported yet with Kudu or HBase with SQL

Nested Writing Example in Spark
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters":
    {
      "batter":
        [
          { "id": "1001", "type": "Regular" },
          { "id": "1002", "type": "Chocolate" },
          { "id": "1003", "type": "Blueberry" },
          { "id": "1004", "type": "Devil's Food" }
        ]
    },
  "topping":
    [
      { "id": "5001", "type": "None" },
      { "id": "5002", "type": "Glazed" },
      { "id": "5005", "type": "Sugar" },
      { "id": "5007", "type": "Powdered Sugar" },
      { "id": "5006", "type": "Chocolate with Sprinkles" }
    ]
}

Nested Writing Example in Spark
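// Infer a nested schema from an RDD of JSON strings
val jsonDF = hiveContext.read.json(jsonRDD)
// Write the nested rows out as Parquet files
jsonDF.write.parquet("./parquet")
// Register those files as an external table that can be queried with SQL
hiveContext.createExternalTable("jsonNestedTable", "./parquet")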

Nested: Taxi Example
KuduToNestedHDFS.scala

Entity Centric Time Series
▪ Partition by Entity ID
▪ Order by Time
▪ Allows for free windowing
▪ Allows for fetching of single time window of single entity at web scale

HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40

[Diagram: the same rows; a REST call is served by a short scan]
[Diagrams: the same rows; Mappers scan them in bulk]

HBase: Row Key Example
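import org.apache.commons.lang.StringUtils // commons-lang 2.x assumed; lang3 offers the same leftPad
import org.apache.hadoop.hbase.util.Bytes

// Row key layout: <salt>:<customerId>:<zero-padded timestamp>:trans:<transId>
def generateRowKey(customerTrans: CustomerTran, numOfSalts: Int): Array[Byte] = {
  // The salt spreads customers across regions; zero-padding keeps keys sortable as strings
  val salt = StringUtils.leftPad(
    Math.abs(customerTrans.customerId.hashCode % numOfSalts).toString, 4, "0")
  Bytes.toBytes(salt + ":" +
    customerTrans.customerId + ":" +
    StringUtils.leftPad(customerTrans.eventTimeStamp.toString, 11, "0") + ":trans:" +
    customerTrans.transId)
}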
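The same key layout can be tried out without HBase on the classpath. This is a dependency-free sketch for experimenting with the design, not the code from the repository: `SaltedRowKey` and `rowKey` are hypothetical names, and the key is returned as a `String` rather than bytes.

```scala
object SaltedRowKey {
  // <4-digit salt>:<entityId>:<11-digit zero-padded time>:trans:<eventId>
  def rowKey(entityId: String, eventTimeMs: Long, eventId: String, numSalts: Int): String = {
    // Same entity -> same salt, so all of an entity's rows share one key prefix.
    val salt = math.abs(entityId.hashCode % numSalts)
    "%04d".format(salt) + ":" + entityId + ":" + "%011d".format(eventTimeMs) + ":trans:" + eventId
  }
}
```

Since an entity's keys share a prefix and the timestamp is zero-padded, sorting the keys as strings puts one entity's events next to each other in time order, which is what makes a short scan over a single time window possible.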

HBase: Population Example
SparkStreamingTaxiTripToHBase.scala

HBase: REST Example
HBaseServiceLayer.scala

Cassandra Time Series
CREATE TABLE temperature (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time)
);

Cassandra Time Series
INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (…);

SELECT event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD';

Solr: Data Model
▪ Think of it like a cube on an object type
- In our case a taxi trip
- Allows for rollups and aggregations from object’s point of view
- Think of objects as immutable
- Try to find time based events
- May design more than one object type

Solr Details
ID  Document  Live
1   Trip:101  1
2   Trip:102  1
3   Trip:103  1
4   Trip:104  1
5   Trip:105  1

ID  Field Value  Documents
1   Cash         1,3
2   Credit       2
3   Debit        4,5

Single Value Aggregations
▪ Get Array Lengths
ID Field Value Documents
1 Cash 1,3
2 Credit 2
3 Debit 4,5
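"Get array lengths" means a facet count falls straight out of the inverted index, with no documents read. A minimal sketch using the table above; `FacetCounts`, `paymentType`, and `counts` are hypothetical names, and this models the idea rather than Solr's internals.

```scala
object FacetCounts {
  // Inverted index for one field: value -> sorted IDs of the matching documents.
  val paymentType: Map[String, Vector[Int]] =
    Map("Cash" -> Vector(1, 3), "Credit" -> Vector(2), "Debit" -> Vector(4, 5))

  // The count for each value is just the length of its document list.
  def counts(index: Map[String, Vector[Int]]): Map[String, Int] =
    index.map { case (value, docs) => value -> docs.length }
}
```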

Multi Value Aggregations
▪ Ordered Merge Join
- Think like a zipper
- Scans
- No Lookups
▪ Top N from both sides
- Leaving the rest to other
▪ Indexes distributed
▪ No need to read document data
Cash:     1 4 5 7 8 9 10 14 16
Credit:   2 3 6 11 12 13 15 17 18
Vender A: 1 2 3 6 7 8 10 15 18
Vender B: 4 5 9 11 12 13 14 16 17
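The "zipper" can be sketched as a two-pointer walk over two sorted document-ID lists, using the lists from the slide. `PostingListJoin` and `intersectCount` are hypothetical names; this shows the ordered merge join idea only, not Solr's implementation.

```scala
import scala.annotation.tailrec

object PostingListJoin {
  val cash = Vector(1, 4, 5, 7, 8, 9, 10, 14, 16)
  val credit = Vector(2, 3, 6, 11, 12, 13, 15, 17, 18)
  val venderA = Vector(1, 2, 3, 6, 7, 8, 10, 15, 18)
  val venderB = Vector(4, 5, 9, 11, 12, 13, 14, 16, 17)

  // Counts IDs present in both sorted lists by advancing whichever side is
  // behind: sequential scans only, no lookups, no document data read.
  def intersectCount(a: Vector[Int], b: Vector[Int]): Int = {
    @tailrec
    def loop(i: Int, j: Int, acc: Int): Int =
      if (i >= a.length || j >= b.length) acc
      else if (a(i) == b(j)) loop(i + 1, j + 1, acc + 1)
      else if (a(i) < b(j)) loop(i + 1, j, acc)
      else loop(i, j + 1, acc)
    loop(0, 0, 0)
  }
}
```

Running it over each payment type and vender pair gives the two-dimensional breakdown: 4 Cash trips for Vender A, 5 Cash trips for Vender B, 5 Credit trips for Vender A, and 4 Credit trips for Vender B.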

Solr: Population Example
SparkStreamingTaxiTripToSolR.scala

Storage
Processing
Access

Batch Processing Considerations

Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles
▪ In our use-case,
- Kudu -> HDFS Nested is batch processing
- KMeans calculation is also done in batch

Batch processing options
▪ Spark (+ MLlib)
▪ MapReduce (+ Mahout)
▪ Flink (+ Flink ML)

Spark
▪ Pretty popular
▪ Much faster than MapReduce
▪ Thriving community

MapReduce
▪ Sloooooow

Flink
▪ Pretty popular
▪ Batch is a special case of Streaming
▪ Developing community

In our use-case
▪ We chose Spark
- We were using Spark Streaming anyways
- Similar code between Spark and Spark Streaming
- Thriving community

Interactive Data Access Considerations

Types of data access
▪ REST server/APIs for querying entities and aggregates
▪ UI for displaying search facets
▪ SQL engine

Why have REST server?
▪ Tired of business people telling us how to access data
▪ Serves as an interface between the data engineers and business folks
▪ Lets business folks decide access patterns
▪ Engineers to optimize those patterns
▪ Brownie points from your boss
▪ And, it’s not that difficult to write!

Don’t believe me?
import com.sun.jersey.spi.container.servlet.ServletContainer
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
…
val server = new Server(port)
// Jersey servlet that scans the given package for resource classes
val sh = new ServletHolder(classOf[ServletContainer])
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
  "com.sun.jersey.api.core.PackagesResourceConfig")
sh.setInitParameter("com.sun.jersey.config.property.packages",
  "com.hadooparchitecturebook.taxi360.server.hbase")
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true")
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*")
server.start()
server.join()

Then, write a ServiceLayer
@GET
@Path("vender/{venderId}/timeline")
@Produces(Array(MediaType.APPLICATION_JSON))
def getTripTimeLine (@PathParam("venderId") venderId:String,
@QueryParam("startTime") startTime:String = Long.MinValue.toString,
@QueryParam("endTime") endTime:String = Long.MaxValue.toString):
Array[NyTaxiYellowTrip] = {

Use REST! Say no to business people!
▪ Access data like so:
http://<serverURL>:8080/vender/{venderId}/timeline

UI requirements
Something that can
▪ Represent search results really well
▪ Integrates with Apache Solr on Hadoop

UI options
▪ Hue
▪ Banana
▪ Kibana

We chose Hue
▪ Because it’s included
▪ Please look at the others

SQL engine criteria
▪ Low latency SQL access
▪ Allows for high concurrency
▪ JDBC/ODBC integration
▪ Capable of large scale aggregation
▪ Optionally integrates with Kudu for real-time updates to SQL tables

Apache Hive
▪ Good JDBC integration
▪ Not really low latency, even when using Tez
▪ Doesn’t integrate with Kudu
▪ Can run with MapReduce, Spark, or Tez

Presto
▪ Low latency SQL engine from Facebook
▪ Provides JDBC/ODBC access
▪ Is only in-memory, large aggregations can lead to OOM errors

Apache Impala
▪ Low latency SQL access
▪ Excellent concurrency support
▪ Integrates with Kudu for real-time SQL

Apache Drill
▪ Similar in architecture to Impala

Spark SQL
▪ Builds on top of Spark
▪ JDBC/ODBC access only via Spark Thrift Server
- Doesn’t scale well with larger number of concurrent users
- Doesn’t fully provide secure access.

We chose
▪ Spark SQL
▪ Impala

[Architecture diagram recap: Processing (Processing & Ingestion Engine) -> Storage (Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup) -> Access (Batch Processing, SQL, NRT REST, NRT Dashboard)]

High Level of the Demo Design
[Diagram: Producer -> Kafka (Topic Foo, Partitions 1–3) -> four Spark Streaming jobs, one per store]
▪ Spark Streaming -> Kudu -> SQL, REST
▪ Spark Streaming -> HBase -> REST
▪ Spark Streaming -> Solr -> Hue
▪ Spark Streaming -> HDFS -> SQL
Other Sessions
▪ Ask Us Anything session (Mark and Jonathan) – Thursday, 11:50 AM
▪ Stream me up, Scotty: Transitioning to the cloud using a streaming data platform
(Gwen) – Wednesday, 2:40 PM
▪ One cluster does not fit all: Architecture patterns for multicluster Apache Kafka
deployments (Gwen) – Thursday, 2:40 PM
▪ Ask me Anything (Gwen) – Thursday, 4:20

Thank you!
@hadooparchbook
tiny.cloudera.com/app-arch-sanjose
Jonathan Seidman | @jseidman
Ted Malaska | @ted_malaska
Mark Grover | @mark_grover
Gwen Shapira | @gwenshap
