Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Hadoop Application Architectures: Architecting a Next Generation Data Platform
Strata + Hadoop World, New York 2016
tiny.cloudera.com/app-arch-ny
tiny.cloudera.com/app-arch-questions
Mark Grover | @mark_grover
Ted Malaska | @ted_malaska
Jonathan Seidman | @jseidman
About the presenters
Ted Malaska
▪ Principal Solutions Architect at Cloudera
▪ Previously, lead architect at FINRA
▪ Contributor to Apache Hadoop, HBase, Flume, Avro, Pig, Spark, YARN, Sqoop, Kudu, Kafka
About the presenters
Jonathan Seidman
▪ Partner Software Engineer at Cloudera
▪ Contributor to Apache Sqoop
▪ Previously, Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data
About the presenters
Mark Grover
▪ Software Engineer on Spark at Cloudera
▪ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
▪ Contributor to Apache Spark, Hadoop, Hive, Sqoop, Pig, Flume
Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
High level architecture
[Diagram: pipeline stages Source, Transport, Stream Processing, Storage, Access. Labeled components: Custom Producer or Processing & Ingestion Engine; Nested Tables; Indexed Cube; Relational Tables; Entity Time Series Lookup; Batch Processing; SQL; NRT REST; NRT Dashboard.]
Buffering Data – Flume vs. Kafka
▪ Flume – well integrated with Hadoop.
- Great choice when ingesting data into HDFS.
- Can support simple transformations.
- Less coding.
▪ But…
- Interface between Kafka and the streaming layer is already well defined.
- Transformations are done in the stream processing layer.
- We need a more general purpose system at this layer.
Kafka Message Loss
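▪ Kafka
- Messages can be lost when the number of broker failures is >= the number of acknowledgements
- Use a higher ack level
▪ Producer
- Messages can be lost if the producer is disconnected from Kafka and its buffer overflows
- Consider a producer side buffer (e.g. Flume)
▪ Source
- Messages can be lost if the source does not retain data long enough for the producer to consume it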
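As a rough illustration of "use a higher ack level", the sketch below builds producer settings with standard Kafka producer configuration keys. The object name, method name, broker address, and the specific buffer and timeout values are assumptions chosen for the example, not settings from the talk.

```scala
import java.util.Properties

object ProducerSettings {
  // Builds producer settings that favor durability over latency.
  // All values here are illustrative assumptions, not recommendations from the talk.
  def durableProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", bootstrapServers)
    // Highest ack level: wait for all in-sync replicas to acknowledge each write.
    props.setProperty("acks", "all")
    // Retry transient send failures instead of giving up on the record.
    props.setProperty("retries", Int.MaxValue.toString)
    // Bound the producer-side buffer (bytes) and how long a send may block when it is full.
    props.setProperty("buffer.memory", "67108864")
    props.setProperty("max.block.ms", "60000")
    props
  }
}
```

These properties would be passed to a Kafka producer constructor; the buffer settings only delay overflow, which is why the slide also suggests a separate producer side buffer.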
What do we mean by streaming?
▪ Real-time: constant low milliseconds & under
▪ Near real-time: low milliseconds to seconds, delay in case of failures
▪ Batch: 10s of seconds or more, re-run in case of failures
But, there’s no free lunch
▪ Real-time: constant low milliseconds & under
▪ Near real-time: low milliseconds to seconds, delay in case of failures
▪ Batch: 10s of seconds or more, re-run in case of failures
▪ “Difficult” architectures, lower latency
▪ “Easier” architectures, higher latency
#1 – Simple Ingestion
1. Zero transformation
- No transformation, plain ingest
- Keep the original format – SequenceFile, Text, etc.
- Allows storing data that may have schema errors
2. Format transformation
- Simply change the format of the field
- To a structured format such as Avro
- Can do schema validation
3. Atomic transformation
- Mask a credit card number
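An atomic transformation touches one record at a time and needs no outside context. The sketch below shows one possible masking function; the name maskCard and the choice to keep the last four digits visible are assumptions made for illustration.

```scala
object CardMasking {
  // Replaces every digit except the last `visible` digits with '*'.
  // Non-digit characters (spaces, dashes) are kept so the layout survives.
  def maskCard(card: String, visible: Int = 4): String = {
    val totalDigits = card.count(_.isDigit)
    var seen = 0 // number of digits encountered so far
    card.map { c =>
      if (c.isDigit) {
        seen += 1
        if (seen <= totalDigits - visible) '*' else c
      } else c
    }
  }
}
```

Because the function depends only on its input, it can run inside a stream processing step without any lookup or state.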
Where to store the context?
1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)
- Local to Node (Off Process)
2. Partitioned Cache
- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
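The first option, dimension data cached locally in the process, can be sketched as a plain in-memory map lookup. The Trip and EnrichedTrip types, the vendor fields, and the "unknown" fallback are hypothetical names introduced for this example only.

```scala
object LocalContext {
  // Hypothetical event and enriched-event types, for illustration only.
  final case class Trip(vendorId: String, fare: Double)
  final case class EnrichedTrip(vendorId: String, vendorName: String, fare: Double)

  // Joins each event against an in-process (on-heap) copy of the dimension data.
  // Unknown keys fall back to a marker value instead of dropping the event.
  def enrich(trips: Seq[Trip], vendors: Map[String, String]): Seq[EnrichedTrip] =
    trips.map { t =>
      EnrichedTrip(t.vendorId, vendors.getOrElse(t.vendorId, "unknown"), t.fare)
    }
}
```

This works when the dimension data is small enough to copy to every process; the partitioned cache and external fetch options trade that simplicity for capacity.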
Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)
Flume
▪ Well integrated with the Hadoop ecosystem
▪ Allows interceptors (for simple transformations)
▪ Supports buffering
- Memory
- File
- Kafka
▪ But no real fault-tolerance
▪ No state management
Spark Streaming
▪ We chose Spark Streaming because:
- Same execution engine for batch and streaming
- Similar code for batch and streaming
- Support for security, Kafka integration
- Thriving community
- We don’t have low millisecond requirements
High level architecture
[Diagram: pipeline stages Source, Transport, Stream Processing, Storage, Access. Labeled components: Custom Producer; Nested Tables; Indexed Cube; Relational Tables; Entity Time Series Lookup; Batch Processing; SQL; NRT REST; NRT Dashboard.]
Structured Landing Zones
▪ Hive Relational Model (Kudu/HDFS): traditional SQL
▪ Hive Nested Model (HDFS): optimized for nested structures like JSON
▪ Aggregations (Kudu): optimized for storing and mutating aggregates
▪ HBase Entity Time Series: optimized for Entity 360 and time-based access
▪ Solr: optimized for faceted charts and reverse index lookups
Kudu Data Models
▪ Entity Summary Tables
- Quick update and access of aggregate of Entity Stats
▪ Event Tables
- A number of partitioning strategies
- Partition by Entity
- Partition by Hash on time
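One plausible reason to hash on time is to spread events with adjacent timestamps across partitions rather than sending them all to the same one. The sketch below illustrates only that idea: bucketFor is a hypothetical helper, and the hash it uses is Java's Long.hashCode, not Kudu's internal hash function.

```scala
object HashBuckets {
  // Maps an event timestamp to one of `numBuckets` buckets.
  // Assumption: Java's Long.hashCode stands in for the storage engine's own hash.
  // floorMod keeps the result non-negative even for negative hash values.
  def bucketFor(eventTimeMillis: Long, numBuckets: Int): Int =
    Math.floorMod(java.lang.Long.hashCode(eventTimeMillis), numBuckets)
}
```

Partitioning by entity instead keeps all events for one entity together, which favors per-entity reads over evenly spread writes.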
View Strategies
▪ Models: Hive Relational Model, Hive Nested Model
▪ View types: Hive Normal Views, Hive Materialized Table Views
▪ Use in the cases where the view requires a join that is done through a shuffle
▪ Use only for tables that filter records/columns or use for marking fields
Nested
▪ Less Space than Denormalization
▪ Still have tables but the cost of joins is all but gone
▪ Also great for Cartesian joins
- N x M vs N + M
▪ Not really supported yet with Kudu or HBase with SQL
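The "N x M vs N + M" point can be made concrete with a small sketch. The Customer type and its orders and tickets fields are hypothetical; they stand for any entity with two independent child collections of sizes N and M.

```scala
object NestedVsFlat {
  // Hypothetical entity with two independent child collections.
  final case class Customer(id: String, orders: Seq[String], tickets: Seq[String])

  // Flattening both child collections into one table pairs every order
  // with every ticket, giving N x M rows.
  def flatRows(c: Customer): Seq[(String, String, String)] =
    for (o <- c.orders; t <- c.tickets) yield (c.id, o, t)

  // The nested form stores each child once inside the parent record,
  // giving N + M elements.
  def nestedElements(c: Customer): Int = c.orders.size + c.tickets.size
}
```

With 3 orders and 2 tickets the flattened form has 6 rows while the nested record holds 5 elements, and the gap widens quickly as either collection grows.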
Solr: Data Model
▪ Think of it like a cube on an object type
- In our case a taxi trip
- Allows for rollups and aggregations from object’s point of view
- Think of objects as immutable
- Try to find time based events
- May design more than one object type
High level architecture
[Diagram: pipeline stages Source, Transport, Stream Processing, Storage, Access. Labeled components: Custom Producer; Nested Tables; Indexed Cube; Relational Tables; Entity Time Series Lookup; Batch Processing; SQL; NRT REST; NRT Dashboard.]
Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles
▪ In our use-case,
- Kudu -> HDFS Nested is batch processing
- KMeans calculation is also in batch
Why have a REST server?
▪ Tired of business people telling us how to access data
▪ Serves as an interface between the data engineers and business folks
▪ Lets business folks decide access patterns
▪ Lets engineers optimize those patterns
▪ Brownie points from your boss
▪ And, it’s not that difficult to write!
Don’t believe me?
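import com.sun.jersey.spi.container.servlet.ServletContainer
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
…
// Embedded Jetty server listening on the configured port
val server = new Server(port)
// Jersey servlet that dispatches requests to the REST resource classes
val sh = new ServletHolder(classOf[ServletContainer])
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
  "com.sun.jersey.api.core.PackagesResourceConfig")
// Package that Jersey scans for resource classes
sh.setInitParameter("com.sun.jersey.config.property.packages",
  "com.hadooparchitecturebook.taxi360.server.hbase")
// Map POJOs to and from JSON
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true")
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*")
// Start the server and block until it stops
server.start()
server.join()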
SQL engine criteria
▪ Low latency SQL access
▪ Allows for high concurrency
▪ JDBC/ODBC integration
▪ Capable of large scale aggregation
▪ Optionally integrates with Kudu for real-time updates to SQL tables