Performant XML processing in a distributed environment is a major challenge. We will dive into a cost-effective pipeline that extracts, loads and transforms XML on a platform built for performance and flexibility. Hadoop, Spark and Impala all feature in this session about bringing commodity software to a market dominated by proprietary solutions.
What can we achieve by processing non-splittable data in a distributed fashion? I will talk about the motivation behind our research and show how we evolved a solution to cope with an ever-changing environment. Stepping through the solution, I will show how you can strip away the restrictions of XML and load the data onto Hadoop, ready for analysis at scale in both an ad hoc and a modelling fashion.
Evaluating and evolving this solution is paramount. Load- and throughput-testing methods are highlighted, along with guidance on tuning both the pipeline and the Hadoop platform to ensure your solution is optimised for a dynamic environment.
61. Reading XML in Spark
Keep things simple
Use the XML input format from Hadoop Streaming (see the sketch below)
Inputs are split from opening to closing tag
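As a minimal sketch of this read path (not the definitive implementation), the old mapred API can be driven from Spark with the hadoop-streaming jar on the classpath; the <message> tag and input path here are illustrative:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
    import org.apache.hadoop.streaming.StreamInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object XmlReadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("xml-read-sketch"))

        val jobConf = new JobConf()
        // Delimit each record by the opening and closing tag of one message
        jobConf.set("stream.recordreader.class",
          "org.apache.hadoop.streaming.StreamXmlRecordReader")
        jobConf.set("stream.recordreader.begin", "<message>") // illustrative tag
        jobConf.set("stream.recordreader.end", "</message>")
        FileInputFormat.addInputPaths(jobConf, "hdfs:///data/raw/xml") // illustrative path

        // StreamXmlRecordReader places each complete <message>...</message>
        // block in the record key; the value is left empty
        val xml = sc.hadoopRDD(jobConf,
          classOf[StreamInputFormat], classOf[Text], classOf[Text])

        val messages = xml.map { case (key, _) => key.toString }
        println(s"messages read: ${messages.count()}")
      }
    }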
63. Avro Support in Spark
Use Kryo Serialisation for correct Avro support (see the sketch below)
Avro Serialisation and Avro data format
Avro Serialisation and Parquet data format
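A minimal configuration sketch, assuming Spark 1.5+ where SparkConf.registerAvroSchemas wires up a Kryo serialiser for Avro generic records; the schema itself is illustrative:

    import org.apache.avro.Schema
    import org.apache.spark.{SparkConf, SparkContext}

    object AvroKryoSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative Avro schema for one of the extracted record types
        val schema = new Schema.Parser().parse(
          """{"type":"record","name":"Customer","fields":[
            |  {"name":"id","type":"string"},
            |  {"name":"name","type":"string"}
            |]}""".stripMargin)

        val conf = new SparkConf()
          .setAppName("avro-kryo-sketch")
          // Kryo handles Avro records correctly where Java serialisation does not
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Pre-registering schemas avoids shipping the full schema with every record
          .registerAvroSchemas(schema)

        val sc = new SparkContext(conf)
        // ... build GenericRecords here and shuffle them safely ...
        sc.stop()
      }
    }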
64. Design Extractors
In this case we want to turn one XML message into 5 different Avro records
5 extraction classes should be created (one is sketched below)
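One way those five classes could be shaped; a sketch assuming a shared Extractor trait, with Customer, Id and Name as purely illustrative tag and field names:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}

    // Hypothetical contract: one XML message in, zero or more Avro records out.
    // Five concrete implementations would cover the five target Avro types.
    trait Extractor extends Serializable {
      def extract(xmlMessage: String): Seq[GenericRecord]
    }

    // One of the five extractors
    class CustomerExtractor(schemaJson: String) extends Extractor {
      // Avro Schema objects are not reliably serialisable, so rebuild
      // the schema from its JSON on each executor
      @transient private lazy val schema: Schema =
        new Schema.Parser().parse(schemaJson)

      override def extract(xmlMessage: String): Seq[GenericRecord] = {
        val doc = scala.xml.XML.loadString(xmlMessage)
        (doc \\ "Customer").map { node =>
          val record = new GenericData.Record(schema)
          record.put("id", (node \ "Id").text)
          record.put("name", (node \ "Name").text)
          record: GenericRecord
        }
      }
    }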
65. Design Extractors - cont
Use DOM/SAX if you have no definitive XSD for the XML
DOM is acceptable as the data is already in memory (see the sketch below)
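A minimal DOM sketch using the JDK's built-in parser: because each record has already been split out by the input format, it is a small in-memory string, so building a full document tree is cheap. Tag names are illustrative:

    import java.io.ByteArrayInputStream
    import java.nio.charset.StandardCharsets
    import javax.xml.parsers.DocumentBuilderFactory
    import org.w3c.dom.Document

    object DomSketch {
      // Parse one already-extracted XML message into a DOM tree
      def parse(xmlMessage: String): Document = {
        val factory = DocumentBuilderFactory.newInstance()
        factory.setNamespaceAware(true)
        factory.newDocumentBuilder().parse(
          new ByteArrayInputStream(xmlMessage.getBytes(StandardCharsets.UTF_8)))
      }

      def main(args: Array[String]): Unit = {
        val doc = parse("<message><Customer><Id>42</Id></Customer></message>")
        // With no definitive XSD, navigate by tag name instead of schema binding
        val ids = doc.getElementsByTagName("Id")
        println(ids.item(0).getTextContent) // prints 42
      }
    }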