Efficient processing of large and complex XML documents in Hadoop

Sujoe Bose
Senior Principal, Sabre Holdings
June 2013

Presentation Outline

§  Motivation
§  ETL vs. ELT
§  Avro Format
§  Mapping from XML to Avro
§  Interfaces to access Avro
§  Performance and Storage considerations
§  Other types of storage/processing formats

confidential

You will learn about …

§  A method to store and process complex XML data in Hadoop as Avro files
§  Interfaces to access and analyze data in Avro from Hive, Java and Pig
§  Variations of the method and their relative trade-offs in storage and processing

Motivation

§  Prevalence of XML and its derivatives
–  Spurred by Web Services and SOA
–  Preferred communication format until newer formats entered
–  Data and logs represented in XML
§  XML – metadata combined with data
–  Flexibility vs. Complexity
§  Could be arbitrarily nested and large
§  Volumes of documents – Big Data

Challenges

§  Parsing XML is CPU-intensive
§  Certain parsers/parsing methods result in more memory consumption
§  Repeated parsing for each query
§  Large and deeply nested XML documents make the problem worse
§  Presence of tags in data results in high I/O due to storage size
§  Special handling of optional fields

ETL vs. ELT

§  Hadoop generally built for EL – T
–  aka Schema-on-Read
–  Load as-is
–  Transform on Access/Query
§  Compare with Data Warehouse ETL
–  aka Schema-on-Write
–  Transform and Load
–  Queries are a lot simpler
–  Transformation and cleansing done a priori

Mix of ETL and ELT

ELT
§  Generally better in flexibility
§  More suitable for simpler and well-defined formats
§  More applicable for experimentation
§  XML data parsed on demand for every query

ETL
§  Generally better in performance
§  More suitable when substantial cleansing and reformatting is needed
§  Repetitive queries and production workloads
§  XML data pre-parsed to minimize resource usage

Approaches

[Diagram. Data layer: XML Files, Avro Files (with Avro Schema). Processing layer: On-demand Parsing (ELT), Pre-parsing (ETL). Interfaces layer: Pig UDF, Hive SerDe and MapReduce for each path. The slide is shown three times, highlighting the ELT path, the ETL path, and both.]

XML Pre-parsing

§  Nested Elements and Attributes
§  Representation of parsed XML Structure
§  Enter Avro!

Avro

§  Data serialization system
§  Specifically designed for Hadoop, but also used in other environments
§  Rich data structures: Arrays, Records, Maps etc.
§  Compact, fast, binary data format
§  Metadata stored at file level – not record level
§  Splittable – ideal for MapReduce

Avro APIs

§  Generic Objects and Pre-generated Objects
–  Easy API including simple gets and puts
§  APIs in several languages
–  Java
–  C#
–  C/C++
–  Python
–  Ruby

Use-case

§  FIXML – Financial Information eXchange
–  http://www.fixprotocol.org/specifications/
§  XML Database Benchmark
–  http://tpox.sourceforge.net/
§  Provides sample data for benchmarking
§  Data Generator for generating large and predictable datasets

FIXML

§  XML Data Generator
–  http://tpox.sourceforge.net/tpoxdata.htm
§  Order: Buy and sell order of securities

Simple mapping

XML                                          Avro     Pig
Elements with repeated nested elements       Array    Bag
Elements with attributes and text elements   Record   Tuple
Attributes and Text Elements                 Field    Field
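The three mapping rules can be illustrated with a small Python sketch. This is not the converter used in the talk; it is a minimal illustration that uses Python dicts for records and lists for arrays. The sample XML (including the `Pty` element) and the `text` field name are assumptions made for the example, and the sketch infers "repeated" from the data itself, whereas a real converter would normally take that from a schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_to_value(elem):
    """Convert one XML element following the three rules of the mapping table."""
    # Text element without attributes or children: a plain field value.
    if len(elem) == 0 and not elem.attrib:
        return elem.text or ""
    # Element with attributes and/or children: a record (a dict here).
    record = dict(elem.attrib)  # attributes become fields
    if elem.text and elem.text.strip():
        record["text"] = elem.text.strip()  # hypothetical field name for text content
    counts = Counter(child.tag for child in elem)
    for child in elem:
        value = element_to_value(child)
        if counts[child.tag] > 1:
            # Repeated nested elements become an array (a list here).
            record.setdefault(child.tag, []).append(value)
        else:
            record[child.tag] = value
    return record

# Made-up sample, loosely shaped like a FIXML order.
sample = '<Order ID="1" Acct="A7"><Pty ID="x"/><Pty ID="y"/><OrdQty Qty="100"/></Order>'
order = element_to_value(ET.fromstring(sample))
print(order)
```

The sketch does not handle an attribute and a child element that share a name, and it ignores XML namespaces.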

Avro Schema

{
  "type": "record",
  "name": "FIXOrder",
  "namespace": "com.sabre.fixml",
  "doc": "Definition and mapping for FIX Orders",
  "mapping": "/FIXML",
  "fields":
  [
    { "name":"v", "type":"string", "mapping":"@v"},
    { "name":"r", "type":"string", "mapping":"@r"},
    { "name":"s", "type":"string", "mapping":"@s"},
    { "name":"Order", "mapping":"Order", "type":
      {
        "name":"OrderRecord", "mapping":"Order", "type": "record", "fields":
        [
          { "name":"ID", "type":"string", "mapping":"@ID"},
          { "name":"ID2", "type":"string", "mapping":"@ID2"},
          { "name":"OrignDt", "type":"string", "mapping":"@OrignDt"},
          { "name":"TrdDt", "type":"string", "mapping":"@TrdDt"},
          { "name":"Acct", "type":"string", "mapping":"@Acct"},
          { "name":"AcctTyp", "type":"string", "mapping":"@AcctTyp"},
          { "name":"DayBkngInst", "type":"string", "mapping":"@DayBkngInst"},
          { "name":"BkngUnit", "type":"string", "mapping":"@BkngUnit"},
          { "name":"PreallocMeth", "type":"string", "mapping":"@PreallocMeth"},
          { "name":"AllocID", "type":"string", "mapping":"@AllocID"},
          { "name":"CshMgn", "type":"string", "mapping":"@CshMgn"},
          { "name":"ClrFeeInd", "type":"string", "mapping":"@ClrFeeInd"},
          ...
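One plausible way to use the "mapping" property during pre-parsing is sketched below in Python. The interpretation is an assumption, not something the slides spell out: "@x" is taken to mean attribute x of the current element, a bare name is taken to mean a child element, and the record-level "/FIXML" is taken to name the root. The schema is a trimmed copy with only a few fields, and the sample document and its attribute values are made up.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the schema, for illustration only.
SCHEMA = json.loads("""
{
  "type": "record", "name": "FIXOrder", "mapping": "/FIXML",
  "fields": [
    {"name": "v", "type": "string", "mapping": "@v"},
    {"name": "Order", "mapping": "Order", "type": {
      "name": "OrderRecord", "type": "record", "mapping": "Order",
      "fields": [
        {"name": "ID",   "type": "string", "mapping": "@ID"},
        {"name": "Acct", "type": "string", "mapping": "@Acct"}
      ]}}
  ]
}
""")

def extract(record_schema, elem):
    """Build a dict for one record schema from one XML element."""
    out = {}
    for field in record_schema["fields"]:
        mapping, ftype = field["mapping"], field["type"]
        if mapping.startswith("@"):
            # Assumed convention: "@x" reads attribute x; absent -> empty string.
            out[field["name"]] = elem.get(mapping[1:], "")
        elif isinstance(ftype, dict) and ftype.get("type") == "record":
            # Assumed convention: a bare name selects a direct child element.
            child = elem.find(mapping)
            out[field["name"]] = extract(ftype, child) if child is not None else None
        # Other field kinds (arrays, text nodes) are left out of this sketch.
    return out

# Made-up sample document without XML namespaces.
doc = ET.fromstring('<FIXML v="4.4"><Order ID="103" Acct="26"/></FIXML>')
if doc.tag != SCHEMA["mapping"].lstrip("/"):
    raise ValueError("root element does not match the schema mapping")
record = extract(SCHEMA, doc)
print(record)
```

The resulting dict has the shape of a generic Avro record and could be handed to an Avro writer. Real FIXML documents declare a namespace, which this sketch does not handle.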

Pig Schema

FIXOrder: tuple (
  v: chararray,
  r: chararray,
  s: chararray,
  Order: tuple (
    ID: chararray,
    ID2: chararray,
    OrignDt: chararray,
    TrdDt: chararray,
    Acct: chararray,
    AcctTyp: chararray,
    DayBkngInst: chararray,
    BkngUnit: chararray,
    PreallocMeth: chararray,
    AllocID: chararray,
    CshMgn: chararray,
    ClrFeeInd: chararray,
    ...
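Because the Pig schema mirrors the Avro schema field by field, it can be derived mechanically. The Python sketch below is a hypothetical helper, not part of the talk's tooling. It covers only records and a few primitive types; the primitive type table is an assumption, and arrays (which would become bags) are left out.

```python
# Assumed correspondence between Avro primitive types and Pig types.
AVRO_TO_PIG = {
    "string": "chararray",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
}

def pig_type(avro_type):
    """Return a Pig type expression for a small subset of Avro types."""
    if isinstance(avro_type, str):
        return AVRO_TO_PIG[avro_type]
    if avro_type["type"] == "record":
        fields = ", ".join(
            f"{f['name']}: {pig_type(f['type'])}" for f in avro_type["fields"]
        )
        return f"tuple ({fields})"
    raise ValueError(f"unsupported Avro type in this sketch: {avro_type!r}")

# Trimmed schema with the same nesting as the FIXOrder example.
schema = {
    "type": "record", "name": "FIXOrder",
    "fields": [
        {"name": "v", "type": "string"},
        {"name": "Order", "type": {
            "type": "record", "name": "OrderRecord",
            "fields": [
                {"name": "ID", "type": "string"},
                {"name": "Acct", "type": "string"},
            ]}},
    ],
}
pig_schema = f"{schema['name']}: {pig_type(schema)}"
print(pig_schema)
```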

Avro – Access Methods

§  Direct support for access from Hive (using SerDe)

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc');

§  Access via Pig - AvroStorage
§  Avro API - Java MapReduce

Test Data

§  Base Securities Order file: 500,000 records
§  Replicated for volume
–  15x - 7.5 million records
–  30x - 15 million records
–  45x - 22.5 million records
–  60x - 30 million records
–  75x - 37.5 million records
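The replicated volumes follow directly from the base file size. A short Python check of the arithmetic, using only the numbers on the slide:

```python
BASE_RECORDS = 500_000          # records in the base Securities Order file
FACTORS = [15, 30, 45, 60, 75]  # replication factors listed on the slide

# Record count for each replication factor.
volumes = {factor: factor * BASE_RECORDS for factor in FACTORS}
for factor, records in volumes.items():
    print(f"{factor}x -> {records / 1e6:g} million records")
```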

Comparison

[Diagram: the Approaches diagram repeated. XML Files with On-demand Parsing vs. Pre-parsing (ETL) into Avro Files with an Avro Schema, each accessed via Pig UDF, Hive SerDe and MapReduce.]

File sizes: Orders

§  Base Data
–  XML file size as is: 749,337,916 bytes (750MB)
–  Gzip Compressed: 182,687,654 bytes (183MB)
§  Applied Avro conversion
–  Avro Snappy: 151,647,926 bytes (152MB)
–  Avro Gzip: 107,898,177 bytes (108MB)
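To put the four sizes on a common scale, the Python snippet below expresses each as a percentage of the raw XML size. It uses only the byte counts listed above; the dictionary labels are informal names chosen for the example.

```python
SIZES_BYTES = {
    "XML (raw)": 749_337_916,
    "XML (gzip)": 182_687_654,
    "Avro (Snappy)": 151_647_926,
    "Avro (gzip)": 107_898_177,
}

raw = SIZES_BYTES["XML (raw)"]
# Size of each variant as a percentage of the raw XML file, one decimal place.
percent_of_raw = {
    name: round(100 * size / raw, 1) for name, size in SIZES_BYTES.items()
}
for name, pct in percent_of_raw.items():
    print(f"{name:14s} {pct:5.1f}% of raw XML")
```

On these figures, gzip-compressed XML is about 24% of the raw size, Snappy-compressed Avro about 20%, and gzip-compressed Avro about 14%.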

Storage Size Comparison

[Chart]

Test Environment

§  18 Nodes
§  Node configuration:
–  12 cores per node
–  48GB memory
–  36 TB with 12 disks of 3TB each
§  CDH 4.1.2

Sample Query

§  Security Orders per Account

order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig Schema goes here ----------
);
order_projection = FOREACH order_records GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;
order_group = GROUP order_projection BY Account;
order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);
STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
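For readers less familiar with Pig, the same project, group and sum logic can be sketched in plain Python over already-parsed records. The nested dict layout and the sample values are assumptions for illustration. Quantities are converted from strings because the schema shown earlier declares its fields as strings, and the type of `OrdQty.Qty` is not shown on the slides.

```python
from collections import defaultdict

def quantity_per_account(orders):
    """Sum Order.OrdQty.Qty per Order.Acct, like the GROUP BY / SUM in the Pig script."""
    totals = defaultdict(float)
    for rec in orders:
        order = rec["Order"]                                    # projection: keep only the Order record
        totals[order["Acct"]] += float(order["OrdQty"]["Qty"])  # group by account and sum
    return dict(totals)

# Made-up sample records shaped like parsed FIXOrder records.
sample_orders = [
    {"Order": {"Acct": "A1", "OrdQty": {"Qty": "100"}}},
    {"Order": {"Acct": "A2", "OrdQty": {"Qty": "50"}}},
    {"Order": {"Acct": "A1", "OrdQty": {"Qty": "25"}}},
]
totals = quantity_per_account(sample_orders)
print(totals)
```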

Run Types

§  Pre-parsed approach:
–  XML to Avro materialization: xml-to-avro
•  XML to Avro is run only once on the data
–  Avro to Pig via UDF: avro-to-pig
§  Parse on demand
–  XML parsing using Pig UDF: xml-to-pig

Run time in Seconds

[Chart. Series: Analysis on raw XML (XML to Pig); Pre-parsing XML (XML to Avro); Analysis on parsed XML (Avro to Pig).]

CPU Usage Comparison

[Chart. Series: Analysis on raw XML (XML to Pig); Pre-parsing XML (XML to Avro); Analysis on parsed XML (Avro to Pig).]

Memory Usage Comparison: Total Memused (GB)

[Chart. Series: Analysis on raw XML (XML to Pig); Pre-parsing XML (XML to Avro); Analysis on parsed XML (Avro to Pig).]

Results

§  Analysis on pre-parsed data compared to raw XML
–  Runtime reduction by more than 50%
–  Memory and CPU consumption reduced by about 50%
§  Pre-parsing stage takes more resources and time than on-demand parsing
§  Repetitive queries will benefit from one-time pre-parsing
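The trade-off between a one-time pre-parsing cost and cheaper repeated queries can be expressed as a break-even count. With a pre-parsing cost P, a per-query cost R on raw XML and a per-query cost A on Avro, pre-parsing wins once P + n*A < n*R, that is for n > P / (R - A). The Python sketch below computes the smallest such n. The example costs are hypothetical, chosen only to be consistent with the qualitative results above; the measured values were in the charts and are not reproduced here.

```python
import math

def break_even_queries(pre_parse_cost, raw_query_cost, parsed_query_cost):
    """Smallest number of queries n for which P + n*A < n*R, or None if there is none."""
    saving = raw_query_cost - parsed_query_cost  # saving per query after pre-parsing
    if saving <= 0:
        return None  # pre-parsing never pays off
    return math.floor(pre_parse_cost / saving) + 1

# Hypothetical costs in arbitrary units: pre-parsing costs more than one raw
# query, and a query on parsed data costs half of a query on raw XML.
n = break_even_queries(pre_parse_cost=150, raw_query_cost=100, parsed_query_cost=50)
print(n)
```

With these assumed costs, three queries cost the same either way, and the pre-parsed approach is cheaper from the fourth query on.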

Caveats

§  Not all fields were extracted from the XML input (optional elements)
§  Challenge in keeping up with versions/changes of XML
§  Performance numbers can depend on the type of data and the mapping used

Alternatives

§  Formats other than Avro may be more suitable
§  Record Columnar formats (RC Files & ORC Files)
§  Trevni: a column file format supporting Avro
§  Parquet: another columnar storage for Hadoop

Motivation for Columnar Format

§  Map Reduce capability
§  Column Projections reduce I/O
§  Column Compression due to similarity of data further reduces I/O
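Both effects can be seen in a toy in-memory example. The Python sketch below pivots rows into columns, answers a query by touching a single column, and compresses a column of similar values with run-length encoding. Run-length encoding is used purely as an illustration and is not claimed to be how any of the listed formats store data. The sample rows are made up, and the pivot assumes every row has the same fields.

```python
from itertools import groupby

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Made-up order rows.
rows = [
    {"Acct": "A1", "Qty": 100, "TrdDt": "2013-06-01"},
    {"Acct": "A2", "Qty": 50, "TrdDt": "2013-06-01"},
    {"Acct": "A1", "Qty": 25, "TrdDt": "2013-06-01"},
]
columns = to_columns(rows)

total_qty = sum(columns["Qty"])                   # projection: only the Qty column is read
encoded_dates = run_length_encode(columns["TrdDt"])  # similar values compress well
print(total_qty, encoded_dates)
```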

Summary

§  Materialized version well-suited for repeated queries
§  For ad-hoc/experimental queries parse-on-demand is better
§  Mapping from XML to Avro can be automated
§  Hive, Pig and MapReduce Interfaces to access Avro Files
§  Relative trade-offs between flexibility and performance/storage

Questions & Comments

Thanks for Listening

sujoe.bose@sabre.com
