Efficient processing of large and complex XML documents in Hadoop

Many systems capture XML data in Hadoop for analytical processing. When XML documents are large and have complex nested structures, processing them repeatedly is inefficient: parsing XML is CPU intensive, and storing XML in its native form wastes space. The problem is compounded in the Big Data space, where millions of such documents must be processed and analyzed within a reasonable time. This talk proposes an efficient method that leverages Avro, a storage and communication format that is flexible, compact, and specifically built for Hadoop environments to model complex data structures. XML documents can be parsed and converted into Avro format on load, and then accessed via Hive (using a SQL-like interface), Java MapReduce, or Pig. A concrete use case validates this approach, along with variations of it and their relative trade-offs.

Published in: Technology


  1. Efficient processing of large and complex XML documents in Hadoop
     Sujoe Bose, Senior Principal, Sabre Holdings
     June 2013
  2. Presentation Outline
     §  Motivation
     §  ETL vs. ELT
     §  Avro Format
     §  Mapping from XML to Avro
     §  Interfaces to access Avro
     §  Performance and Storage considerations
     §  Other types of storage/processing formats
     confidential
  3. You will learn about …
     §  A method to store and process complex XML data in Hadoop as Avro files
     §  Interfaces to access and analyze data in Avro from Hive, Java and Pig
     §  Variations of the method and their relative trade-offs in storage and processing
  4. Motivation
     §  Prevalence of XML and its derivatives
        –  Spurred by Web Services and SOA
        –  Preferred communication format until newer formats entered
        –  Data and logs represented in XML
     §  XML – metadata combined with data
        –  Flexibility vs. Complexity
     §  Could be arbitrarily nested and large
     §  Volumes of documents – Big Data
  5. Challenges
     §  Parsing XML is CPU intensive
     §  Certain parsers/parsing methods result in more memory consumption
     §  Repeated parsing for each query
     §  Large and deeply nested XML documents make the problem worse
     §  Presence of tags in data results in high I/O due to storage size
     §  Special handling of optional fields
  6. ETL vs. ELT
     §  Hadoop is generally built for ELT (extract, load, transform)
        –  aka Schema-on-Read
        –  Load as-is
        –  Transform on Access/Query
     §  Compare with Data Warehouse ETL
        –  aka Schema-on-Write
        –  Transform and Load
        –  Queries are a lot simpler
        –  Transformation and cleansing done a priori
  7. Mix of ETL and ELT
     ELT
     §  Generally better in Flexibility
     §  More suitable for simpler and well-defined formats
     §  More applicable for experimentation
     §  XML data parsed on demand for every query
     ETL
     §  Generally better in Performance
     §  More suitable when substantial cleansing and reformatting is needed
     §  Repetitive queries and production workloads
     §  XML data pre-parsed to minimize resource usage
  8. Approaches
     [Diagram with three layers. Data: XML Files, Avro Files. Processing: On-demand Parsing; ETL Pre-parsing with Avro Schema. Interfaces (shown for each path): Pig UDF, Hive SerDe, MapReduce.]
  9. ELT
     [Same diagram as slide 8.]
  10. ETL
     [Same diagram as slide 8.]
  11. XML Pre-parsing
     §  Nested Elements and Attributes
     §  Representation of parsed XML Structure
     §  Enter Avro!
  12. Avro
     §  Data serialization system
     §  Specifically designed for Hadoop, but used in other environments also
     §  Rich data structures: Arrays, Records, Maps etc.
     §  Compact, fast, binary data format
     §  Metadata stored at file level – not record level
     §  Splittable – Ideal for MapReduce
  13. Avro APIs
     §  Generic Objects and Pre-generated Objects
        –  Easy API including simple gets and puts
     §  APIs in several languages
        –  Java
        –  C#
        –  C/C++
        –  Python
        –  Ruby
  14. Use-case
     §  FIXML – Financial Information eXchange
        –  http://www.fixprotocol.org/specifications/
     §  XML Database Benchmark
        –  http://tpox.sourceforge.net/
     §  Provides sample data for benchmarking
     §  Data Generator for generating large and predictable datasets
  15. FIXML
     §  XML Data Generator
        –  http://tpox.sourceforge.net/tpoxdata.htm
     §  Order: Buy and sell order of securities
  16. Simple mapping
     XML                                          Avro     Pig
     Elements with repeated nested elements       Array    Bag
     Elements with attributes and text elements   Record   Tuple
     Attributes and Text Elements                 Field    Field
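The mapping table on slide 16 can be illustrated with a minimal Python sketch. It is not the converter used in the talk: the function name `element_to_record`, the `_text` field name, and the sample XML are made up for illustration, and a dict and a list stand in for an Avro record (Pig tuple) and an Avro array (Pig bag).

```python
import xml.etree.ElementTree as ET

def element_to_record(elem):
    """Map an XML element to a dict, standing in for an Avro record / Pig tuple."""
    record = dict(elem.attrib)               # attributes become fields
    text = (elem.text or "").strip()
    if text:
        record["_text"] = text               # hypothetical field name for text content
    children = {}
    for child in elem:
        children.setdefault(child.tag, []).append(element_to_record(child))
    for tag, items in children.items():
        # Repeated nested elements become a list (Avro array / Pig bag);
        # a single nested element stays a nested record.
        record[tag] = items if len(items) > 1 else items[0]
    return record

# Made-up sample document, loosely shaped like an order.
sample = '<Order ID="1" Acct="A1"><Pty ID="p1"/><Pty ID="p2"/><OrdQty Qty="100"/></Order>'
record = element_to_record(ET.fromstring(sample))
print(record)
```

Deciding "array or record" from the data, as this sketch does, is a simplification. A schema-driven converter would take the cardinality from the schema so that every record has the same shape.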
  17. Avro Schema
     {
       "type": "record",
       "name": "FIXOrder",
       "namespace": "com.sabre.fixml",
       "doc": "Definition and mapping for FIX Orders",
       "mapping": "/FIXML",
       "fields": [
         { "name":"v", "type":"string", "mapping":"@v"},
         { "name":"r", "type":"string", "mapping":"@r"},
         { "name":"s", "type":"string", "mapping":"@s"},
         { "name":"Order", "mapping":"Order", "type": {
             "name":"OrderRecord", "mapping":"Order", "type": "record",
             "fields": [
               { "name":"ID", "type":"string", "mapping":"@ID"},
               { "name":"ID2", "type":"string", "mapping":"@ID2"},
               { "name":"OrignDt", "type":"string", "mapping":"@OrignDt"},
               { "name":"TrdDt", "type":"string", "mapping":"@TrdDt"},
               { "name":"Acct", "type":"string", "mapping":"@Acct"},
               { "name":"AcctTyp", "type":"string", "mapping":"@AcctTyp"},
               { "name":"DayBkngInst", "type":"string", "mapping":"@DayBkngInst"},
               { "name":"BkngUnit", "type":"string", "mapping":"@BkngUnit"},
               { "name":"PreallocMeth", "type":"string", "mapping":"@PreallocMeth"},
               { "name":"AllocID", "type":"string", "mapping":"@AllocID"},
               { "name":"CshMgn", "type":"string", "mapping":"@CshMgn"},
               { "name":"ClrFeeInd", "type":"string", "mapping":"@ClrFeeInd"},
               ...
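The "mapping" entries in the schema on slide 17 look like XPath-style hints: "@name" for an attribute and a bare name for a child element. The following Python sketch shows how such hints could drive extraction. This reading of "mapping" is an assumption, as is the `extract` helper; the schema is a trimmed copy of the one on the slide and the sample document is made up. "mapping" is not a standard Avro attribute, so it would be carried as extra schema metadata.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's schema (only a few of its fields).
SCHEMA = json.loads("""
{"type": "record", "name": "FIXOrder", "mapping": "/FIXML",
 "fields": [
   {"name": "v", "type": "string", "mapping": "@v"},
   {"name": "Order", "mapping": "Order",
    "type": {"name": "OrderRecord", "type": "record", "mapping": "Order",
             "fields": [
               {"name": "ID", "type": "string", "mapping": "@ID"},
               {"name": "Acct", "type": "string", "mapping": "@Acct"}]}}]}
""")

def extract(elem, record_schema):
    """Build a dict for one record schema by following its "mapping" hints."""
    out = {}
    for field in record_schema["fields"]:
        mapping = field["mapping"]
        if mapping.startswith("@"):
            # Assumed meaning: "@name" reads an attribute of the current element.
            out[field["name"]] = elem.get(mapping[1:])
        else:
            # Assumed meaning: a bare name is a child element holding a nested record.
            child = elem.find(mapping)
            out[field["name"]] = None if child is None else extract(child, field["type"])
    return out

# Made-up sample document.
doc = ET.fromstring('<FIXML v="5.0"><Order ID="103" Acct="A7"/></FIXML>')
order = extract(doc, SCHEMA)
print(order)
```

Keeping the XML paths inside the schema means one file describes both the target layout and where each value comes from, which fits the later summary point that the XML-to-Avro mapping can be automated.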
  18. Pig Schema
     FIXOrder: tuple (
       v: chararray,
       r: chararray,
       s: chararray,
       Order: tuple (
         ID: chararray,
         ID2: chararray,
         OrignDt: chararray,
         TrdDt: chararray,
         Acct: chararray,
         AcctTyp: chararray,
         DayBkngInst: chararray,
         BkngUnit: chararray,
         PreallocMeth: chararray,
         AllocID: chararray,
         CshMgn: chararray,
         ClrFeeInd: chararray,
  19. Avro – Access Methods
     §  Direct support for access from Hive (using SerDe)
     CREATE EXTERNAL TABLE <TableName>
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
     STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
     LOCATION 'location-of-avro-files'
     TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')
     §  Access via Pig - AvroStorage
     §  Avro API - Java MapReduce
  20. Test Data
     §  Base Securities Order file: 500,000 records
     §  Replicated for volume
        –  15x: 7.5 million records
        –  30x: 15 million records
        –  45x: 22.5 million records
        –  60x: 30 million records
        –  75x: 37.5 million records
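The test volumes on slide 20 follow directly from the base file size and the replication factors. A short Python check (variable names are illustrative):

```python
BASE_RECORDS = 500_000                     # records in the base Securities Order file
FACTORS = (15, 30, 45, 60, 75)             # replication factors from the slide

volumes = {factor: BASE_RECORDS * factor for factor in FACTORS}
for factor, records in volumes.items():
    print(f"{factor}x: {records / 1e6:g} million records")
```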
  21. Comparison
     [Same diagram as slide 8.]
  22. File sizes: Orders
     §  Base Data
        –  XML file size as is: 749,337,916 (750MB)
        –  Gzip Compressed: 182,687,654 (183MB)
     §  Applied Avro conversion
        –  Avro Snappy: 151,647,926 (152MB)
        –  Avro Gzip: 107,898,177 (108MB)
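To put the sizes on slide 22 in proportion, the following Python snippet expresses each variant as a percentage of the raw XML size. The byte counts are the ones from the slide; the variable names are illustrative.

```python
XML_RAW = 749_337_916                      # bytes, XML file as is

sizes = {
    "XML gzip": 182_687_654,
    "Avro Snappy": 151_647_926,
    "Avro gzip": 107_898_177,
}

# Size of each variant as a percentage of the raw XML file.
percent_of_raw = {name: round(100 * size / XML_RAW, 1) for name, size in sizes.items()}
for name, pct in percent_of_raw.items():
    print(f"{name}: {pct}% of raw XML")
```

On these figures both Avro variants come out smaller than the gzip-compressed XML.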
  23. Storage Size Comparison
     [Chart]
  24. Test Environment
     §  18 Nodes
     §  Node configuration:
        –  12 cores per node
        –  48GB memory
        –  36 TB with 12 disks of 3TB each
     §  CDH 4.1.2
  25. Sample Query
     §  Security Orders per Account
     order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
       ------- Pig Schema goes here ----------
     );
     order_projection = FOREACH order_records GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;
     order_group = GROUP order_projection BY Account;
     order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);
     STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
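For readers less familiar with Pig, the logic of the sample query on slide 25 (project account and quantity, group by account, sum the quantity) can be sketched in plain Python. The records below are made up, and storing `Qty` as a string that needs a numeric cast is an assumption, since the Pig schema shown earlier is truncated before that field.

```python
from collections import defaultdict

# Made-up records shaped like the Pig schema: Order.Acct and Order.OrdQty.Qty.
order_records = [
    {"Order": {"Acct": "A1", "OrdQty": {"Qty": "100"}}},
    {"Order": {"Acct": "A2", "OrdQty": {"Qty": "50"}}},
    {"Order": {"Acct": "A1", "OrdQty": {"Qty": "25"}}},
]

def quantity_per_account(records):
    """Equivalent of FOREACH ... GENERATE, GROUP BY Account, then SUM(Quantity)."""
    totals = defaultdict(float)
    for rec in records:
        order = rec["Order"]
        totals[order["Acct"]] += float(order["OrdQty"]["Qty"])  # assumed string field
    return dict(totals)

totals = quantity_per_account(order_records)
print(totals)
```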
  26. Run Types
     §  Pre-parsed approach:
        –  XML to Avro materialization: xml-to-avro
           •  XML to Avro is run only once on the data
        –  Avro to Pig via UDF: avro-to-pig
     §  Parse on demand
        –  XML parsing using Pig UDF: xml-to-pig
  27. Run time in Seconds
     [Chart comparing three runs. Analysis on raw XML: XML to Pig. Pre-parsing XML: XML to Avro. Analysis on parsed XML: Avro to Pig.]
  28. CPU Usage Comparison
     [Chart comparing the same three runs.]
  29. Memory Usage Comparison: Total Memused (GB)
     [Chart comparing the same three runs.]
  30. Results
     §  Analysis on pre-parsed data compared with raw XML
        –  Runtime reduction by more than 50%
        –  Memory and CPU consumption reduced by about 50%
     §  Pre-parsing stage takes more resources and time than on-demand parsing
     §  Repetitive queries will benefit from one-time pre-parsing
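The trade-off on slide 30 can be framed as a break-even count: pre-parsing pays off once its one-time cost is covered by the per-query savings. The Python sketch below is an illustration under stated assumptions. The function name is hypothetical, and the timings in the usage line are invented placeholders, not measurements from the talk.

```python
import math

def break_even_queries(preparse_s, raw_query_s, avro_query_s):
    """Smallest query count n with preparse_s + n * avro_query_s < n * raw_query_s.

    Returns None when querying Avro is not faster than querying raw XML.
    """
    saving = raw_query_s - avro_query_s
    if saving <= 0:
        return None
    return math.floor(preparse_s / saving) + 1

# Hypothetical timings in seconds, chosen only to exercise the formula.
n = break_even_queries(preparse_s=300, raw_query_s=200, avro_query_s=80)
print(n)
```

With these placeholder numbers the third query is the first one for which the pre-parsed route is cheaper overall.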
  31. Caveats
     §  Not all fields were extracted from the XML input (optional elements)
     §  Challenge in keeping up with versions/changes of XML
     §  Performance numbers can depend on the type of data and the mapping used
  32. Alternatives
     §  Formats other than Avro may be more suitable
     §  Record Columnar formats (RC Files & ORC Files)
     §  Trevni: a column file format supporting Avro
     §  Parquet: another columnar storage for Hadoop
  33. Motivation for Columnar Format
     §  MapReduce capability
     §  Column Projections reduce I/O
     §  Column Compression due to similarity of data further reduces I/O
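The point about column projections on slide 33 can be made concrete with a toy Python comparison. It is not how RC, ORC, Trevni, or Parquet files are laid out on disk: JSON text stands in for the serialized bytes, and the records and field values are made up.

```python
import json

# Made-up records; field names borrowed from the order schema.
rows = [
    {"Acct": "A1", "Qty": 100, "TrdDt": "2013-06-01"},
    {"Acct": "A2", "Qty": 50, "TrdDt": "2013-06-01"},
    {"Acct": "A1", "Qty": 25, "TrdDt": "2013-06-02"},
]

# Column layout: one list of values per field instead of one dict per record.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def serialized_size(values):
    """Bytes of the JSON encoding, used as a stand-in for on-disk size."""
    return len(json.dumps(values).encode("utf-8"))

row_scan_bytes = serialized_size(rows)            # row layout: whole records are read
col_scan_bytes = serialized_size(columns["Qty"])  # column layout: only the Qty column
print(row_scan_bytes, col_scan_bytes)
```

A query that only needs `Qty` touches a fraction of the bytes in the column layout, and similar values stored side by side in one column also tend to compress better.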
  34. Summary
     §  Materialized version well-suited for repeated queries
     §  For ad-hoc/experimental queries parse-on-demand is better
     §  Mapping from XML to Avro can be automated
     §  Hive, Pig and MapReduce Interfaces to access Avro Files
     §  Relative trade-offs between flexibility and performance/storage
  35. Questions & Comments
     Thanks for Listening
     sujoe.bose@sabre.com