Performant XML processing in a distributed environment is a major challenge. We will dive into a cost-effective pipeline that extracts, loads and transforms XML on a platform built for performance and flexibility. Hadoop, Spark and Impala all feature in this session about bringing commodity software to a market dominated by proprietary solutions.
What can we achieve by processing non-splittable data in a distributed fashion? I will talk about the motivation behind our research and show how we evolved a solution to cope with an ever-changing environment. Stepping through the solution, I will show how you can strip away the restrictions of XML and load the data onto Hadoop, ready for analysis at scale in both an ad hoc and a modelling fashion.
Evaluating and evolving this solution is paramount. Load- and throughput-testing methods are highlighted, along with guidance on tuning both the pipeline and the Hadoop platform to ensure your solution is optimised for a dynamic environment.
61. Reading XML in Spark
Keep things simple
Use the XML input format from Hadoop Streaming (see the sketch below)
Inputs are split from opening to closing tag
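As a minimal sketch of this read path (not the definitive implementation), the old mapred API can be driven from Spark with the hadoop-streaming jar on the classpath; the <message> tag and input path here are illustrative:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
    import org.apache.hadoop.streaming.StreamInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object XmlReadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("xml-read-sketch"))

        val jobConf = new JobConf()
        // Delimit each record by the opening and closing tag of one message
        jobConf.set("stream.recordreader.class",
          "org.apache.hadoop.streaming.StreamXmlRecordReader")
        jobConf.set("stream.recordreader.begin", "<message>") // illustrative tag
        jobConf.set("stream.recordreader.end", "</message>")
        FileInputFormat.addInputPaths(jobConf, "hdfs:///data/raw/xml") // illustrative path

        // StreamXmlRecordReader places each complete <message>...</message>
        // block in the record key; the value is left empty
        val xml = sc.hadoopRDD(jobConf,
          classOf[StreamInputFormat], classOf[Text], classOf[Text])

        val messages = xml.map { case (key, _) => key.toString }
        println(s"messages read: ${messages.count()}")
      }
    }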
63. Avro Support in Spark
Use Kryo Serialisation for correct Avro support (see the sketch below)
Avro Serialisation and Avro data format
Avro Serialisation and Parquet data format
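A minimal configuration sketch, assuming Spark 1.5+ where SparkConf.registerAvroSchemas wires up a Kryo serialiser for Avro generic records; the schema itself is illustrative:

    import org.apache.avro.Schema
    import org.apache.spark.{SparkConf, SparkContext}

    object AvroKryoSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative Avro schema for one of the extracted record types
        val schema = new Schema.Parser().parse(
          """{"type":"record","name":"Customer","fields":[
            |  {"name":"id","type":"string"},
            |  {"name":"name","type":"string"}
            |]}""".stripMargin)

        val conf = new SparkConf()
          .setAppName("avro-kryo-sketch")
          // Kryo handles Avro records correctly where Java serialisation does not
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Pre-registering schemas avoids shipping the full schema with every record
          .registerAvroSchemas(schema)

        val sc = new SparkContext(conf)
        // ... build GenericRecords here and shuffle them safely ...
        sc.stop()
      }
    }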
64. Design Extractors
In this case we want to turn one XML message into 5 different Avro records
5 extraction classes should be created (one is sketched below)
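One way those five classes could be shaped; a sketch assuming a shared Extractor trait, with Customer, Id and Name as purely illustrative tag and field names:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}

    // Hypothetical contract: one XML message in, zero or more Avro records out.
    // Five concrete implementations would cover the five target Avro types.
    trait Extractor extends Serializable {
      def extract(xmlMessage: String): Seq[GenericRecord]
    }

    // One of the five extractors
    class CustomerExtractor(schemaJson: String) extends Extractor {
      // Avro Schema objects are not reliably serialisable, so rebuild
      // the schema from its JSON on each executor
      @transient private lazy val schema: Schema =
        new Schema.Parser().parse(schemaJson)

      override def extract(xmlMessage: String): Seq[GenericRecord] = {
        val doc = scala.xml.XML.loadString(xmlMessage)
        (doc \\ "Customer").map { node =>
          val record = new GenericData.Record(schema)
          record.put("id", (node \ "Id").text)
          record.put("name", (node \ "Name").text)
          record: GenericRecord
        }
      }
    }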
65. Design Extractors - cont
Use DOM/SAX if you have no definitive XSD for the XML
DOM is acceptable as the data is already in memory (see the sketch below)
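A minimal DOM sketch using the JDK's built-in parser: because each record has already been split out by the input format, it is a small in-memory string, so building a full document tree is cheap. Tag names are illustrative:

    import java.io.ByteArrayInputStream
    import java.nio.charset.StandardCharsets
    import javax.xml.parsers.DocumentBuilderFactory
    import org.w3c.dom.Document

    object DomSketch {
      // Parse one already-extracted XML message into a DOM tree
      def parse(xmlMessage: String): Document = {
        val factory = DocumentBuilderFactory.newInstance()
        factory.setNamespaceAware(true)
        factory.newDocumentBuilder().parse(
          new ByteArrayInputStream(xmlMessage.getBytes(StandardCharsets.UTF_8)))
      }

      def main(args: Array[String]): Unit = {
        val doc = parse("<message><Customer><Id>42</Id></Customer></message>")
        // With no definitive XSD, navigate by tag name instead of schema binding
        val ids = doc.getElementsByTagName("Id")
        println(ids.item(0).getTextContent) // prints 42
      }
    }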