3. Parquet
• Open source, created by Twitter and Cloudera, based on Google Dremel
• Columnar File Format
• Limits I/O to only the data that is needed
• Compresses very well: ADAM files are 5-25% smaller than BAM files, with no loss of data
• 3 layers of parallelism: file/row group, column chunk, page
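The layout described above can be sketched with a toy in-memory model. This is not the Parquet file format or any Parquet library API; the class names, the `PAGE_SIZE` value, and the record fields (`name`, `start`, `mapq`) are all assumptions chosen for illustration. It only shows why a columnar layout lets a reader touch the pages of one column and skip the rest.

```python
from dataclasses import dataclass

PAGE_SIZE = 2  # values per page; deliberately tiny (assumption for the demo)


@dataclass
class ColumnChunk:
    pages: list  # each page is a list of values from a single column


@dataclass
class RowGroup:
    columns: dict  # column name -> ColumnChunk


def write_row_groups(rows, rows_per_group):
    """Split rows into row groups, then store each column separately as pages."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        columns = {}
        for name in batch[0]:
            values = [row[name] for row in batch]
            pages = [values[i:i + PAGE_SIZE]
                     for i in range(0, len(values), PAGE_SIZE)]
            columns[name] = ColumnChunk(pages)
        groups.append(RowGroup(columns))
    return groups


def read_column(groups, name):
    """Read one column only; return its values and the number of pages touched."""
    values, pages_read = [], 0
    for group in groups:
        for page in group.columns[name].pages:
            pages_read += 1
            values.extend(page)
    return values, pages_read


# Hypothetical alignment-like records, not real ADAM or BAM data.
reads = [
    {"name": "r1", "start": 100, "mapq": 60},
    {"name": "r2", "start": 150, "mapq": 30},
    {"name": "r3", "start": 210, "mapq": 60},
    {"name": "r4", "start": 400, "mapq": 0},
    {"name": "r5", "start": 512, "mapq": 45},
]

groups = write_row_groups(reads, rows_per_group=4)
mapq_values, pages_read = read_column(groups, "mapq")
total_pages = sum(len(chunk.pages)
                  for group in groups
                  for chunk in group.columns.values())
```

In this toy model a query over `mapq` touches only that column's pages, a fraction of `total_pages`; a row-oriented layout would have to scan every record in full.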
6. Parquet/Spark integration
• 1 row group in Parquet maps to 1 partition in Spark
• We interact with Parquet via input/output formats
• Spark builds and executes a computation directed acyclic graph (DAG), and manages data locality, errors, and retries
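The mapping of row groups to partitions, and the retry behaviour, can be pictured with a toy scheduler. This is a sketch using only the Python standard library, not Spark's scheduler or API; `MAX_ATTEMPTS`, `run_job`, and `flaky_sum` are hypothetical names, and data locality is not modelled at all.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ATTEMPTS = 3  # hypothetical retry budget per task


def run_with_retries(task, partition):
    """Run one task on one partition, retrying on failure."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            return task(partition)
        except RuntimeError as error:
            last_error = error
    raise last_error


def run_job(row_groups, task):
    """Treat each row group as one partition and run one task per partition."""
    partitions = list(row_groups)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_with_retries, task, p) for p in partitions]
        return [future.result() for future in futures]  # results in partition order


# Three toy "row groups", so three partitions and three tasks.
row_groups = [[1, 2, 3], [4, 5], [6]]
attempts = {}


def flaky_sum(partition):
    """Sum a partition, failing once on the second partition to simulate an error."""
    key = tuple(partition)
    attempts[key] = attempts.get(key, 0) + 1
    if key == (4, 5) and attempts[key] == 1:
        raise RuntimeError("simulated task failure")
    return sum(partition)


partial_sums = run_job(row_groups, flaky_sum)
total = sum(partial_sums)
```

The failed task is rerun on its own partition only; the other partitions are unaffected, which is the property that makes per-partition retries cheap.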