1
Parquet data format &
Impala overview
2
Agenda
• Objective
• Various data formats
• Use case
• Parquet
• Impala
3
Objective
• Two-fold:
• Quest for a more performant data format than
Avro for nested data
• Understand and test new data formats in general
4
Hadoop data formats
• Sequence file. Stores key-value pairs of data in a flat binary file; rows are stored as values.
• ORC. Column-oriented storage; adds RLE and dictionary encoding, column statistics, and single-file output. Bloom filter support is planned.
• Avro. Data serialization framework: a serialization format and exchange service, usable from any language. Data is accompanied by its schema (in JSON). Supports schema evolution.
5
Parquet
• Columnar storage
• Automatic dictionary encoding and run-length encoding; encoding is kept separate from compression.
• Run-length encoding replaces sequences ("runs") of consecutive repeated values (or other units of data) with a single value and the length of the run, e.g. AAAABB becomes (A,4)(B,2).
• Dictionary encoding takes the distinct values present in a column and represents each one in a compact 2-byte form.
6
Parquet
• Parquet can handle multiple schemas and supports schema evolution; the two log types below are sketched as tables after this list.
• LogType A: organizationId, userId, timestamp, recordId, cpuTime
• LogType V: userId, organizationId, timestamp, foo, bar
• Can be used by any project in the Hadoop ecosystem. Integrations are provided for M/R, Pig, Hive, Cascading and Impala.
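A minimal sketch of the two log types as Parquet-backed Impala tables; only the field names come from the slides, so the table names and column types are assumptions:
IMPALA> CREATE TABLE log_type_a (
          organizationId BIGINT, userId BIGINT, `timestamp` BIGINT,  -- `timestamp` quoted: it collides with the type keyword
          recordId STRING, cpuTime BIGINT)
        STORED AS PARQUET;
IMPALA> CREATE TABLE log_type_v (
          organizationId BIGINT, userId BIGINT, `timestamp` BIGINT,
          foo STRING, bar STRING)
        STORED AS PARQUET;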
7
Parquet
• SELECT vs INSERT.
• Parquet tables require relatively little memory to query, because a query reads and decompresses data in 8 MB chunks.
• Inserting into a Parquet table is a more memory-intensive operation, because the data for each data file (with a maximum size of 1 GB) is held in memory until it is encoded, compressed, and written to disk.
8
Parquet
• Memory issues (heap space errors) were resolved by:
• Reducing parquet.block.size. The block size is the size of a row group being buffered in memory; its default value is 256 MB.
• The total memory allocated was around 1 GB.
• With multiple Hive partitions, multiple buffers were created (one for writing into each partition).
• So writing data as Parquet will always have a high memory requirement.
• Hive's DISTRIBUTE BY was the workaround to the memory issues (see the sketch below).
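A hedged sketch of that workaround in HiveQL; the table names, the partition column, and the exact block size are hypothetical:
HIVE> SET parquet.block.size=134217728;               -- buffer 128 MB row groups instead of the 256 MB default
HIVE> SET hive.exec.dynamic.partition.mode=nonstrict; -- allow a fully dynamic partition insert
HIVE> INSERT OVERWRITE TABLE logs_parquet PARTITION (log_date)
      SELECT organizationId, userId, recordId, cpuTime, log_date
      FROM logs_text
      DISTRIBUTE BY log_date;   -- each reducer writes a single partition, so only one Parquet buffer is open per task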
9
Parquet vs other formats
Performance test with 100 GB of data over multiple queries: Parquet wins.
10
Impala overview
• MPP implementation of a query engine.
• Impala vs Hive: Impala runs SQL queries for interactive, exploratory analytics on large data sets, whereas Hive runs as batch.
• Does not use M/R, but does use HDFS.
• Not CEP; closer to an RDBMS.
• Impala uses the same metadata store as Hive to record information about table structure and properties.
11
Impala overview
• Can create a table in Hive and use it in Impala.
• E.g. Impala doesn't support Avro, but Hive does.
• The language is a mix between SQL and HiveQL.
• Requires a lot of memory (128 GB min. per node).
• Initial load of data is via REFRESH, which can take a lot of time: it loads the block location data for newly added data files (see the sketch below).
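A minimal sketch of picking up Hive-side changes from Impala; the table name is hypothetical:
-- after creating a table or adding data files on the Hive/HDFS side:
IMPALA> INVALIDATE METADATA;           -- make tables newly created through Hive visible to Impala
IMPALA> REFRESH events;                -- reload block locations for newly added data files of 'events'
IMPALA> SELECT COUNT(*) FROM events;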
12
Impala overview
• Shortcomings
• Impala does not support nested types at this point (version 1.2.3): a table cannot contain nested types such as array, map, or struct.
• Impala currently does not "spill to disk": a query fails if the intermediate results being processed on a node exceed the memory reserved for Impala on that node.
• No custom serializer/deserializer classes (SerDes).
• Impala cancels a running query if any host on which that query is executing fails.
13
Impala overview
• Example: there are three ways to create a Parquet table for Impala:
• 1. Use a Parquet table created in Hive (with no nested data types); see the sketch below.
• 2. Create and load a normal text table in Impala, then clone its layout as Parquet:
• IMPALA> CREATE TABLE parquet_table_name LIKE text_table_name STORED AS PARQUET LOCATION '/user/hdfs/..';
• 3. Create a Parquet-format table and then insert into it from the normal text table:
• IMPALA> INSERT OVERWRITE TABLE parquet_table_name SELECT * FROM text_table_name;
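For option 1, a hedged sketch of the Hive-side DDL as it looked on CDH 4-era Hive, where Parquet tables were declared through explicit SerDe and input/output format classes (newer Hive versions accept STORED AS PARQUET directly); the column list is hypothetical:
HIVE> CREATE TABLE parquet_table_name (id BIGINT, name STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT  'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';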
14
Use Case
• Can't query the Avro table in Impala because it has nested columns.
• An Avro table created through Hive can be used in Impala as long as it contains only Impala-compatible data types
• (it cannot contain nested types such as array, map, or struct); a flat example is sketched below.
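A hedged sketch of such a flat Avro table defined through Hive's Avro SerDe; the table name, field names, and schema are hypothetical:
HIVE> CREATE EXTERNAL TABLE flat_avro_logs
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"FlatLog","fields":[{"name":"organizationId","type":"long"},{"name":"userId","type":"long"},{"name":"cpuTime","type":"long"}]}');
-- only flat, Impala-compatible types, so Impala can query this table after INVALIDATE METADATA / REFRESH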
15
Use Case
• How to deal with nested XML data in Hadoop?
• There is no direct mapping from xml to avro. Process goes:
• Parse XML and Convert to Avro : Parse XML using XMLStreamReader and
• Perform JAXB unmarshalling and Create Avro Records from JAXB objects.Need to write
a java class for this.Tried using Parquet/Avro:
• Tested: Process Xml – first convert into Avro and then store into Parquet format using
parquet-avro apis.
• The problem is the Schema provided has some arrays which is union of type string and
null both.
• Currently this AvroSchemaConverter is not able to handle such avro schema and it gives
exception.
• Tested: Impala 1.2.3 on CDH 4.5
• Impala doesn’t support nested types at this point
16
Thank you

Editor's Notes

• #5 Parquet files are also splittable.