Hive - SerDe and LazySerDe

This is a description of the SerDe layer in the Hadoop Hive project.
LazySerDe is a particular implementation of the SerDe interface.

  1. Hive – SerDe and LazySerDe
     • Part of Apache Hadoop Hive Project http://hadoop.apache.org/hive
     • Data Infrastructure Team, Facebook Inc. (Slides by Zheng Shao)
  2. Where is SerDe?
     [Diagram: data flow through a Hive job. Labels: File on HDFS, Stream, Writable, Hierarchical Object, Hive Operator, User Script, Map Output File. Stages: Mapper, Reducer. Layers: FileFormat / Hadoop Serialization, SerDe, ObjectInspector.]
     • Sample rows in the file: imp 1.0 3 54; Imp 0.2 1 33; clk 2.2 8 212; Imp 0.7 2 22; or thrift_record<…>
     • Writable examples: BytesWritable(\x3F\x64\x72\x00); Text('imp 1.0 3 54') // UTF8 encoded
     • Hierarchical Object forms:
       • Java Object: object of a Java class
       • Standard Object: use ArrayList for struct and array, use HashMap for map
       • LazyObject: lazily deserialized
  3. SerDe, ObjectInspector and TypeInfo
     [Diagram: a SerDe converts between a Writable and a Hierarchical Object; ObjectInspectors expose the type and the fields of the object.]
     • Writable examples: BytesWritable(\x3F\x64\x72\x00); Text(' a=av:b=bv 23 1:2=4:5 abcd ')
     • TypeInfo: struct of (map<string,string>, int, list of struct(int, int), string)
     • Hierarchical Object as a Java object: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } class ClassC { Integer a; Integer b; }
     • Hierarchical Object as a standard object: List ( HashMap("a" → "av", "b" → "bv"), 23, List(List(1,null),List(2,4),List(5,null)), "abcd" )
     • SerDe: deserialize, serialize, getOI
     • ObjectInspector1: getType, getFieldOI, getStructField (returns HashMap("a" → "av", "b" → "bv") for field HashMap<String, String> a)
     • ObjectInspector2: getType, getMapValueOI, getMapValue (returns "av")
     • ObjectInspector3: getType
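The split between data and ObjectInspector on slide 3 can be illustrated with a minimal sketch. The interfaces below are hypothetical and heavily simplified: the method names (getType, getFieldOI, getStructField, getMapValueOI, getMapValue) echo the slide, but the signatures are assumptions and not Hive's actual API. The data is held in the "standard object" form (a List for a struct, a Map for a map).

```java
import java.util.List;
import java.util.Map;

// Hypothetical, pared-down inspectors. Method names echo the slide,
// but these are not Hive's actual interfaces.
interface Inspector {
    String getType();
}

final class PrimitiveInspector implements Inspector {
    private final String type;
    PrimitiveInspector(String type) { this.type = type; }
    @Override public String getType() { return type; }
}

final class MapInspector implements Inspector {
    private final Inspector keyOI;
    private final Inspector valueOI;
    MapInspector(Inspector keyOI, Inspector valueOI) {
        this.keyOI = keyOI;
        this.valueOI = valueOI;
    }
    @Override public String getType() {
        return "map<" + keyOI.getType() + "," + valueOI.getType() + ">";
    }
    Inspector getMapValueOI() { return valueOI; }
    // The data object is assumed to be a plain java.util.Map.
    Object getMapValue(Object map, Object key) { return ((Map<?, ?>) map).get(key); }
}

final class StructInspector implements Inspector {
    private final List<String> names;
    private final List<Inspector> fieldOIs;
    StructInspector(List<String> names, List<Inspector> fieldOIs) {
        this.names = names;
        this.fieldOIs = fieldOIs;
    }
    @Override public String getType() {
        StringBuilder sb = new StringBuilder("struct<");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(names.get(i)).append(':').append(fieldOIs.get(i).getType());
        }
        return sb.append('>').toString();
    }
    Inspector getFieldOI(int i) { return fieldOIs.get(i); }
    // The data object is assumed to be a plain java.util.List of field values.
    Object getStructField(Object struct, int i) { return ((List<?>) struct).get(i); }
}
```

In this sketch an operator never casts the row itself: it asks the struct inspector for a field and for that field's inspector, then uses the returned inspector to go one level deeper. A different object representation would only need different inspector implementations.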
  4. LazySimpleSerDe components
     [Diagram of the objects involved in deserializing one row.]
     • byte[] data: byte[](' a=av:b=bv 23 1:2=4:5 abcd ')
     • Hierarchical Object / LazyObject (one per SerDe instance): LazyStruct, LazyMap, LazyArray, LazyInteger, LazyString
     • LazyObjectInspector (singleton): LazyStructOI(" "), LazyMapOI(":", "="), LazyArrayOI(":"), LazyStructOI("="), StandardIntegerOI, StandardStringOI
  5. LazyPrimitive
     • LazyString/LazyInteger
       • setAll(byte[] data, int start, int length)
         • LazyString: parse the data and create a String object
         • LazyInteger: parse the data and create an Integer object
       • getObject(): returns the corresponding String/Integer object
     • Future
       • Replace String/Integer with Text/IntWritable
       • The Text/IntWritable object is owned by the LazyString/LazyInteger object.
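As a rough illustration of the setAll/getObject contract for a lazy integer, here is a minimal sketch. The class LazyIntSketch is hypothetical and is not Hive's LazyInteger; it assumes the number is written as ASCII decimal digits with an optional sign, and it treats anything else as null. It reads the digits straight from the byte range, so no String is created and no UTF-8 decoding takes place.

```java
// Hypothetical, simplified stand-in for a lazy integer; not Hive's actual class.
final class LazyIntSketch {
    private int value;
    private boolean isNull = true;

    // Parse ASCII decimal digits directly from data[start .. start+length).
    void setAll(byte[] data, int start, int length) {
        isNull = true;
        if (length <= 0) return;
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') i++;
        if (i == end) return;                  // a sign with no digits
        long acc = 0;
        for (; i < end; i++) {
            int d = data[i] - '0';
            if (d < 0 || d > 9) return;        // non-digit: leave the value null
            acc = acc * 10 + d;
            if (acc > (long) Integer.MAX_VALUE + 1) return; // out of int range
        }
        long signed = negative ? -acc : acc;
        if (signed > Integer.MAX_VALUE) return;
        value = (int) signed;
        isNull = false;
    }

    // Returns the parsed Integer, or null if the bytes were not a valid int.
    Integer getObject() {
        return isNull ? null : Integer.valueOf(value);
    }
}
```

The same instance can be pointed at a new byte range for every row by calling setAll again, which is what allows one object to be reused across rows.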
  6. LazyNonPrimitive
     • LazyStruct/LazyArray/LazyMap
       • setAll(byte[] data, int start, int length)
         • Remember data, start and length, and set parsed to false.
       • getStructField/getArrayElement/getMapValue
         • If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value
         • For Struct/Array, do setAll on the corresponding LazyObject and return it
         • For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
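The deferred parsing described for LazyStruct can be sketched as follows. LazyStructSketch is a hypothetical, single-level simplification and not Hive's class: setAll only records the byte range, the first field access scans for separators and records field start positions, and getField decodes just the requested field. For brevity it returns a String directly instead of calling setAll on a child LazyObject, and the parseCount field exists only to make the laziness observable.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified lazy struct with string fields; not Hive's actual class.
final class LazyStructSketch {
    private final byte separator;
    private final int fieldCount;
    // Field i occupies [fieldStart[i], fieldStart[i+1] - 1); the extra slot is an end sentinel.
    private final int[] fieldStart;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    int parseCount; // exposed only so the laziness can be observed

    LazyStructSketch(byte separator, int fieldCount) {
        this.separator = separator;
        this.fieldCount = fieldCount;
        this.fieldStart = new int[fieldCount + 1];
    }

    // Remember the byte range; do no parsing yet.
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    // Scan once for separators and record where each field begins.
    private void parse() {
        int end = start + length;
        int field = 0;
        fieldStart[0] = start;
        for (int i = start; i < end && field < fieldCount - 1; i++) {
            if (data[i] == separator) {
                field++;
                fieldStart[field] = i + 1;
            }
        }
        // Fields missing from the row, and the sentinel, point one past the end.
        for (int f = field + 1; f <= fieldCount; f++) {
            fieldStart[f] = end + 1;
        }
        parsed = true;
        parseCount++;
    }

    // Decode only the requested field; returns null if the row has no such field.
    String getField(int i) {
        if (!parsed) parse();
        int len = fieldStart[i + 1] - fieldStart[i] - 1;
        if (len < 0) return null;
        return new String(data, fieldStart[i], len, StandardCharsets.UTF_8);
    }
}
```

With the row `a=av:b=bv 23 1:2=4:5 abcd` and a space as the separator (an assumption; the slide's actual separator character is not recoverable), getField(1) decodes only the bytes of `23`, and repeated accesses to the same row reuse the recorded positions.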
  7. Why another SerDe?
     • Functionality:
       • MetadataTypedColumnSetSerDe can only deal with String columns
       • DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
     • Efficiency:
       • Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
  8. Features of LazySimpleSerDe
     • Functionality:
       • Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
       • Fully support all nested types (Map Key must be primitive)
     • Efficiency:
       • Fully support lazy deserialization: only deserialize the field (and create Objects) when asked.
       • Reuse multiple levels of LazyObjects.
       • Read numbers without UTF-8 decoding
       • (TODO) Fully reuse objects: IntWritable for Integer, Text for String
       • (TODO) Write numbers without UTF-8 encoding
  9. Profiling result of a mapper
     17%: TrackedRecordReader (should include InputFileFormat and decompression)
     22%: Operator.close
     |- 12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
     |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
     50%: Operator.forward
     |- 18%: Text.decode (from LazySerDe)
     | |- 7%: CharacterSet.decode() (UTF-8 decoding)
     | |- 5%: toString() (where we create the string object)
     |- 3%: LazyStruct.parse (the code that searches for separators in the row)
     |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
     |- 8%: GroupByOperator.processHashAggr
     |- 3%: HashMap.get() in GroupByOperator
     * Performance Data from Rodrigo Schmidt
  10. TypeInfo String specification
     • Why not Thrift?
       • Hard to parse
     • Simple Syntax
       • Type: PrimitiveType | MapType | ArrayType | StructType
       • PrimitiveType: int | bigint | tinyint | smallint | double | string
       • MapType: map<Type, Type>
       • ArrayType: array<Type>
       • StructType: struct< [Name : Type]+ >
     • Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
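To show how little code the grammar on slide 10 needs, here is a small recursive-descent parser for it. TypeNode and TypeStringParser are hypothetical names and this is not Hive's own TypeInfo parser; the sketch assumes struct fields are separated by commas (as in the slide's example) and strips all whitespace before parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parse tree node: a primitive name, or "map", "array", "struct" with children.
final class TypeNode {
    final String kind;
    final List<String> fieldNames = new ArrayList<>(); // used by struct only
    final List<TypeNode> children = new ArrayList<>();
    TypeNode(String kind) { this.kind = kind; }

    // Prints the type back in the canonical, whitespace-free syntax.
    @Override public String toString() {
        if (children.isEmpty()) return kind;
        StringBuilder sb = new StringBuilder(kind).append('<');
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) sb.append(',');
            if (kind.equals("struct")) sb.append(fieldNames.get(i)).append(':');
            sb.append(children.get(i));
        }
        return sb.append('>').toString();
    }
}

// Hypothetical recursive-descent parser for the slide's grammar.
final class TypeStringParser {
    private static final Set<String> PRIMITIVES = new HashSet<>(
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string"));
    private final String s;
    private int pos;

    private TypeStringParser(String s) { this.s = s; }

    static TypeNode parse(String text) {
        TypeStringParser p = new TypeStringParser(text.replaceAll("\\s+", ""));
        TypeNode t = p.type();
        if (p.pos != p.s.length()) throw p.error("trailing input");
        return t;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNode type() {
        String word = word();
        if (PRIMITIVES.contains(word)) return new TypeNode(word);
        if (!word.equals("array") && !word.equals("map") && !word.equals("struct")) {
            throw error("unknown type '" + word + "'");
        }
        TypeNode node = new TypeNode(word);
        expect('<');
        if (word.equals("array")) {
            node.children.add(type());
        } else if (word.equals("map")) {
            node.children.add(type());
            expect(',');
            node.children.add(type());
        } else {
            do {
                node.fieldNames.add(word());
                expect(':');
                node.children.add(type());
            } while (accept(','));
        }
        expect('>');
        return node;
    }

    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) throw error("expected a name");
        return s.substring(begin, pos);
    }

    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) throw error("expected '" + c + "'");
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos + " in " + s);
    }
}
```

Calling TypeStringParser.parse on the slide's example returns a tree whose toString() reproduces the input, while a misspelled primitive name is rejected with an IllegalArgumentException.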
  11. Future of SerDe
     • HIVE-337 LazySimpleSerDe should support multi-level nested array, map, struct types (Done)
     • HIVE-136 SerDe should escape some special characters
     • HIVE-266 Improve SerDe performance by using Text instead of String
     • HIVE-352 Make Hive support column based storage (Yongqiang He)
     • HIVE-358 Short-circuiting serialization
     • Binary-format Lazy SerDe