Hive - SerDe and LazySerde

This is a description of the SerDe layer in the Hadoop Hive project.
LazySerDe is one particular implementation of the SerDe interface.



Presentation Transcript

  • Hive – SerDe and LazySerDe
    • Part of Apache Hadoop Hive Project
    • Data Infrastructure Team, Facebook Inc. (Slides by Zheng Shao)
  • Where is SerDe?
    • [Diagram: data flow through a Hive map-reduce job. Labels: File on HDFS, FileFormat / Hadoop Serialization, Writable, SerDe, Hierarchical Object, ObjectInspector, Hive Operator, Mapper, Map Output File, Reducer, User Script (connected by Stream), File on HDFS]
    • Example rows in a file: imp 1.0 3 54; Imp 0.2 1 33; clk 2.2 8 212; Imp 0.7 2 22; or thrift_record<…>
    • Example Writables: BytesWritable(\x3F\x64\x72\x00); Text(‘imp 1.0 3 54’) // UTF8 encoded
    • Kinds of Hierarchical Object:
      • Java Object: Object of a Java Class
      • Standard Object: Use ArrayList for struct and array, use HashMap for map
      • LazyObject: Lazily-deserialized
  • SerDe, ObjectInspector and TypeInfo
    • [Diagram: a SerDe converts between a Writable and a Hierarchical Object; ObjectInspectors and TypeInfo describe the object]
    • Writable: BytesWritable(\x3F\x64\x72\x00) or Text(‘ a=av:b=bv 23 1:2=4:5 abcd ’)
    • Hierarchical Object, either of:
      • class HO { HashMap<String, String> a, Integer b, List<ClassC> c, String d; } Class ClassC { Integer a, Integer b; }
      • List ( HashMap(“a” → “av”, “b” → “bv”), 23, List(List(1,null),List(2,4),List(5,null)), “abcd” )
    • TypeInfo: Struct of map (string, string), int, list (struct (int, int)), string
    • SerDe: deserialize, serialize, getOI
    • ObjectInspector1 (for the struct): getType, getFieldOI, getStructField
    • ObjectInspector2 (for HashMap<String, String> a, e.g. HashMap(“a” → “av”, “b” → “bv”)): getType, getMapValueOI, getMapValue
    • ObjectInspector3 (for a String Object, e.g. “av”): getType
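The Standard Object form of the example row on this slide can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the separators (space, ':' and '=') are read off the printable example text, and String.split() is used for brevity even though later slides point out that it is slow on long rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Hive. */
final class StandardObjectSketch {

    /** Turns "a=av:b=bv 23 1:2=4:5 abcd" into List(HashMap, Integer, List of List, String). */
    static List<Object> deserialize(String row) {
        String[] fields = row.split(" ", -1);           // level 1: struct fields

        Map<String, String> a = new HashMap<>();
        for (String entry : fields[0].split(":")) {     // level 2: map entries
            String[] kv = entry.split("=", 2);          // level 3: key and value
            a.put(kv[0], kv.length > 1 ? kv[1] : null);
        }

        Integer b = Integer.valueOf(fields[1]);

        List<List<Integer>> c = new ArrayList<>();
        for (String element : fields[2].split(":")) {   // level 2: array elements
            String[] ab = element.split("=", 2);        // level 3: fields of the nested struct
            Integer second = ab.length > 1 ? Integer.valueOf(ab[1]) : null; // missing field becomes null
            c.add(Arrays.asList(Integer.valueOf(ab[0]), second));
        }

        // A struct is represented as a list of its field values, in field order.
        return new ArrayList<Object>(Arrays.<Object>asList(a, b, c, fields[3]));
    }
}
```

In this representation the struct has no field names of its own; that information would live in the ObjectInspector or TypeInfo that accompanies the object.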
  • LazySimpleSerDe components
    • [Diagram: the LazyObject tree and the matching ObjectInspector tree for byte[](‘ a=av:b=bv 23 1:2=4:5 abcd ’)]
    • Hierarchical Object / LazyObject (one per SerDe instance), over byte[] data:
      • LazyStruct
        • LazyMap: LazyString keys and LazyString values
        • LazyInteger
        • LazyArray of LazyStruct, each with LazyInteger fields
        • LazyString
    • LazyObjectInspector (singleton):
      • LazyStructOI(“ ”)
        • LazyMapOI(“:”, “=”) with StandardStringOI for keys and values
        • StandardIntegerOI
        • LazyArrayOI(“:”) of LazyStructOI(“=”) with StandardIntegerOI fields
        • StandardStringOI
  • LazyPrimitive
    • LazyString/LazyInteger
      • setAll(byte[] data, int start, int length)
        • LazyString: parse the data and create a String object
        • LazyInteger: parse the data and create an Integer object
      • getObject() – returns the corresponding String/Integer object
    • Future
      • Replace String/Integer with Text/IntWritable
      • The Text/IntWritable object is owned by the LazyString/LazyInteger object.
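A minimal Java sketch of the LazyInteger idea follows. It is not Hive's actual class: the name LazyIntegerSketch and the choice to return null for malformed input are assumptions made for illustration. It shows how setAll can read an integer straight from the byte range, without first decoding the bytes into a String.

```java
/** Hypothetical sketch of a lazy integer holder; not Hive's LazyInteger. */
final class LazyIntegerSketch {
    private Integer value; // null if the bytes do not form a valid int

    /** Parses the ASCII digits in data[start, start + length). */
    void setAll(byte[] data, int start, int length) {
        value = parse(data, start, length);
    }

    /** Returns the Integer produced by the last setAll call. */
    Integer getObject() {
        return value;
    }

    private static Integer parse(byte[] data, int start, int length) {
        if (length <= 0) {
            return null;
        }
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
        }
        if (i == end) {
            return null; // a sign with no digits
        }
        long acc = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                return null; // not a decimal digit
            }
            acc = acc * 10 + digit;
            if (acc > (long) Integer.MAX_VALUE + 1) {
                return null; // overflow
            }
        }
        if (negative) {
            acc = -acc;
        }
        if (acc > Integer.MAX_VALUE) {
            return null; // +2147483648 does not fit in an int
        }
        return (int) acc;
    }
}
```

For example, with the row bytes of "imp 1.0 3 54", calling setAll(row, 10, 2) and then getObject() yields 54; the same instance can be reused for the next row.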
  • LazyNonPrimitive
    • LazyStruct/LazyArray/LazyMap
      • setAll(byte[] data, int start, int length)
        • Remember data , start and length , and set parsed to false.
      • getStructField/getArrayElement/getMapValue
        • If not parsed yet, parse the byte and remember starting positions of each field/element/key/value
        • For Struct/Array, do setAll on the corresponding LazyObject and return it
        • For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
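The lazy parsing described on this slide can be sketched for the struct case as below. The class name, the parseCalls counter and the decision to return each field as a String are assumptions for illustration; per the slide, the real design would instead call setAll on a reusable child LazyObject and return that.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of a lazily parsed struct; not Hive's LazyStruct. */
final class LazyStructSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    private int[] starts = new int[8]; // starts[i] = first byte of field i; last entry is a sentinel
    private int fieldCount;
    int parseCalls;                    // only here to make the laziness observable

    LazyStructSketch(byte separator) {
        this.separator = separator;
    }

    /** Remembers the byte range and defers all parsing. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Returns field i as text, or null if the row has no such field. */
    String getStructField(int i) {
        if (!parsed) {
            parse();
        }
        if (i < 0 || i >= fieldCount) {
            return null;
        }
        int from = starts[i];
        int len = starts[i + 1] - 1 - from; // the byte before the next field is the separator
        return new String(data, from, len, StandardCharsets.UTF_8);
    }

    /** One pass over the bytes that records where each field begins. */
    private void parse() {
        parseCalls++;
        int end = start + length;
        int n = 0;
        starts[n++] = start;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) {
                if (n + 1 >= starts.length) {
                    starts = Arrays.copyOf(starts, starts.length * 2);
                }
                starts[n++] = i + 1;
            }
        }
        starts[n] = end + 1; // sentinel, as if a separator followed the last field
        fieldCount = n;
        parsed = true;
    }
}
```

Nothing is scanned until the first getStructField call, and a row whose fields are never requested costs only the three assignments in setAll.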
  • Why another SerDe?
    • Functionality:
      • MetadataTypedColumnSetSerDe can only deal with String columns
      • DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
    • Efficiency:
      • Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
  • Features of LazySimpleSerDe
    • Functionality:
      • Fully compatible with MetadataTypedColumnSetSerDe and DynamicSerDe/TCTLSeparatedProtocol
      • Fully support all nested types (Map Key must be primitive)
    • Efficiency:
      • Fully support lazy deserialization - only deserialize the field (and create Objects) when asked.
      • Reuse multiple-levels of LazyObjects.
      • Read numbers without UTF-8 decoding
      • (TODO) Fully reuse objects - IntWritable for Integer, Text for String
      • (TODO) Write numbers without UTF-8 encoding
  • Profiling result of a mapper
    • 17%: TrackedRecordReader (should include InputFileFormat and decompression)
    • 22%: Operator.close
      • 12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
      • 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
    • 50%: Operator.forward
      • 18%: Text.decode (from LazySerDe)
        • 7%: CharacterSet.decode() (UTF-8 decoding)
        • 5%: toString() (where we create the string object)
      • 3%: LazyStruct.parse (the code that searches for separators in the row)
      • 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
      • 8%: GroupByOperator.processHashAggr
      • 3%: HashMap.get() in GroupByOperator
    • * Performance Data from Rodrigo Schmidt
  • TypeInfo String specification
    • Why not Thrift?
      • Hard to parse
    • Simple Syntax
      • Type: PrimitiveType | MapType | ArrayType | StructType
      • PrimitiveType: int | bigint | tinyint | smallint | double | string
      • MapType: map<Type, Type>
      • ArrayType: array<Type>
      • StructType: struct< [Name : Type]+ >
    • Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
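To illustrate how easy this syntax is to parse, here is a small recursive-descent sketch in Java. It is not Hive's own parser: the class names TypeNode and TypeStringParser are hypothetical, struct fields are assumed to be comma-separated (as in the example on this slide), and whitespace inside the type string is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical parse-tree node for a type string. */
final class TypeNode {
    final String kind;                                   // a primitive name, "map", "array" or "struct"
    final List<String> fieldNames = new ArrayList<>();   // filled for struct only
    final List<TypeNode> children = new ArrayList<>();

    TypeNode(String kind) {
        this.kind = kind;
    }
}

/** Hypothetical recursive-descent parser for the grammar on this slide. */
final class TypeStringParser {
    private static final List<String> PRIMITIVES =
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");

    private final String s;
    private int pos;

    private TypeStringParser(String s) {
        this.s = s;
    }

    static TypeNode parse(String text) {
        TypeStringParser p = new TypeStringParser(text);
        TypeNode root = p.type();
        if (p.pos != p.s.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return root;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNode type() {
        String word = word();
        TypeNode node = new TypeNode(word);
        switch (word) {
            case "map":                                  // map<Type, Type>
                expect('<');
                node.children.add(type());
                expect(',');
                node.children.add(type());
                expect('>');
                break;
            case "array":                                // array<Type>
                expect('<');
                node.children.add(type());
                expect('>');
                break;
            case "struct":                               // struct<Name:Type, ...>
                expect('<');
                do {
                    node.fieldNames.add(word());
                    expect(':');
                    node.children.add(type());
                } while (accept(','));
                expect('>');
                break;
            default:
                if (!PRIMITIVES.contains(word)) {
                    throw new IllegalArgumentException("unknown type: " + word);
                }
        }
        return node;
    }

    private String word() {
        int from = pos;
        while (pos < s.length() && Character.isLetterOrDigit(s.charAt(pos))) {
            pos++;
        }
        if (from == pos) {
            throw new IllegalArgumentException("expected a name at " + pos);
        }
        return s.substring(from, pos);
    }

    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
    }
}
```

Each grammar rule maps to one branch of the switch, and a misspelled primitive name is rejected with an IllegalArgumentException rather than silently accepted.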
  • Future of SerDe
    • HIVE-337 LazySimpleSerDe should support multi-level nested array, map, struct types (Done)
    • HIVE-136 SerDe should escape some special characters
    • HIVE-266 Improve SerDe performance by using Text instead of String
    • HIVE-352 Make Hive support column based storage (Yongqiang He)
    • HIVE-358 Short-circuiting serialization
    • Binary-format Lazy SerDe