Hive - SerDe and LazySerDe

This is a description of the SerDe layer in the Hadoop Hive project.
LazySerDe is a particular implementation of the SerDe interface.

Published in: Technology, Travel

Transcript

  • 1. Hive – SerDe and LazySerDe
    • Part of Apache Hadoop Hive Project http://hadoop.apache.org/hive
    • Data Infrastructure Team, Facebook Inc. (Slides by Zheng Shao)
  • 2. Where is SerDe?
    • [Diagram: data flow through a Hive Mapper and Reducer]
    • Layers: FileFormat / Hadoop Serialization, SerDe, ObjectInspector
    • Stages: File on HDFS, Writable, Hierarchical Object, Hive Operator, Stream, User Script, Map Output File
    • Example rows: imp 1.0 3 54; Imp 0.2 1 33; clk 2.2 8 212; Imp 0.7 2 22; thrift_record<…>
    • Example Writables: BytesWritable(\x3F\x64\x72\x00); Text(‘imp 1.0 3 54’) // UTF8 encoded
    • Kinds of Hierarchical Object:
      • Java Object: Object of a Java Class
      • Standard Object: Use ArrayList for struct and array, use HashMap for map
      • LazyObject: Lazily-deserialized
  • 3. SerDe, ObjectInspector and TypeInfo
    • [Diagram: one row shown as a Writable, as a Hierarchical Object, and as a TypeInfo]
    • Writable: BytesWritable(\x3F\x64\x72\x00) or Text(‘ a=av:b=bv 23 1:2=4:5 abcd ’)
    • TypeInfo: Struct of map (string, string), int, list of struct (int, int), string
    • Hierarchical Object as a Java class: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } class ClassC { Integer a; Integer b; }
    • Hierarchical Object as a standard object: List ( HashMap(“a” → “av”, “b” → “bv”), 23, List(List(1,null),List(2,4),List(5,null)), “abcd” )
    • SerDe: deserialize, serialize, getOI
    • ObjectInspector1: getType, getFieldOI, getStructField (returns the field HashMap<String, String> a, that is HashMap(“a” → “av”, “b” → “bv”))
    • ObjectInspector2: getType, getMapValueOI, getMapValue (returns “av”)
    • ObjectInspector3: getType
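
The inspector chain on this slide can be sketched with a few small interfaces. Everything below is a hypothetical simplification for illustration (the names Inspector, StructInspector, MapInspector, PrimitiveInspector, StandardMapInspector, StandardStructInspector and InspectorWalk are assumptions, not Hive's real ObjectInspector API). The point it tries to show is that the caller reaches “av” through inspector calls only, without knowing that the struct is a List and the map is a HashMap.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified inspectors; Hive's real ObjectInspector API is richer. */
interface Inspector {
    String getType();
}

interface StructInspector extends Inspector {
    Inspector getFieldOI(int index);
    Object getStructField(Object struct, int index);
}

interface MapInspector extends Inspector {
    Inspector getMapValueOI();
    Object getMapValue(Object map, Object key);
}

/** Inspector for a primitive such as "int" or "string". */
final class PrimitiveInspector implements Inspector {
    private final String type;
    PrimitiveInspector(String type) { this.type = type; }
    public String getType() { return type; }
}

/** Inspects a map held as java.util.Map with string keys. */
final class StandardMapInspector implements MapInspector {
    private final Inspector valueOI;
    StandardMapInspector(Inspector valueOI) { this.valueOI = valueOI; }
    public String getType() { return "map<string," + valueOI.getType() + ">"; }
    public Inspector getMapValueOI() { return valueOI; }
    public Object getMapValue(Object map, Object key) { return ((Map<?, ?>) map).get(key); }
}

/** Inspects a struct held as java.util.List, one element per field. */
final class StandardStructInspector implements StructInspector {
    private final List<Inspector> fieldOIs;
    StandardStructInspector(List<Inspector> fieldOIs) { this.fieldOIs = fieldOIs; }
    public String getType() { return "struct"; }
    public Inspector getFieldOI(int index) { return fieldOIs.get(index); }
    public Object getStructField(Object struct, int index) { return ((List<?>) struct).get(index); }
}

final class InspectorWalk {
    /** Looks up key in the map stored in struct field 0, using only inspector calls. */
    static Object mapValueOfFirstField(Object row, StructInspector rowOI, Object key) {
        Object field = rowOI.getStructField(row, 0);             // ObjectInspector1 step
        MapInspector mapOI = (MapInspector) rowOI.getFieldOI(0); // obtain ObjectInspector2
        return mapOI.getMapValue(field, key);                    // ObjectInspector2 step
    }
}
```

Swapping in a different StructInspector implementation (for example one that reads fields of a Java class, or a lazy one) would leave mapValueOfFirstField unchanged, which is the separation the slide appears to be illustrating.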
  • 4. LazySimpleSerDe components
    • [Diagram: three layers for the example row byte[](‘ a=av:b=bv 23 1:2=4:5 abcd ’)]
    • byte[] data
    • Hierarchical Object / LazyObject (One Per SerDe instance): LazyStruct, LazyMap, LazyArray, LazyInteger, LazyString
    • LazyObjectInspector (Singleton): LazyStructOI(“ ”), LazyMapOI(“:”, “=”), LazyArrayOI(“:”), LazyStructOI(“=”), StandardIntegerOI, StandardStringOI
  • 5. LazyPrimitive
    • LazyString/LazyInteger
      • setAll(byte[] data, int start, int length)
        • LazyString: parse the data and create a String object
        • LazyInteger: parse the data and create an Integer object
      • getObject() – returns the corresponding String/Integer object
    • Future
      • Replace String/Integer with Text/IntWritable
      • The Text/IntWritable object is owned by the LazyString/LazyInteger object.
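
A minimal sketch of the setAll/getObject contract for the integer case, assuming a hypothetical class LazyIntSketch (this is not Hive's LazyInteger). It reads ASCII digits straight from the byte range, so no intermediate String is created, which is one way to read numbers without UTF-8 decoding. Returning null for malformed input is also an assumption of this sketch.

```java
/** Hypothetical stand-in for a lazy integer; not Hive's actual LazyInteger class. */
final class LazyIntSketch {
    private Integer value; // null when the bytes are not a valid int

    /** Parses the decimal integer stored in data[start, start + length). */
    void setAll(byte[] data, int start, int length) {
        value = parse(data, start, length);
    }

    /** Returns the parsed Integer, or null if the bytes were not a valid int. */
    Integer getObject() {
        return value;
    }

    private static Integer parse(byte[] data, int start, int length) {
        if (length <= 0) {
            return null;
        }
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
        }
        if (i == end) {
            return null; // a sign with no digits
        }
        long acc = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                return null; // not an ASCII digit
            }
            acc = acc * 10 + digit;
            if (acc > (long) Integer.MAX_VALUE + 1) {
                return null; // magnitude too large for an int
            }
        }
        long signed = negative ? -acc : acc;
        if (signed > Integer.MAX_VALUE) {
            return null;
        }
        return (int) signed;
    }
}
```

One LazyIntSketch instance can be reused across rows by calling setAll again with a new offset, which matches the object-reuse goal listed under Future.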
  • 6. LazyNonPrimitive
    • LazyStruct/LazyArray/LazyMap
      • setAll(byte[] data, int start, int length)
        • Remember data, start and length, and set parsed to false.
      • getStructField/getArrayElement/getMapValue
        • If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value
        • For Struct/Array, do setAll on the corresponding LazyObject and return it
        • For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
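
The struct case of the steps above can be sketched as follows. LazyStructSketch is a hypothetical class, not Hive's LazyStruct: it returns each field as a String rather than a nested LazyObject, it uses a space as the field separator, and the parseCalls counter exists only to make the deferred parsing visible. All three are assumptions made to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical lazy struct; illustrates the idea, not Hive's actual LazyStruct. */
final class LazyStructSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    // fieldStarts[i] is the offset of field i; one extra sentinel entry marks the end.
    private int[] fieldStarts = new int[8];
    private int fieldCount;
    int parseCalls; // exposed only to show that parsing is deferred

    LazyStructSketch(byte separator) {
        this.separator = separator;
    }

    /** Remembers the byte range and marks it as not parsed; no scanning happens here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Parses on first access, then decodes only the requested field. */
    String getField(int index) {
        if (!parsed) {
            parse();
        }
        if (index < 0 || index >= fieldCount) {
            return null;
        }
        int from = fieldStarts[index];
        int to = fieldStarts[index + 1] - 1; // exclude the separator
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }

    /** Single pass over the bytes that records where each field begins. */
    private void parse() {
        parseCalls++;
        fieldCount = 0;
        int end = start + length;
        record(start);
        for (int i = start; i < end; i++) {
            if (data[i] == separator) {
                record(i + 1);
            }
        }
        ensureCapacity(fieldCount + 1);
        fieldStarts[fieldCount] = end + 1; // sentinel: as if a separator followed the last field
        parsed = true;
    }

    private void record(int position) {
        ensureCapacity(fieldCount + 1);
        fieldStarts[fieldCount++] = position;
    }

    private void ensureCapacity(int needed) {
        if (needed > fieldStarts.length) {
            fieldStarts = Arrays.copyOf(fieldStarts, needed * 2);
        }
    }
}
```

Calling setAll on the same instance for the next row reuses the object and the fieldStarts array, so a query that touches one column pays for one separator scan per row and one decode, not a split of the whole row.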
  • 7. Why another SerDe?
    • Functionality:
      • MetadataTypedColumnSetSerDe can only deal with String columns
      • DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
    • Efficiency:
      • Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
  • 8. Features of LazySimpleSerDe
    • Functionality:
      • Fully compatible with MetadataTypedColumnSetSerDe and DynamicSerDe/TCTLSeparatedProtocol
      • Fully support all nested types (Map Key must be primitive)
    • Efficiency:
      • Fully support lazy deserialization - only deserialize the field (and create Objects) when asked.
      • Reuse multiple levels of LazyObjects.
      • Read numbers without UTF-8 decoding
      • (TODO) Fully reuse objects - IntWritable for Integer, Text for String
      • (TODO) Write numbers without UTF-8 encoding
  • 9. Profiling result of a mapper
    • 17%: TrackedRecordReader (should include InputFileFormat and decompression)
    • 22%: Operator.close
    • |- 12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
    • |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
    • 50%: Operator.forward
    • |- 18%: Text.decode (from LazySerDe)
    • | |- 7%: CharacterSet.decode() (UTF-8 decoding)
    • | |- 5%: toString()  (where we create the string object)
    • |- 3%: LazyStruct.parse (the code that searches for separators in the row)
    • |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
    • |- 8%: GroupByOperator.processHashAggr
    • |- 3%: HashMap.get() in GroupByOperator
    • * Performance Data from Rodrigo Schmidt
  • 10. TypeInfo String specification
    • Why not Thrift?
      • Hard to parse
    • Simple Syntax
      • Type: PrimitiveType | MapType | ArrayType | StructType
      • PrimitiveType: int | bigint | tinyint | smallint | double | string
      • MapType: map<Type, Type>
      • ArrayType: array<Type>
      • StructType: struct< [Name : Type]+ >
    • Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
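
To illustrate how simple this syntax is to parse, here is a small recursive-descent sketch. TypeStringParser and its nested Type class are hypothetical names, not Hive's own parsing code. The sketch assumes struct fields are separated by commas (as in the example) and ignores whitespace. It accepts only the six primitive names listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical recursive-descent parser for the type strings above; not Hive's own code. */
final class TypeStringParser {
    /** Parsed type: a category plus child types; struct children also carry field names. */
    static final class Type {
        final String category; // a primitive name, "map", "array" or "struct"
        final List<String> fieldNames = new ArrayList<>(); // used by struct only
        final List<Type> children = new ArrayList<>();

        Type(String category) {
            this.category = category;
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return category;
            }
            StringBuilder sb = new StringBuilder(category).append('<');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                if (category.equals("struct")) {
                    sb.append(fieldNames.get(i)).append(':');
                }
                sb.append(children.get(i));
            }
            return sb.append('>').toString();
        }
    }

    private static final List<String> PRIMITIVES =
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");

    private final String s;
    private int pos;

    private TypeStringParser(String s) {
        this.s = s;
    }

    /** Parses a complete type string; throws IllegalArgumentException on bad input. */
    static Type parse(String text) {
        TypeStringParser p = new TypeStringParser(text.replaceAll("\\s+", ""));
        Type t = p.type();
        if (p.pos != p.s.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return t;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private Type type() {
        String name = word();
        switch (name) {
            case "map": {
                Type t = new Type("map");
                expect('<');
                t.children.add(type());
                expect(',');
                t.children.add(type());
                expect('>');
                return t;
            }
            case "array": {
                Type t = new Type("array");
                expect('<');
                t.children.add(type());
                expect('>');
                return t;
            }
            case "struct": {
                Type t = new Type("struct");
                expect('<');
                do {
                    t.fieldNames.add(word());
                    expect(':');
                    t.children.add(type());
                } while (accept(','));
                expect('>');
                return t;
            }
            default:
                if (!PRIMITIVES.contains(name)) {
                    throw new IllegalArgumentException("unknown type: " + name);
                }
                return new Type(name);
        }
    }

    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) {
            throw new IllegalArgumentException("expected a name at " + pos);
        }
        return s.substring(begin, pos);
    }

    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
    }
}
```

TypeStringParser.parse returns a tree whose toString() reproduces the normalized type string, so a round trip is a quick way to check a specification. A misspelled primitive name is rejected with an IllegalArgumentException.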
  • 11. Future of SerDe
    • HIVE-337 LazySimpleSerDe should support multi-level nested array, map, struct types (Done)
    • HIVE-136 SerDe should escape some special characters
    • HIVE-266 Improve SerDe performance by using Text instead of String
    • HIVE-352 Make Hive support column based storage (Yongqiang He)
    • HIVE-358 Short-circuiting serialization
    • Binary-format Lazy SerDe