This set of slides describes the efficient Java object model in Hive.

  • 1. Efficient Object Model in Java ▪ Slides by Zheng Shao, Facebook ▪ Part of the Apache Hadoop Hive Project
  • 2. Object Inspector
  • 3. On-disk Data Format ▪ Single on-disk format systems ▪ Simplicity ▪ Multiple on-disk format systems ▪ Ease-of-use ▪ Ease-of-integration ▪ Flexibility: better trade-off between space, performance, etc. ▪ Hive allows multiple on-disk formats
  • 4. Example Multiple On-disk Formats ▪ File format: ▪ Row-based ▪ Column-based ▪ Block-based ▪ Row format: ▪ Text-based ▪ Binary-based ▪ Customized ▪ Index format
  • 5. In-memory Data Format ▪ Single in-memory format systems ▪ Simplicity: simpler code ▪ Multiple in-memory format systems ▪ Ease-of-integration: other systems may use their own formats ▪ Performance: ▪ Multiple on-disk format/external format + efficient loading => multiple in-memory format ▪ Hive allows multiple in-memory formats
  • 6. Example Multiple In-memory Formats ▪ Integer: ▪ Integer ▪ IntWritable ▪ LazyInteger ▪ String: ▪ String ▪ Text
  • 7. Multiple In-memory Format Design Patterns ▪ Object-oriented: ▪ A single interface/base class for Integer ▪ Multiple derived classes ▪ Delegation: ▪ data stored in object ▪ format/operations stored in objectInspector ▪ a pair of object and objectInspector represents a data unit ▪ It's possible to wrap either one up to conform to the other's pattern.
  • 8. Multiple In-memory Format Design Patterns ▪ In OO, we need an interface HiveInteger to represent Integers ▪ Make Integer, IntWritable classes all implement it. ▪ However, Integer class is final (not extendable) and does not implement HiveInteger ▪ We need to do a conversion every time we exchange data with UDF, SerDe (Thrift), or other libraries (unless they know HiveInteger; this is a bad assumption to make in open systems). ▪ Delegation will be a better idea because ▪ For Integer, we have a JavaIntegerObjectInspector ▪ For IntWritable, we have a WritableIntegerObjectInspector ▪ We convert params and return values only if necessary
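The delegation idea on slides 7 and 8 can be illustrated with a minimal sketch. All names here (`IntInspector`, `IntBox`, `JavaIntInspector`, `BoxIntInspector`, `DelegationSketch`) are hypothetical and are not Hive's actual classes; `IntBox` merely stands in for a mutable holder such as IntWritable so the example has no Hadoop dependency. The point is that the caller never converts either value: each object travels with an inspector that knows how to read it.

```java
/** Hypothetical inspector interface: knows how to read an int out of some object. */
interface IntInspector {
    int getInt(Object o);
}

/** Hypothetical stand-in for a mutable, reusable int holder such as IntWritable. */
final class IntBox {
    int value;
    IntBox(int value) { this.value = value; }
}

/** Inspector for plain java.lang.Integer objects (which are final and cannot be extended). */
final class JavaIntInspector implements IntInspector {
    public int getInt(Object o) { return (Integer) o; }
}

/** Inspector for IntBox objects. */
final class BoxIntInspector implements IntInspector {
    public int getInt(Object o) { return ((IntBox) o).value; }
}

final class DelegationSketch {
    /** Adds two values of possibly different in-memory formats without converting either. */
    static int add(Object a, IntInspector aOI, Object b, IntInspector bOI) {
        return aOI.getInt(a) + bOI.getInt(b);
    }

    public static void main(String[] args) {
        int sum = add(Integer.valueOf(40), new JavaIntInspector(),
                      new IntBox(2), new BoxIntInspector());
        System.out.println(sum); // prints 42
    }
}
```

Supporting a further in-memory format in this sketch only requires one more inspector class; neither `Integer` nor `IntBox` has to implement a shared interface.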
  • 9. Delegation Method List ▪ General methods: ▪ isNull(object o) ▪ hashCode(object o) ▪ compare(object o) ▪ clone(object o) ▪ Primitive Objects: ▪ primitive getValue(object o) ▪ String Objects: ▪ String getString(object o) ▪ Text getText(object o) ▪ List Objects: ▪ getListSize(object o) ▪ getListElement(object o) ▪ getList(object o) ▪ Map Objects: ▪ getMapSize(object o) ▪ getValueForKey(object o) ▪ getMap(object o) ▪ Struct Objects: ▪ getStructField(object o) ▪ getStructAsAList(object o)
  • 10. SerDe
  • 11. Where is SerDe? ▪ [Diagram] Hive Operators (Mapper, Reducer) access Hierarchical Objects through ObjectInspector ▪ Hierarchical Object kinds: Standard Object (use ArrayList for struct and HashMap for map), LazyObject (lazily-deserialized), Java Object (object of a Java class) ▪ SerDe sits between Hierarchical Objects and Writables, e.g. Text('imp 1.0 3 54') // UTF8 encoded, BytesWritable(\x3F\x64\x72\x00) ▪ FileFormat / Hadoop Serialization sits between Writables and streams: File on HDFS, Map Output File, User Script ▪ Sample records: thrift_record<…>; imp 1.0 3 54; imp 0.2 1 33; clk 2.2 8 212; imp 0.7 2 22
  • 12. SerDe, ObjectInspector and TypeInfo ▪ [Diagram] Example Java classes: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } class ClassC { Integer a; Integer b; } ▪ TypeInfo tree: struct containing map (string, string), int, list, string ▪ Hierarchical Object: Struct List(HashMap("a" → "av", "b" → "bv"), 23, List(List(1,null),List(2,4),List(5,null)), "abcd") ▪ ObjectInspector1: getType, getStructField, getFieldOI ▪ ObjectInspector2: getType, getMapValue, getMapValueOI ▪ ObjectInspector3: getType ▪ SerDe: serialize, deserialize, getOI ▪ Writable: Text('a=av:b=bv 23 1:2=4:5 abcd'), BytesWritable(\x3F\x64\x72\x00)
  • 13. LazySimpleSerDe components ▪ [Diagram] byte[] data: 'a=av:b=bv 23 1:2=4:5 abcd' ▪ Top-level LazyStruct with LazyStructOI(" "), containing LazyMap, LazyInteger, LazyArray, LazyString ▪ LazyMap with LazyMapOI(":", "="): LazyString keys and values with LazyStringOI ▪ LazyArray with LazyArrayOI(":"): LazyStruct elements with LazyStructOI("="), each holding LazyInteger fields ▪ LazyInteger with LazyIntegerOI; StandardIntegerOI also shown ▪ Legend: Hierarchical Object / LazyObject: one per SerDe instance ▪ LazyObjectInspector: singleton
  • 14. LazyPrimitive ▪ LazyString/LazyInteger ▪ setAll(byte[] data, int start, int length) ▪ LazyString: parse the data and create a String object ▪ LazyInteger: parse the data and create an Integer object ▪ getObject(): returns the corresponding String/Integer object ▪ Future ▪ Replace String/Integer with Text/IntWritable ▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
  • 15. LazyNonPrimitive ▪ LazyStruct/LazyArray/LazyMap ▪ setAll(byte[] data, int start, int length) ▪ Remember data, start and length, and set parsed to false. ▪ getStructField/getArrayElement/getMapValue ▪ If not parsed yet, parse the bytes and remember the starting positions of each field/element/key/value ▪ For Struct/Array, do setAll on the corresponding LazyObject and return it ▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
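The setAll / parse-on-first-access behaviour described for LazyNonPrimitive can be sketched roughly as follows. This is an illustration under stated assumptions, not LazySimpleSerDe's real code: the class `LazyRowSketch`, its `parseCount` counter, and the choice to return a `String` per field (instead of calling setAll on a child LazyObject, as the slide describes) are simplifications made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lazily parsed row: field boundaries are located only on first access. */
final class LazyRowSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    /** Start offset of each field, plus one sentinel entry after the last field. */
    private final List<Integer> fieldStarts = new ArrayList<>();
    /** Number of times parse() actually ran; only here to make the laziness observable. */
    int parseCount;

    LazyRowSketch(byte separator) { this.separator = separator; }

    /** Remembers the byte range and marks it unparsed; no scanning happens here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Scans the bytes once and records where each field begins. */
    private void parse() {
        fieldStarts.clear();
        fieldStarts.add(start);
        int end = start + length;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) {
                fieldStarts.add(i + 1);
            }
        }
        fieldStarts.add(end + 1); // sentinel, as if a separator followed the last field
        parsed = true;
        parseCount++;
    }

    /** Returns the requested field as text, or null if the index is out of range. */
    String getField(int index) {
        if (!parsed) {
            parse();
        }
        if (index < 0 || index >= fieldStarts.size() - 1) {
            return null;
        }
        int from = fieldStarts.get(index);
        int to = fieldStarts.get(index + 1) - 1; // position of the separator (or end of range)
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.UTF_8);
        LazyRowSketch lazy = new LazyRowSketch((byte) ' ');
        lazy.setAll(row, 0, row.length);
        System.out.println(lazy.getField(1)); // prints 23
        System.out.println(lazy.getField(3)); // prints abcd
    }
}
```

In this sketch a row whose fields are never requested costs only the three assignments in `setAll`, and repeated field accesses on the same row share a single scan.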
  • 16. Why another SerDe? ▪ Functionality: ▪ MetadataTypedColumnSetSerDe can only deal with String columns ▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet. ▪ Efficiency: ▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
  • 17. Features of LazySimpleSerDe ▪ Functionality: ▪ Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated ▪ Fully supports all nested types (Map key must be primitive) ▪ Efficiency: ▪ Fully supports lazy deserialization - only deserialize the field (and create Objects) when asked. ▪ Reuse multiple levels of LazyObjects. ▪ Read numbers without UTF-8 decoding ▪ (TODO) Fully reuse objects - IntWritable for Integer, Text for String ▪ (TODO) Write numbers without UTF-8 encoding
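To make "read numbers without UTF-8 decoding" concrete, here is one possible sketch that turns ASCII digit bytes straight into an int, with no intermediate String. The class and method names (`ByteIntParser.parseInt`) are assumptions for illustration; the slides do not show how LazyInteger actually does this.

```java
/** Hypothetical helper: parses a decimal int directly from a byte range. */
final class ByteIntParser {
    static int parseInt(byte[] data, int start, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty input");
        }
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("sign without digits");
        }
        long value = 0; // accumulate in a long so overflow can be detected
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) value;
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(parseInt(row, 10, 2)); // prints 23
    }
}
```

Because ASCII digits are single bytes in UTF-8, the loop can work on the raw buffer; the profiling slide that follows attributes a noticeable share of mapper time to decoding and String creation, which this kind of parsing avoids for numeric fields.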
  • 18. Profiling result of a mapper ▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression) ▪ 22%: Operator.close ▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding) ▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat) ▪ 50%: Operator.forward ▪ |-18%: Text.decode (from LazySerDe) ▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding) ▪ | |- 5%: toString() (where we create the string object) ▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row) ▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData) ▪ |- 8%: GroupByOperator.processHashAggr ▪ |- 3%: HashMap.get() in GroupByOperator ▪ * Performance data from Rodrigo Schmidt
  • 19. TypeInfo String specification ▪ Why not Thrift? ▪ Hard to parse ▪ Simple Syntax ▪ Type: PrimitiveType | MapType | ArrayType | StructType ▪ PrimitiveType: int | bigint | tinyint | smallint | double | string ▪ MapType: map<Type, Type> ▪ ArrayType: array<Type> ▪ StructType: struct< [Name : Type]+ > ▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
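The grammar on slide 19 is small enough for a recursive-descent parser, sketched below. `TypeStringParser` and its `describe` method are hypothetical names and not Hive's TypeInfo utilities. Two assumptions go beyond the slide: struct fields are taken to be comma-separated (the grammar shows only `[Name : Type]+`, but the example uses commas), and spaces are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical recursive-descent parser for the type-string grammar on slide 19. */
final class TypeStringParser {
    private static final Set<String> PRIMITIVES = new HashSet<>(
            Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string"));

    private final String s;
    private int pos;

    private TypeStringParser(String s) { this.s = s; }

    /** Parses a type string and returns a readable description of its structure. */
    static String describe(String typeString) {
        TypeStringParser p = new TypeStringParser(typeString.replace(" ", ""));
        String result = p.parseType();
        if (p.pos != p.s.length()) {
            throw new IllegalArgumentException("unexpected trailing input at offset " + p.pos);
        }
        return result;
    }

    private String parseType() {
        String word = readWord();
        switch (word) {
            case "map": {
                expect('<');
                String key = parseType();
                expect(',');
                String value = parseType();
                expect('>');
                return "Map(" + key + " -> " + value + ")";
            }
            case "array": {
                expect('<');
                String element = parseType();
                expect('>');
                return "Array(" + element + ")";
            }
            case "struct": {
                expect('<');
                List<String> fields = new ArrayList<>();
                do { // one or more Name:Type pairs, assumed comma-separated
                    String name = readWord();
                    expect(':');
                    fields.add(name + ": " + parseType());
                } while (tryConsume(','));
                expect('>');
                return "Struct(" + String.join(", ", fields) + ")";
            }
            default:
                if (!PRIMITIVES.contains(word)) {
                    throw new IllegalArgumentException("unknown type '" + word + "'");
                }
                return word;
        }
    }

    private String readWord() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) {
            throw new IllegalArgumentException("expected a name at offset " + pos);
        }
        return s.substring(begin, pos);
    }

    private boolean tryConsume(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!tryConsume(c)) {
            throw new IllegalArgumentException("expected '" + c + "' at offset " + pos);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("array<map<string,struct<a:int,b:array<string>,c:double>>>"));
        // prints Array(Map(string -> Struct(a: int, b: Array(string), c: double)))
    }
}
```

Each grammar rule maps to one branch of `parseType`, which is the ease of parsing that the slide contrasts with Thrift.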
  • 20. Future Works
  • 21. Future Works of ObjectInspector ▪ Delegate all methods described earlier ▪ isNull(), hashCode(), compare() etc are not delegated yet ▪ Support UNION data type: HIVE-537
  • 22. Future Works of SerDe ▪ LazyBinarySerDe: HIVE-553 ▪ A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order ▪ A binary-format compact SerDe: saving space