Your SlideShare is downloading. ×
  • Like
Hive Object Model
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Now you can save presentations on your phone or tablet

Available for both IPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Hive Object Model

  • 5,097 views
Published

This set of slides describes the efficient Java object model in Hive.

This set of slides describes the efficient Java object model in Hive.

Published in Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
5,097
On SlideShare
0
From Embeds
0
Number of Embeds
1

Actions

Shares
Downloads
347
Comments
0
Likes
9

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Efficient Object Model in Java Slides by Zheng Shao, Facebook Part of Apache Hadoop Hive Project
  • 2. Object Inspector
  • 3. On-disk Data Format ▪ Single on-disk form system at s ▪ Simplicity ▪ Multiple on-disk form system at s ▪ Ease-of-use ▪ Ease-of-integration ▪ Flexibility: better trade off between space, performance, etc ▪ Hive allow M s ultiple on-disk format
  • 4. Exam M ple ultiple on-disk Formats ▪ File Format: ▪ Row-based ▪ Column-based ▪ Block-based ▪ Rowformat: ▪ Text-based ▪ Binary-based ▪ Customized ▪ Index format
  • 5. In-m ory Data Form em at ▪ Single in-m ory form system em at s ▪ Simplicity: Simpler code ▪ Multiple in-m ory form system em at s ▪ Ease-of-integration: other system m use their ow form s ay n at ▪ Performance: ▪ Multiple on-disk format/external form + efficient loading at M ultiple in-m ory form em at ▪ Hive allow M s ultiple in-m ory form em at
  • 6. Exam M ple ultiple in-m ory Form em ats ▪ Integer: ▪ Integer ▪ IntWritable ▪ LazyInteger ▪ String: ▪ String ▪ Text
  • 7. Multiple In-m ory Form Design Patterns em at ▪ Object-oriented: ▪ A single interface/base class for Integer ▪ Multiple derived classes ▪ Delegation: ▪ data stored in object ▪ format/operations stored in objectInspector ▪ a pair of object and objectInspector represents a data unit ▪ It’ possible to w either one up to conform to the other’ pattern. s rap s
  • 8. Multiple In-m ory Form Design Patterns em at ▪ In OO, w need an interface HiveInteger to represent Integers e ▪ Make Integer, IntWritable classes all implem it. ent ▪ How ever, Integer class is final (not extendable) and does not implem HiveInteger ent ▪ W need to do a conversion, every tim w exchange data w UDF, e e e ith SerDe (Thrift), or other libraries (unless they knowHiveInteger –this is a bad assum ption to m ake in open system ). ▪ Delegation w be a better idea because ill ▪ For Integer, w have an JavaIntegerObjectInspector e ▪ For IntWritable , w have an W e ritableIntegerObjectInspector ▪ W convert param and return values only if necessary e s
  • 9. Delegation Method List ▪ General methods: ▪ List Objects: ▪ isNull(object o) ▪ getListSize(object o) ▪ hashCode(object o) ▪ getListElement(object o) ▪ compare(object o) ▪ getList(object o) ▪ clone(object o) ▪ M Objects: ap ▪ Primitive Objects: ▪ getMapSize(object o) ▪ primitive getValue(object o) ▪ getValueForKey(object o) ▪ String Objects: ▪ getMap(object o) ▪ String getString(object o) ▪ Struct Objects: ▪ Text getText(object o) ▪ getStructField(object o) ▪ getStructAsAList(object o)
  • 10. SerDe
  • 11. Where is SerDe? Hive Operator Hive Operator Re duc e r Mappe r ObjectInspector Hierarchical Hierarchical Hierarchical Hierarchical Hierarchical Object Object Object Standard Object Object Object LazyObject Java Object Use ArrayList for struct and Lazily-deserialized Object of a Java array SerDe Class Use HashM for m ap ap Text(‘ p 1.0 3 54’// UTF8 im ) Writable W ritable W ritable encoded W ritable W ritable Writable BytesW ritable(x3Fx64x72x0 W ritable W ritable 0) FileForm / Hadoop Serialization at File on Map thrift_record<… > Stream Stream im 1.0 3 54 p File on HDFS Output thrift_record<… > Im 0.2 1 33 p HDFS File thrift_record<… > clk 2.2 8 212 thrift_record<… > Im 0.7 2 22 p User Script
  • 12. SerDe, ObjectInspector and TypeInfo “ av” int int String Object Obje c tIns pe c to r3 string string struct getType g e tMapValue Hierarchical getMapValueOI HashMap<String, String> a, Obje c tIns pe c to r2 Object HashM ap(“  “ getType“ ), a” av”“  bv” , b” map int list class HO { string HashM ap<String, String> a, g e tS truc tFie ld Integer b, List ( List<ClassC> c, HashM ap(“  “ , “  “ ), a” av” b” bv” String d; Hierarchical getFieldOI Obje c tIns pe23, r1 c to } Object getType Class ClassC { Struct List(List(1,null),List(2,4),List(5,null)), Integer a, “ abcd” Integer b; Type Info de s e rialize s e rialize S e rDe ) getOI } Writable Writable Text(‘ a=av:b=bv 23 1:2=4:5 BytesWritable(x3Fx64x72x0 abcd’) 0)
  • 13. LazySimpleSerDe components byte[](‘a=av:b=bv 23 1:2=4:5 byte[] data abcd’ ) LazyStruct LazyStructOI(“ ) “ LazyMap LazyInteger LazyArray LazyString LazyMapOI(“ ,” ) :” =“ LazyArrayOI(“ ) :” LazyStruct LazyStringOI LazyString LazyString LazyInteger LazyStringOI LazyString LazyString LazyInteger LazyStructOI(“ ) =“ LazyStruct Hierarchical Object / LazyObject LazyInteger LazyIntegerOI StandardIntegerOI One Per SerDe instance LazyInteger LazyObjectInspector Singleton
  • 14. LazyPrimitive ▪ LazyString/LazyInteger ▪ setAll(byte[] data, int start, int length) ▪ LazyString: parse the data and create a String object ▪ LazyInteger: parse the data and create an Integer object ▪ getObject() –returns the corresponding String/Integer object ▪ Future ▪ Replace String/Integer w Text/IntW ith ritable ▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
  • 15. LazyNonPrimitive ▪ LazyStruct/LazyArray/LazyMap ▪ setAll(byte[] data, int start, int length) ▪ Rem ber data, start and length, and set parsed to false. em ▪ getStructField/getArrayElement/getMapValue ▪ If not parsed yet, parse the byte and rem ber starting positions of em each field/element/key/value ▪ For Struct/Array, do setAll on the corresponding LazyObject and return it ▪ For M search for the serialized key and return the corresponding ap, value (after doing a setAll on the value).
  • 16. W another SerDe? hy ▪ Functionality: ▪ MetadataTypedColumnSetSerDe can only deal w String colum ith ns ▪ Dynam icSerDe can deal w all prim ith itive colum and prim ns itive lists/ maps, but it does not fully support nested types yet. ▪ Efficiency: ▪ Both M etadataTypedColum nSetSerDe and Dynam icSerDe uses String.split() and are not efficient for long rows
  • 17. Features of LazySimpleSerDe ▪ Functionality: ▪ Fully compatible w M ith etaDataSerDe and Dynamic/TCTLSeparated ▪ Fully support all nested types (M Key m be prim ap ust itive) ▪ Efficiency: ▪ Fully support lazy deserialization - only deserialize the field (and create Objects) w hen asked. ▪ Reuse multiple-levels of LazyObjects. ▪ Read numbers without UTF-8 decoding ▪ (TODO) Fully reuse objects - IntWritable for Integer, Text for String ▪ (TODO) W num rite bers without UTF-8 encoding
  • 18. Profiling result of a mapper ▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression) ▪ 22%: Operator.close ▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding) ▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat) ▪ 50%: Operator.forward ▪ |-18%: Text.decode (from LazySerDe) ▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding) ▪ | |- 5%: toString() (where we create the string object) ▪ |- 3%: LazyStruct.parse (the code that search for separators in the row) ▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData) ▪ |- 8%: GroupByOperator.processHashAggr ▪ |- 3%: HashMap.get() in GroupByOperator ▪ * Performance Data from Rodrigo Schmidt
  • 19. TypeInfo String specification ▪ W not Thrift? hy ▪ Hard to parse ▪ Sim Syntax ple ▪ Type: PrimitiveType | MapType | ArrayType | StructType ▪ PrimitiveType: int | bigint | tinyint | smallint | double | string ▪ MapType: map<Type, Type> ▪ ArrayType: array<Type> ▪ StructType: struct< [Nam : Type]+ > e ▪ Example: array<map<string,struct<a:int,b:array<string>,c:doube>>>
  • 20. Future Works
  • 21. Future Works of ObjectInspector ▪ Delegate all methods described earlier ▪ isNull(), hashCode(), compare() etc are not delegated yet ▪ Support UNION data type: HIVE-537
  • 22. Future Works of SerDe ▪ LazyBinarySerDe: HIVE-553 ▪ A binary-form sortable SerDe: serialized sorting order is the sam at e as deserialized sorting order ▪ A binary-form com at pact SerDe: saving space