Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Efficient Object Model in Java


Slides by Zheng Shao, Facebook
Part of Apache Hadoop Hive Project
Object Inspector
On-disk Data Format
▪   Single on-disk form system
                       at     s
    ▪   Simplicity
▪   Multiple on-disk...
Exam M
    ple ultiple on-disk Formats
▪   File Format:
    ▪   Row-based
    ▪   Column-based
    ▪   Block-based
▪   Row...
In-m ory Data Form
    em            at
▪   Single in-m ory form system
               em       at     s
    ▪   Simplicit...
Exam M
    ple ultiple in-m ory Form
                    em       ats
▪   Integer:
    ▪   Integer
    ▪   IntWritable
   ...
Multiple In-m ory Form Design Patterns
             em       at
▪   Object-oriented:
    ▪   A single interface/base class...
Multiple In-m ory Form Design Patterns
             em       at
▪   In OO, w need an interface HiveInteger to represent In...
Delegation Method List
▪   General methods:                   ▪   List Objects:
    ▪   isNull(object o)                  ...
SerDe
Where is SerDe?
                                                    Hive Operator                            Hive Operator...
SerDe, ObjectInspector and TypeInfo
                              “
                              av”                     ...
LazySimpleSerDe components
                                                     byte[](‘a=av:b=bv 23 1:2=4:5
             ...
LazyPrimitive
▪   LazyString/LazyInteger
    ▪   setAll(byte[] data, int start, int length)
        ▪   LazyString: parse ...
LazyNonPrimitive
▪   LazyStruct/LazyArray/LazyMap
    ▪   setAll(byte[] data, int start, int length)
        ▪   Rem ber d...
W another SerDe?
 hy
▪   Functionality:
    ▪   MetadataTypedColumnSetSerDe can only deal w String colum
                 ...
Features of LazySimpleSerDe
▪   Functionality:
    ▪   Fully compatible w M
                          ith etaDataSerDe and...
Profiling result of a mapper
▪   17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪   22%: Oper...
TypeInfo String specification
▪   W not Thrift?
     hy
    ▪   Hard to parse
▪   Sim Syntax
       ple
    ▪   Type: Prim...
Future Works
Future Works of ObjectInspector
▪   Delegate all methods described earlier
    ▪   isNull(), hashCode(), compare() etc are...
Future Works of SerDe
▪   LazyBinarySerDe: HIVE-553
    ▪   A binary-form sortable SerDe: serialized sorting order is the ...
Upcoming SlideShare
Loading in …5
×

13

Share

Download to read offline

Hive Object Model

Download to read offline

This set of slides describes the efficient Java object model in Hive.

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Hive Object Model

  1. 1. Efficient Object Model in Java Slides by Zheng Shao, Facebook Part of Apache Hadoop Hive Project
  2. 2. Object Inspector
  3. 3. On-disk Data Format ▪ Single on-disk form system at s ▪ Simplicity ▪ Multiple on-disk form system at s ▪ Ease-of-use ▪ Ease-of-integration ▪ Flexibility: better trade off between space, performance, etc ▪ Hive allow M s ultiple on-disk format
  4. 4. Exam M ple ultiple on-disk Formats ▪ File Format: ▪ Row-based ▪ Column-based ▪ Block-based ▪ Rowformat: ▪ Text-based ▪ Binary-based ▪ Customized ▪ Index format
  5. 5. In-m ory Data Form em at ▪ Single in-m ory form system em at s ▪ Simplicity: Simpler code ▪ Multiple in-m ory form system em at s ▪ Ease-of-integration: other system m use their ow form s ay n at ▪ Performance: ▪ Multiple on-disk format/external form + efficient loading at M ultiple in-m ory form em at ▪ Hive allow M s ultiple in-m ory form em at
  6. 6. Exam M ple ultiple in-m ory Form em ats ▪ Integer: ▪ Integer ▪ IntWritable ▪ LazyInteger ▪ String: ▪ String ▪ Text
  7. 7. Multiple In-m ory Form Design Patterns em at ▪ Object-oriented: ▪ A single interface/base class for Integer ▪ Multiple derived classes ▪ Delegation: ▪ data stored in object ▪ format/operations stored in objectInspector ▪ a pair of object and objectInspector represents a data unit ▪ It’ possible to w either one up to conform to the other’ pattern. s rap s
  8. 8. Multiple In-m ory Form Design Patterns em at ▪ In OO, w need an interface HiveInteger to represent Integers e ▪ Make Integer, IntWritable classes all implem it. ent ▪ How ever, Integer class is final (not extendable) and does not implem HiveInteger ent ▪ W need to do a conversion, every tim w exchange data w UDF, e e e ith SerDe (Thrift), or other libraries (unless they knowHiveInteger –this is a bad assum ption to m ake in open system ). ▪ Delegation w be a better idea because ill ▪ For Integer, w have an JavaIntegerObjectInspector e ▪ For IntWritable , w have an W e ritableIntegerObjectInspector ▪ W convert param and return values only if necessary e s
  9. 9. Delegation Method List ▪ General methods: ▪ List Objects: ▪ isNull(object o) ▪ getListSize(object o) ▪ hashCode(object o) ▪ getListElement(object o) ▪ compare(object o) ▪ getList(object o) ▪ clone(object o) ▪ M Objects: ap ▪ Primitive Objects: ▪ getMapSize(object o) ▪ primitive getValue(object o) ▪ getValueForKey(object o) ▪ String Objects: ▪ getMap(object o) ▪ String getString(object o) ▪ Struct Objects: ▪ Text getText(object o) ▪ getStructField(object o) ▪ getStructAsAList(object o)
  10. 10. SerDe
  11. 11. Where is SerDe? Hive Operator Hive Operator Re duc e r Mappe r ObjectInspector Hierarchical Hierarchical Hierarchical Hierarchical Hierarchical Object Object Object Standard Object Object Object LazyObject Java Object Use ArrayList for struct and Lazily-deserialized Object of a Java array SerDe Class Use HashM for m ap ap Text(‘ p 1.0 3 54’// UTF8 im ) Writable W ritable W ritable encoded W ritable W ritable Writable BytesW ritable(x3Fx64x72x0 W ritable W ritable 0) FileForm / Hadoop Serialization at File on Map thrift_record<… > Stream Stream im 1.0 3 54 p File on HDFS Output thrift_record<… > Im 0.2 1 33 p HDFS File thrift_record<… > clk 2.2 8 212 thrift_record<… > Im 0.7 2 22 p User Script
  12. 12. SerDe, ObjectInspector and TypeInfo “ av” int int String Object Obje c tIns pe c to r3 string string struct getType g e tMapValue Hierarchical getMapValueOI HashMap<String, String> a, Obje c tIns pe c to r2 Object HashM ap(“  “ getType“ ), a” av”“  bv” , b” map int list class HO { string HashM ap<String, String> a, g e tS truc tFie ld Integer b, List ( List<ClassC> c, HashM ap(“  “ , “  “ ), a” av” b” bv” String d; Hierarchical getFieldOI Obje c tIns pe23, r1 c to } Object getType Class ClassC { Struct List(List(1,null),List(2,4),List(5,null)), Integer a, “ abcd” Integer b; Type Info de s e rialize s e rialize S e rDe ) getOI } Writable Writable Text(‘ a=av:b=bv 23 1:2=4:5 BytesWritable(x3Fx64x72x0 abcd’) 0)
  13. 13. LazySimpleSerDe components byte[](‘a=av:b=bv 23 1:2=4:5 byte[] data abcd’ ) LazyStruct LazyStructOI(“ ) “ LazyMap LazyInteger LazyArray LazyString LazyMapOI(“ ,” ) :” =“ LazyArrayOI(“ ) :” LazyStruct LazyStringOI LazyString LazyString LazyInteger LazyStringOI LazyString LazyString LazyInteger LazyStructOI(“ ) =“ LazyStruct Hierarchical Object / LazyObject LazyInteger LazyIntegerOI StandardIntegerOI One Per SerDe instance LazyInteger LazyObjectInspector Singleton
  14. 14. LazyPrimitive ▪ LazyString/LazyInteger ▪ setAll(byte[] data, int start, int length) ▪ LazyString: parse the data and create a String object ▪ LazyInteger: parse the data and create an Integer object ▪ getObject() –returns the corresponding String/Integer object ▪ Future ▪ Replace String/Integer w Text/IntW ith ritable ▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
  15. 15. LazyNonPrimitive ▪ LazyStruct/LazyArray/LazyMap ▪ setAll(byte[] data, int start, int length) ▪ Rem ber data, start and length, and set parsed to false. em ▪ getStructField/getArrayElement/getMapValue ▪ If not parsed yet, parse the byte and rem ber starting positions of em each field/element/key/value ▪ For Struct/Array, do setAll on the corresponding LazyObject and return it ▪ For M search for the serialized key and return the corresponding ap, value (after doing a setAll on the value).
  16. 16. W another SerDe? hy ▪ Functionality: ▪ MetadataTypedColumnSetSerDe can only deal w String colum ith ns ▪ Dynam icSerDe can deal w all prim ith itive colum and prim ns itive lists/ maps, but it does not fully support nested types yet. ▪ Efficiency: ▪ Both M etadataTypedColum nSetSerDe and Dynam icSerDe uses String.split() and are not efficient for long rows
  17. 17. Features of LazySimpleSerDe ▪ Functionality: ▪ Fully compatible w M ith etaDataSerDe and Dynamic/TCTLSeparated ▪ Fully support all nested types (M Key m be prim ap ust itive) ▪ Efficiency: ▪ Fully support lazy deserialization - only deserialize the field (and create Objects) w hen asked. ▪ Reuse multiple-levels of LazyObjects. ▪ Read numbers without UTF-8 decoding ▪ (TODO) Fully reuse objects - IntWritable for Integer, Text for String ▪ (TODO) W num rite bers without UTF-8 encoding
  18. 18. Profiling result of a mapper ▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression) ▪ 22%: Operator.close ▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding) ▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat) ▪ 50%: Operator.forward ▪ |-18%: Text.decode (from LazySerDe) ▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding) ▪ | |- 5%: toString() (where we create the string object) ▪ |- 3%: LazyStruct.parse (the code that search for separators in the row) ▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData) ▪ |- 8%: GroupByOperator.processHashAggr ▪ |- 3%: HashMap.get() in GroupByOperator ▪ * Performance Data from Rodrigo Schmidt
  19. 19. TypeInfo String specification ▪ W not Thrift? hy ▪ Hard to parse ▪ Sim Syntax ple ▪ Type: PrimitiveType | MapType | ArrayType | StructType ▪ PrimitiveType: int | bigint | tinyint | smallint | double | string ▪ MapType: map<Type, Type> ▪ ArrayType: array<Type> ▪ StructType: struct< [Nam : Type]+ > e ▪ Example: array<map<string,struct<a:int,b:array<string>,c:doube>>>
  20. 20. Future Works
  21. 21. Future Works of ObjectInspector ▪ Delegate all methods described earlier ▪ isNull(), hashCode(), compare() etc are not delegated yet ▪ Support UNION data type: HIVE-537
  22. 22. Future Works of SerDe ▪ LazyBinarySerDe: HIVE-553 ▪ A binary-form sortable SerDe: serialized sorting order is the sam at e as deserialized sorting order ▪ A binary-form com at pact SerDe: saving space
  • stone_xy

    Dec. 29, 2018
  • AdelardoLpezRamn

    May. 17, 2017
  • vasu463

    Mar. 20, 2016
  • AjaiOmtri

    Jul. 24, 2015
  • rdsr13

    Oct. 15, 2014
  • mohit_s

    Aug. 11, 2014
  • JinJie1

    Apr. 9, 2014
  • code6

    Nov. 23, 2013
  • Shorack

    Oct. 3, 2013
  • hellojinjie

    Sep. 27, 2013
  • alexeymigutsky

    Jul. 30, 2013
  • chchen62

    Mar. 24, 2010
  • marc.de.palol

    Aug. 4, 2009

This set of slides describes the efficient Java object model in Hive.

Views

Total views

7,940

On Slideshare

0

From embeds

0

Number of embeds

107

Actions

Downloads

387

Shares

0

Comments

0

Likes

13

×