Efficient Object Model in Java


Slides by Zheng Shao, Facebook
Part of Apache Hadoop Hive Project
Object Inspector
On-disk Data Format
▪   Single on-disk format systems
    ▪   Simplicity
▪   Multiple on-disk format systems
    ▪   Ease-of-use
    ▪   Ease-of-integration
    ▪   Flexibility: better trade-off between space, performance, etc.
▪   Hive allows multiple on-disk formats
Example Multiple On-disk Formats
▪   File Format:
    ▪   Row-based
    ▪   Column-based
    ▪   Block-based
▪   Row format:
    ▪   Text-based
    ▪   Binary-based
    ▪   Customized
▪   Index format
In-memory Data Format
▪   Single in-memory format systems
    ▪   Simplicity: simpler code
▪   Multiple in-memory format systems
    ▪   Ease-of-integration: other systems may use their own formats
    ▪   Performance:
        ▪   Multiple on-disk formats/external formats + efficient loading
            => multiple in-memory formats
▪   Hive allows multiple in-memory formats
Example Multiple In-memory Formats
▪   Integer:
    ▪   Integer
    ▪   IntWritable
    ▪   LazyInteger
▪   String:
    ▪   String
    ▪   Text
Multiple In-memory Format Design Patterns
▪   Object-oriented:
    ▪   A single interface/base class for Integer
    ▪   Multiple derived classes
▪   Delegation:
    ▪   data stored in object
    ▪   format/operations stored in objectInspector
    ▪   a pair of object and objectInspector represents a data unit
▪   It's possible to wrap either one up to conform to the other's pattern.
Multiple In-memory Format Design Patterns
▪   In OO, we need an interface HiveInteger to represent Integers
    ▪   Make the Integer and IntWritable classes all implement it.
    ▪   However, the Integer class is final (not extendable) and does not
        implement HiveInteger
    ▪   We need to do a conversion every time we exchange data with UDF,
        SerDe (Thrift), or other libraries (unless they know HiveInteger, which
        is a bad assumption to make in open systems).
▪   Delegation will be a better idea because
    ▪   For Integer, we have a JavaIntegerObjectInspector
    ▪   For IntWritable, we have a WritableIntegerObjectInspector
    ▪   We convert params and return values only if necessary
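The delegation idea above can be sketched in a few lines of Java. This is a minimal illustration, not Hive's actual API: `IntInspector`, `JavaIntInspector`, `MutableInt`, `MutableIntInspector` and `Summer` are hypothetical names, and `MutableInt` is only a stand-in for a reusable holder such as IntWritable so that the example has no Hadoop dependency.

```java
// Demo entry point: the same operator code runs over two in-memory formats.
class DelegationDemo {
    public static void main(String[] args) {
        Object[] boxed = {1, 2, 3};
        Object[] holders = {new MutableInt(10), new MutableInt(20)};
        System.out.println(Summer.sum(boxed, new JavaIntInspector()));      // 6
        System.out.println(Summer.sum(holders, new MutableIntInspector())); // 30
    }
}

// Hypothetical inspector: knows how to read an int out of some object.
interface IntInspector {
    int get(Object o);
}

// Stand-in for a mutable, reusable holder (assumption: plays the role of IntWritable).
final class MutableInt {
    int value;
    MutableInt(int value) { this.value = value; }
}

// Inspector for plain java.lang.Integer, which is final and cannot implement our interface.
final class JavaIntInspector implements IntInspector {
    public int get(Object o) { return (Integer) o; }
}

// Inspector for the mutable holder format.
final class MutableIntInspector implements IntInspector {
    public int get(Object o) { return ((MutableInt) o).value; }
}

// "Operator" code: it only ever sees an (object, inspector) pair, never a concrete format.
final class Summer {
    static long sum(Object[] values, IntInspector inspector) {
        long total = 0;
        for (Object v : values) {
            total += inspector.get(v);
        }
        return total;
    }
}
```

No value is converted or copied here: the data stays in whatever format it arrived in, and only the inspector changes.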
Delegation Method List
▪   General methods:
    ▪   isNull(object o)
    ▪   hashCode(object o)
    ▪   compare(object o)
    ▪   clone(object o)
▪   Primitive Objects:
    ▪   primitive getValue(object o)
▪   String Objects:
    ▪   String getString(object o)
    ▪   Text getText(object o)
▪   List Objects:
    ▪   getListSize(object o)
    ▪   getListElement(object o)
    ▪   getList(object o)
▪   Map Objects:
    ▪   getMapSize(object o)
    ▪   getValueForKey(object o)
    ▪   getMap(object o)
▪   Struct Objects:
    ▪   getStructField(object o)
    ▪   getStructAsAList(object o)
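As an illustration of the list methods, here is a small sketch of a list inspector with two backing formats. The interface and class names are hypothetical, and the `index` parameter on `getListElement` is an assumption added so the example is usable; the slide lists the methods with only the object argument.

```java
import java.util.Arrays;
import java.util.List;

// Demo entry point: one join routine, two list representations.
class ListInspectorDemo {
    public static void main(String[] args) {
        List<Object> asList = Arrays.asList("a", "b", "c");
        Object[] asArray = {"a", "b", "c"};
        System.out.println(ListUtil.join(asList, new JavaListInspector()));     // a,b,c
        System.out.println(ListUtil.join(asArray, new ObjectArrayInspector())); // a,b,c
    }
}

// Hypothetical inspector for list-like objects.
interface ListInspector {
    int getListSize(Object o);
    Object getListElement(Object o, int index); // index parameter is an assumption
}

// Lists stored as java.util.List.
final class JavaListInspector implements ListInspector {
    public int getListSize(Object o) { return ((List<?>) o).size(); }
    public Object getListElement(Object o, int index) { return ((List<?>) o).get(index); }
}

// Lists stored as Object[].
final class ObjectArrayInspector implements ListInspector {
    public int getListSize(Object o) { return ((Object[]) o).length; }
    public Object getListElement(Object o, int index) { return ((Object[]) o)[index]; }
}

// Code written against the inspector works for any list format.
final class ListUtil {
    static String join(Object list, ListInspector inspector) {
        StringBuilder sb = new StringBuilder();
        int size = inspector.getListSize(list);
        for (int i = 0; i < size; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(inspector.getListElement(list, i));
        }
        return sb.toString();
    }
}
```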
SerDe
Where is SerDe?
[Diagram: layers of the Hive data path in a Mapper and Reducer, bottom to top]
▪   Storage: File on HDFS, Map Output File, and streams to and from a User Script
▪   FileFormat / Hadoop Serialization: produces Writable objects, e.g. Text('imp 1.0 3 54') (UTF-8 encoded), BytesWritable(\x3F\x64\x72\x00), thrift_record<…>
▪   SerDe: converts between Writable and Hierarchical Object
    ▪   Java Object: object of a Java class
    ▪   Standard Object: uses ArrayList for struct and array, HashMap for map
    ▪   LazyObject: lazily-deserialized
▪   ObjectInspector: how each Hive Operator accesses a Hierarchical Object
▪   Sample text rows: imp 1.0 3 54; Imp 0.2 1 33; clk 2.2 8 212; Imp 0.7 2 22
SerDe, ObjectInspector and TypeInfo
[Diagram: how SerDe, ObjectInspector and TypeInfo relate for one example row]
▪   Java classes in the example:
        class HO {
          HashMap<String, String> a,
          Integer b,
          List<ClassC> c,
          String d;
        }
        Class ClassC {
          Integer a,
          Integer b;
        }
▪   Type Info tree: struct of (map of string to string, int, list of struct of (int, int), string)
▪   Writable: Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00)
▪   SerDe: deserialize (Writable to Hierarchical Object), serialize (Hierarchical Object to Writable), getOI (returns ObjectInspector1)
▪   Hierarchical Object: List(HashMap("a" → "av", "b" → "bv"), 23, List(List(1,null),List(2,4),List(5,null)), "abcd")
▪   ObjectInspector1: getStructField returns the HashMap; getFieldOI returns ObjectInspector2; getType returns the struct Type Info
▪   ObjectInspector2: getMapValue returns the String Object "av"; getMapValueOI returns ObjectInspector3; getType returns the map Type Info
▪   ObjectInspector3: getType returns the string Type Info
LazySimpleSerDe components
[Diagram: LazyObject tree and the matching LazyObjectInspector tree for byte[] data = 'a=av:b=bv 23 1:2=4:5 abcd']
▪   LazyStruct, inspected by LazyStructOI(" ")
    ▪   LazyMap, inspected by LazyMapOI(":", "="): LazyString keys and values, inspected by LazyStringOI
    ▪   LazyInteger, inspected by LazyIntegerOI
    ▪   LazyArray, inspected by LazyArrayOI(":"): LazyStruct elements holding LazyInteger fields, inspected by LazyStructOI("=")
    ▪   LazyString, inspected by LazyStringOI
▪   Hierarchical Object / LazyObject: one per SerDe instance
▪   LazyObjectInspector: singleton
▪   StandardIntegerOI is shown next to LazyIntegerOI
LazyPrimitive
▪   LazyString/LazyInteger
    ▪   setAll(byte[] data, int start, int length)
        ▪   LazyString: parse the data and create a String object
        ▪   LazyInteger: parse the data and create an Integer object
    ▪   getObject(): returns the corresponding String/Integer object
▪   Future
    ▪   Replace String/Integer with Text/IntWritable
    ▪   The Text/IntWritable object is owned by the LazyString/LazyInteger
        object.
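A minimal sketch of the LazyInteger behavior described above, assuming ASCII decimal digits in the byte range. `LazyIntegerSketch` and `parseInt` are illustrative names rather than Hive's classes, and returning null for malformed or out-of-range input is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;

// Demo entry point: parse the integer field out of a serialized row.
class LazyIntegerDemo {
    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 abcd".getBytes(StandardCharsets.US_ASCII);
        LazyIntegerSketch lazy = new LazyIntegerSketch();
        lazy.setAll(row, 10, 2);              // bytes 10..11 hold "23"
        System.out.println(lazy.getObject()); // 23
    }
}

final class LazyIntegerSketch {
    private Integer value;

    // Primitives parse as soon as the byte range is set, as the slide describes.
    void setAll(byte[] data, int start, int length) {
        value = parseInt(data, start, length);
    }

    Integer getObject() {
        return value;
    }

    // Reads decimal digits straight from the bytes, with no intermediate String
    // and no character-set decoding. Returns null on bad input (assumption).
    static Integer parseInt(byte[] data, int start, int length) {
        if (length <= 0) {
            return null;
        }
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
        }
        if (i == end) {
            return null; // a sign with no digits
        }
        long result = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                return null;
            }
            result = result * 10 + digit;
            if (result > (long) Integer.MAX_VALUE + 1) {
                return null; // overflow
            }
        }
        if (negative) {
            result = -result;
        }
        if (result > Integer.MAX_VALUE) {
            return null; // +2147483648 does not fit in an int
        }
        return (int) result;
    }
}
```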
LazyNonPrimitive
▪   LazyStruct/LazyArray/LazyMap
    ▪   setAll(byte[] data, int start, int length)
        ▪   Remember data, start and length, and set parsed to false.
    ▪   getStructField/getArrayElement/getMapValue
        ▪   If not parsed yet, parse the bytes and remember the starting positions of
            each field/element/key/value
        ▪   For Struct/Array, do setAll on the corresponding LazyObject and
            return it
        ▪   For Map, search for the serialized key and return the corresponding
            value (after doing a setAll on the value).
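The deferred parsing described above might look roughly like this for a struct. `LazyStructSketch` is a hypothetical simplification: it uses a single-byte separator and returns each field as a String, whereas the slides describe calling setAll on a child LazyObject and returning that.

```java
import java.nio.charset.StandardCharsets;

// Demo entry point: nothing is scanned until the first field access.
class LazyStructDemo {
    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 abcd".getBytes(StandardCharsets.UTF_8);
        LazyStructSketch struct = new LazyStructSketch((byte) ' ', 3);
        struct.setAll(row, 0, row.length);
        System.out.println(struct.isParsed());  // false
        System.out.println(struct.getField(1)); // 23
        System.out.println(struct.isParsed());  // true
    }
}

final class LazyStructSketch {
    private final byte separator;
    private final int maxFields;
    // starts[i] is the offset of field i; starts[fieldCount] is a sentinel one past
    // the position where a trailing separator would sit.
    private final int[] starts;
    private byte[] data;
    private int start;
    private int length;
    private int fieldCount;
    private boolean parsed;

    LazyStructSketch(byte separator, int maxFields) {
        this.separator = separator;
        this.maxFields = maxFields;
        this.starts = new int[maxFields + 1];
    }

    // Only remembers the byte range; no scanning happens here.
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    boolean isParsed() {
        return parsed;
    }

    // Scans once for separators and records where each field begins.
    // Any separators beyond maxFields - 1 stay inside the last field.
    private void parse() {
        int end = start + length;
        int field = 0;
        starts[0] = start;
        for (int i = start; i < end && field < maxFields - 1; i++) {
            if (data[i] == separator) {
                starts[++field] = i + 1;
            }
        }
        fieldCount = field + 1;
        starts[fieldCount] = end + 1;
        parsed = true;
    }

    // Returns the field text, or null if the row has no such field.
    String getField(int index) {
        if (!parsed) {
            parse();
        }
        if (index < 0 || index >= fieldCount) {
            return null;
        }
        int from = starts[index];
        int fieldLength = starts[index + 1] - 1 - from;
        return new String(data, from, fieldLength, StandardCharsets.UTF_8);
    }
}
```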
Why another SerDe?
▪   Functionality:
    ▪   MetadataTypedColumnSetSerDe can only deal with String columns
    ▪   DynamicSerDe can deal with all primitive columns and primitive lists/
        maps, but it does not fully support nested types yet.
▪   Efficiency:
    ▪   Both MetadataTypedColumnSetSerDe and DynamicSerDe use
        String.split() and are not efficient for long rows
Features of LazySimpleSerDe
▪   Functionality:
    ▪   Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
    ▪   Fully supports all nested types (Map Key must be primitive)
▪   Efficiency:
    ▪   Fully supports lazy deserialization - only deserializes a field (and
        creates Objects) when asked.
    ▪   Reuses multiple levels of LazyObjects.
    ▪   Reads numbers without UTF-8 decoding
    ▪   (TODO) Fully reuse objects - IntWritable for Integer, Text for String
    ▪   (TODO) Write numbers without UTF-8 encoding
Profiling result of a mapper
▪   17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪   22%: Operator.close
▪   |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪   |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪   50%: Operator.forward
▪   |-18%: Text.decode (from LazySerDe)
▪   | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪   | |- 5%: toString() (where we create the string object)
▪   |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪   |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪   |- 8%: GroupByOperator.processHashAggr
▪   |- 3%: HashMap.get() in GroupByOperator




▪   * Performance Data from Rodrigo Schmidt
TypeInfo String specification
▪   Why not Thrift?
    ▪   Hard to parse
▪   Simple Syntax
    ▪   Type: PrimitiveType | MapType | ArrayType | StructType
    ▪   PrimitiveType: int | bigint | tinyint | smallint | double | string
    ▪   MapType: map<Type, Type>
    ▪   ArrayType: array<Type>
    ▪   StructType: struct< [Name : Type]+ >
▪   Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
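The grammar above is small enough for a hand-written recursive-descent parser. The sketch below is one possible reading of it, not Hive's implementation: `TypeNode` and `TypeInfoParser` are hypothetical names, struct fields are assumed to be comma-separated (as in the example), and spaces are simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point: parse the example type string and print it back.
class TypeInfoDemo {
    public static void main(String[] args) {
        String text = "array<map<string,struct<a:int,b:array<string>,c:double>>>";
        TypeNode type = TypeInfoParser.parse(text);
        System.out.println(type.category);                 // array
        System.out.println(type.children.get(0).category); // map
        System.out.println(type.toString().equals(text));  // true
    }
}

// A node of the parsed type tree.
final class TypeNode {
    final String category; // a primitive name, or "map", "array", "struct"
    final List<String> fieldNames = new ArrayList<>(); // used by struct only
    final List<TypeNode> children = new ArrayList<>();

    TypeNode(String category) {
        this.category = category;
    }

    // Prints the type back in the same syntax it was parsed from.
    @Override
    public String toString() {
        if (children.isEmpty()) {
            return category;
        }
        StringBuilder sb = new StringBuilder(category).append('<');
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            if (category.equals("struct")) {
                sb.append(fieldNames.get(i)).append(':');
            }
            sb.append(children.get(i));
        }
        return sb.append('>').toString();
    }
}

final class TypeInfoParser {
    private final String s;
    private int pos;

    private TypeInfoParser(String s) {
        this.s = s;
    }

    static TypeNode parse(String text) {
        TypeInfoParser p = new TypeInfoParser(text.replace(" ", ""));
        TypeNode type = p.type();
        if (p.pos != p.s.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return type;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNode type() {
        String name = word();
        TypeNode node = new TypeNode(name);
        switch (name) {
            case "map":
                expect('<');
                node.children.add(type());
                expect(',');
                node.children.add(type());
                expect('>');
                break;
            case "array":
                expect('<');
                node.children.add(type());
                expect('>');
                break;
            case "struct":
                expect('<');
                do {
                    node.fieldNames.add(word());
                    expect(':');
                    node.children.add(type());
                } while (accept(','));
                expect('>');
                break;
            case "int": case "bigint": case "tinyint":
            case "smallint": case "double": case "string":
                break;
            default:
                throw new IllegalArgumentException("unknown type: " + name);
        }
        return node;
    }

    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) {
            throw new IllegalArgumentException("expected a name at " + pos);
        }
        return s.substring(begin, pos);
    }

    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
    }
}
```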
Future Works
Future Works of ObjectInspector
▪   Delegate all methods described earlier
    ▪   isNull(), hashCode(), compare() etc are not delegated yet
▪   Support UNION data type: HIVE-537
Future Works of SerDe
▪   LazyBinarySerDe: HIVE-553
    ▪   A binary-format sortable SerDe: serialized sorting order is the same
        as deserialized sorting order
    ▪   A binary-format compact SerDe: saving space
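To make the "sortable" property concrete, here is one common way to encode a 32-bit integer so that comparing the serialized bytes gives the same order as comparing the numbers: write it big-endian with the sign bit flipped. This is a general technique shown for illustration; the slides do not say which encoding HIVE-553 uses, and `SortableInt` is a hypothetical name.

```java
import java.util.Arrays;

// Demo entry point: sort the encoded bytes, then decode to show the order.
class SortableIntDemo {
    public static void main(String[] args) {
        int[] values = {5, -3, 0, Integer.MIN_VALUE, Integer.MAX_VALUE};
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = SortableInt.encode(values[i]);
        }
        // Sort using only the serialized bytes.
        Arrays.sort(encoded, SortableInt::compareBytes);
        for (byte[] e : encoded) {
            System.out.println(SortableInt.decode(e)); // ascending numeric order
        }
    }
}

final class SortableInt {
    // Flipping the sign bit maps negative numbers below positive ones
    // when the bytes are compared as unsigned values.
    static byte[] encode(int value) {
        int u = value ^ 0x80000000;
        return new byte[] {(byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u};
    }

    static int decode(byte[] bytes) {
        int u = ((bytes[0] & 0xFF) << 24) | ((bytes[1] & 0xFF) << 16)
                | ((bytes[2] & 0xFF) << 8) | (bytes[3] & 0xFF);
        return u ^ 0x80000000;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```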
