Flexible In-Situ Indexing for Hadoop via Elephant Twin

Slides from the Hadoop Summit 2012 presentation.


    1. Flexible Indexing in Hadoop. Dmitriy Ryaboy (@squarecog), Analytics Infrastructure @ Twitter. Hadoop Summit, San Jose, CA, June 2012
    2. @JoinTheFlock | Hadoop Summit, June 14 2012
    4. Hadoop is great at plowing through data. Image source: http://en.wikipedia.org/wiki/File:Snowplow_in_the_morning.jpg
    5. And we do plow
        • 10s of thousands of jobs per day
        • 100 TB (uncompressed) ingested daily
        • Many users and diverse use cases
    7. Looking for needles in haystacks. With snowplows. Image source: http://en.wikipedia.org/wiki/File:July_1903_-_on_the_Gaisberg,_nr_Salzburg.JPG
    8. A Pig Script
        event_logs = load '/logs/lots_of_data'
            using ThriftPigLoader('thrift.gen.LogEvent');
        filtered_logs = filter event_logs by event == 'something_rare';
        -- Then do stuff.
       90% of the mappers in this job output no data. We can do better...
    9. Find smaller haystacks.
    14. Use subpartitions!
        • tablename/year/month/day/hour/bucket
        • Only so many things you can partition by
        • Up-front planning required
        • Rewrite or duplicate for different query patterns
    18. Keep the data sorted!
        • Painful to maintain
        • Only one sort order at a time
        • Rewrite or duplicate for different query patterns
    23. Trojan Layouts*
        • Identify interesting column groupings
        • Use different column groupings per HDFS block replica
        • Requires changes to the NameNode (NN)
        • ... and increases load on the NN
        * http://infosys.uni-saarland.de/publications/JQD11.pdf
    30. HBase!
        • Good solution in many cases!
        • Maintenance overhead
        • All data must live in HBase
        • Full table scans slower than MR
        • Again with the up-front design
            • Secondary indexes can help
    32. Hive!
        • That kind of works, actually.
    33. Hive
        • Generic interface for defining indexing behavior.
        • Reference implementation: “compact” index. Value -> list of HDFS blocks; drop unneeded blocks.
        • Other indexes available (bitmap in 0.8)
        • It’ll even update indexes as you add partitions.
    34. WIN! Done, right?
    35. Hive
        • Good news if your data is in Hive!
        • Bad news if your world is a little bigger.
        • Indexing is tightly coupled to Hive.
        • No interoperability with the rest of the Hadoop stack.
    41. Democracy of Tools
        • Pig
        • Raw Map-Reduce
        • Cascading DSLs (Scalding, Cascalog, Py-Cascading)
        • Mahout
        • Maybe even Hive
        Image source: http://en.wikipedia.org/wiki/File:20070124_sejm_sala_plenarna.jpg
    50. Design Goals
        • Minimal job/script modification required
        • As low in the stack as possible
            • In fact, pretty sure we could get Hive to use this...
        • No unnecessary copies of data
        • Allow post-factum indexing
        • Graceful degradation
        • Flexible on-disk representation
    51. Elephant-Twin
        Twitter’s library for creating indexes in Hadoop
        https://github.com/twitter/elephant-twin
        https://github.com/twitter/elephant-twin-lzo
    52. Block-Level Indexes
        • For each value, record the block it occurs in
        • “Block” can be an HDFS block (100s of MBs)
        • Or an LZO block (100s of KBs)
        • Or a SequenceFile block
        • Or an RCFile block ...
        • Ignore irrelevant blocks
        • Scan relevant blocks using the original InputFormat
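
The block-level idea above can be sketched in a few lines of plain Java. This is a toy illustration, not Elephant Twin code: the class `BlockIndexSketch`, its methods, and the in-memory "blocks" (lists of strings) are all hypothetical stand-ins for real HDFS or LZO blocks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: maps each value to the ids of the blocks it occurs in.
class BlockIndexSketch {
    private final Map<String, SortedSet<Integer>> index = new HashMap<>();

    void add(String value, int blockId) {
        index.computeIfAbsent(value, k -> new TreeSet<>()).add(blockId);
    }

    // Blocks that may contain the value; every other block can be ignored.
    SortedSet<Integer> blocksFor(String value) {
        SortedSet<Integer> blocks = index.get(value);
        return blocks == null ? new TreeSet<Integer>() : blocks;
    }

    // "Indexing job": one pass over all blocks, recording value -> block id.
    static BlockIndexSketch build(List<List<String>> blocks) {
        BlockIndexSketch idx = new BlockIndexSketch();
        for (int blockId = 0; blockId < blocks.size(); blockId++) {
            for (String value : blocks.get(blockId)) {
                idx.add(value, blockId);
            }
        }
        return idx;
    }

    // "Search job": scan only the relevant blocks, record by record.
    static int countMatches(List<List<String>> blocks, BlockIndexSketch idx, String value) {
        int matches = 0;
        for (int blockId : idx.blocksFor(value)) {
            for (String record : blocks.get(blockId)) {
                if (record.equals(value)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = new ArrayList<>();
        blocks.add(List.of("click", "view"));
        blocks.add(List.of("view", "view"));
        blocks.add(List.of("rare", "view"));
        BlockIndexSketch idx = build(blocks);
        System.out.println("blocks to scan for 'rare': " + idx.blocksFor("rare"));
        System.out.println("matches: " + countMatches(blocks, idx, "rare"));
    }
}
```

Note that the index only narrows down where to look; the matching records are still found by an ordinary scan of the surviving blocks.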
    53. Record-Level Indexes
        • For each value, record some representation of the record
        • Can be value + offset, as in bitmap indexes
        • Can be a transformed projection of records, as in Lucene indexes
        • Some queries can be answered directly from the index.
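
A minimal sketch of the "value + offset" flavor, assuming a simple in-memory map. The class `RecordIndexSketch` and its methods are hypothetical names for illustration. It shows why a count query can be answered from the index alone, without touching the data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps each value to the byte offsets of its records.
class RecordIndexSketch {
    private final Map<String, List<Long>> offsets = new HashMap<>();

    void add(String value, long recordOffset) {
        offsets.computeIfAbsent(value, k -> new ArrayList<>()).add(recordOffset);
    }

    // Where to seek in the data file to read the matching records.
    List<Long> offsetsFor(String value) {
        List<Long> found = offsets.get(value);
        return found == null ? Collections.<Long>emptyList() : found;
    }

    // Answered directly from the index: no data is read.
    int count(String value) {
        return offsetsFor(value).size();
    }

    public static void main(String[] args) {
        RecordIndexSketch idx = new RecordIndexSketch();
        idx.add("rare", 128L);
        idx.add("view", 0L);
        idx.add("view", 64L);
        System.out.println("count(view) = " + idx.count("view"));
        System.out.println("offsets(rare) = " + idx.offsetsFor("rare"));
    }
}
```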
    54. Indexing [diagram: Data, InputFormat, MR job, Index]
    55. Creating an Index
        public abstract class AbstractBlockIndexingJob {
            protected abstract List<String> getInput();
            protected abstract String getIndex();
            protected abstract String getInputFormat();
            protected abstract String getValueClass();
            protected abstract String getColumnName();
            protected abstract Job setMapper(Job job);
        }

        public abstract class AbstractLuceneIndexingJob {
            // Similar.
        }
    56. Creating an Index
        Mapper transforms the records: emit <DocId, Value>
            Key           | Value
            Block Offset  | Column Value
            Tweet Id      | Text
        Block helper:
            public abstract class BlockIndexingMapper<KIN, VIN>
                extends Mapper<KIN, VIN, TextLongPairWritable, LongPairWritable> {}
        Lucene helper:
            public abstract class AbstractIndexingMapper<KIN, VIN, KOUT, VOUT>
                extends Mapper<KIN, VIN, KOUT, VOUT> {
                abstract protected boolean filter(KIN k, VIN v);
                abstract protected KOUT buildOutputKey(KIN k, VIN v);
            }
    57. Creating an Index
        Reducer writes appropriately processed indexes and metadata.
        MapFile block index:
            public class MapFileIndexingReducer
                extends Reducer<TextLongPairWritable, LongPairWritable, Text, ListLongPair>
        Lucene index:
            public abstract class AbstractLuceneIndexingReducer<KIN, VIN>
                extends Reducer<KIN, VIN, NullWritable, NullWritable> {
                protected abstract Document buildDocument(KIN k, VIN v);
            }
    58. Creating an Index: Metadata
        struct FileIndexDescriptor {
            1: DocType docType
            2: IndexType indexType
            3: i32 indexVersion
            4: string sourcePath
            5: FileChecksum checksum
            6: list<IndexedField> indexedFields
        }
        struct ETwinIndexDescriptor {
            1: list<FileIndexDescriptor> fileIndexDescriptors
            2: i32 indexPart
            3: optional map<string, string> options
        }
    59. Retrieval [diagram: searchKey, Index, Data, IndexedInputFormat, MR job]
    60. InputFormat
        public class BlockIndexedFileInputFormat<K, V> extends FileInputFormat<K, V> {
            // Indexing jobs call this function to set up indexing-job-related parameters.
            public static void setIndexOptions(Job job, String inputformatClass,
                String valueClass, String indexDir, String columnName)

            // Searching jobs call this function to set up searching-job-related parameters.
            public static void setSearchOptions(Job job, String inputformatClass,
                String valueClass, String indexDir, BinaryExpression filter)
        }
    61. BinaryExpression
        public BinaryExpression(Expression lhs, Expression rhs, OpType opType)

        public static enum OpType {
            OP_PLUS (" + "), OP_MINUS(" - "), ...
            OP_EQ(" == "), OP_NE(" != "), ...
            OP_AND(" and "), OP_OR(" or "), ...
            TERM_COL(" Column "), TERM_CONST(" Constant ");
        }
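
To illustrate how a filter expression of this shape could be resolved against block indexes, here is a toy sketch. It does not use the Elephant Twin `BinaryExpression` class: `FilterPushdownSketch`, `Expr`, `eq`, `and`, `or`, and `sampleIndex` are hypothetical names, and the sketch assumes every referenced column is indexed (a real implementation would need to fall back to a full scan otherwise). Equality maps to an index lookup, `and` to set intersection, and `or` to set union.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: evaluate a filter tree to the set of blocks worth scanning.
class FilterPushdownSketch {
    // index layout assumed here: column -> value -> ids of blocks containing it
    interface Expr {
        Set<Integer> blocks(Map<String, Map<String, Set<Integer>>> index);
    }

    // column == constant: a direct index lookup
    static Expr eq(String column, String constant) {
        return index -> {
            Map<String, Set<Integer>> byValue = index.get(column);
            Set<Integer> hit = byValue == null ? null : byValue.get(constant);
            return hit == null ? new TreeSet<Integer>() : new TreeSet<Integer>(hit);
        };
    }

    // lhs and rhs: a block must be relevant to both sides
    static Expr and(Expr lhs, Expr rhs) {
        return index -> {
            Set<Integer> out = new TreeSet<>(lhs.blocks(index));
            out.retainAll(rhs.blocks(index));
            return out;
        };
    }

    // lhs or rhs: a block relevant to either side must be scanned
    static Expr or(Expr lhs, Expr rhs) {
        return index -> {
            Set<Integer> out = new TreeSet<>(lhs.blocks(index));
            out.addAll(rhs.blocks(index));
            return out;
        };
    }

    // Made-up index contents, for demonstration only.
    static Map<String, Map<String, Set<Integer>>> sampleIndex() {
        Map<String, Set<Integer>> event = new HashMap<>();
        event.put("rare", new TreeSet<>(Arrays.asList(2, 5)));
        event.put("common", new TreeSet<>(Arrays.asList(0, 1, 2, 3, 4, 5)));
        Map<String, Set<Integer>> country = new HashMap<>();
        country.put("us", new TreeSet<>(Arrays.asList(1, 2)));
        country.put("jp", new TreeSet<>(Arrays.asList(5)));
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        index.put("event", event);
        index.put("country", country);
        return index;
    }

    public static void main(String[] args) {
        Expr filter = and(eq("event", "rare"), eq("country", "us"));
        System.out.println("blocks to scan: " + filter.blocks(sampleIndex()));
    }
}
```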
    62. Pig Integration
        event_logs = load '/logs/lots_of_data'
            using ThriftPigLoader('thrift.gen.LogEvent');
        filtered_logs = filter event_logs by event == 'something_rare';
        -- Then do stuff.
    63. Pig Integration
        register elephant-twin-1.0.jar;
        event_logs = load '/logs/lots_of_data'
            using IndexedLZOPigLoader('ThriftPigLoader',
                'thrift.gen.LogEvent', '/user/dmitriy/etwin');
        -- Pig will automatically push this down into the Loader and InputFormat
        filtered_logs = filter event_logs by event == 'something_rare';
    65. Optimization: merge neighbors [diagram: HDFS Block 1, HDFS Block 2]
        • Merge neighbors, share the scan.
        • (Limit expansion to the size of an HDFS block)
    66. Optimization: merge neighbors [diagram: HDFS Block 1, HDFS Block 2]
        • Scans are faster than random reads... allow gaps?
        • Turns out, not that much faster. Better to jump.
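
The merging rule from the two slides above could look roughly like this. It is a sketch under stated assumptions, not the library's implementation: `MergeNeighborsSketch` and `merge` are hypothetical names, spans are `[start, end)` byte offsets sorted by start and non-overlapping, and only strictly adjacent spans are merged (no gaps, following the "better to jump" finding), with the merged span capped at `maxLen` bytes to stand in for the HDFS block size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: merge adjacent matching spans so they share one scan.
class MergeNeighborsSketch {
    // spans: {start, end} pairs, sorted by start, non-overlapping
    static List<long[]> merge(List<long[]> spans, long maxLen) {
        List<long[]> merged = new ArrayList<>();
        long[] current = null;
        for (long[] span : spans) {
            boolean adjacent = current != null && span[0] == current[1];
            boolean fits = current != null && span[1] - current[0] <= maxLen;
            if (adjacent && fits) {
                current[1] = span[1];          // extend the running span
            } else {
                current = new long[] {span[0], span[1]};
                merged.add(current);           // start a new span (a "jump")
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> spans = Arrays.asList(
            new long[] {0, 10}, new long[] {10, 20},
            new long[] {20, 30}, new long[] {50, 60});
        for (long[] span : merge(spans, 25)) {
            System.out.println(Arrays.toString(span));
        }
    }
}
```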
    67. Optimization: combine small splits [diagram: matches in HDFS Block 1 and HDFS Block 2 grouped into one generated split]
        • Combine small relevant spans into single splits.
        • Try to take locality into account.
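
A simple way to picture the combining step is greedy packing: keep adding consecutive spans to the current split until the next one would push it past a target size. The sketch below assumes exactly that and deliberately ignores locality, which the slide says the real code tries to account for. `CombineSplitsSketch`, `combine`, and `targetBytes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: pack small matching spans into fewer, larger splits.
class CombineSplitsSketch {
    // spans: {start, end} byte ranges; each returned inner list is one split
    static List<List<long[]>> combine(List<long[]> spans, long targetBytes) {
        List<List<long[]>> splits = new ArrayList<>();
        List<long[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (long[] span : spans) {
            long length = span[1] - span[0];
            // Close the current split if this span would overflow the target.
            if (!current.isEmpty() && currentBytes + length > targetBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(span);
            currentBytes += length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<long[]> spans = Arrays.asList(
            new long[] {0, 10}, new long[] {40, 50},
            new long[] {90, 100}, new long[] {200, 240});
        List<List<long[]>> splits = combine(spans, 25);
        System.out.println("generated splits: " + splits.size());
    }
}
```

A span larger than the target still becomes its own split, so no matching data is dropped.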
    68. Applicability
        • Most keys occur in very few blocks!
        • Most frequent key only occurs in half the blocks.
    69. Results
        • Applicable jobs take 5-10x fewer resources
        • Ad-hoc jobs particularly likely to benefit
        • “Real” indexes are still faster, but can be represented using the same abstraction
    75. Future Work
        • Regex matching on keys
        • Better Pig pushdown support
        • MultiIndexInputFormat
        • Traditional indexes under ETwin
        • Index maintenance (via HCatalog?)
        Image source: http://en.wikipedia.org/wiki/File:Shasta_dam_under_construction_new_edit.jpg
    76. Questions? @squarecog. Sounds like fun? We are hiring.