Culvert
A secondary indexing framework for BigTable-style databases with HIVE integration

   Ed Kohlwey
   Cloud Computing Team
Session Agenda
•   Secondary Indexing
•   The Solution: Culvert
•   Culvert Design & Architecture
•   How It Works
•   API Examples
•   Where to Get It & Credits
Secondary Indexing
• General design pattern for inverted index
  – Maintain a map from each value to the locations of
    the records/documents that contain it
• Lots of different variations
  – Term partitioned index
  – Document partitioned index
• Solves problem of BigTable-style databases
  only having one primary key for records
Sample Inventory Application
  Foo Table
  RowID    contact:city   contact:phone    inventory:count   order:Apples
  Apples                                   5
  John     Springfield    (999)-888-7777                     3
  Pears                                    10

Sample Term-Partitioned Index Table
                      order:Apples Index
                      RowID
                      3 -> Dave
                      3 -> John
                      17 -> Paul
                      20 -> Sue
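
The term-partitioned table above can be sketched with an in-memory sorted map standing in for the index table. This is only an illustration under that assumption: the class and method names below are hypothetical and are not part of Culvert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a term-partitioned index table.
// Not Culvert code: it only illustrates "map from value to row IDs".
final class TermPartitionedIndexSketch {
    // Sorted by indexed value, the way rows of an index table are sorted by key.
    private final TreeMap<Integer, TreeSet<String>> valueToRowIds = new TreeMap<>();

    // Record that row `rowId` holds `value` in the indexed column.
    void put(String rowId, int value) {
        valueToRowIds.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose indexed value lies in [low, high], in index order.
    List<String> rowIdsInRange(int low, int high) {
        List<String> rowIds = new ArrayList<>();
        for (Map.Entry<Integer, TreeSet<String>> entry
                : valueToRowIds.subMap(low, true, high, true).entrySet()) {
            rowIds.addAll(entry.getValue());
        }
        return rowIds;
    }

    // Builds the order:Apples index from the sample table on the slide.
    static TermPartitionedIndexSketch sampleApplesIndex() {
        TermPartitionedIndexSketch index = new TermPartitionedIndexSketch();
        index.put("Dave", 3);
        index.put("John", 3);
        index.put("Paul", 17);
        index.put("Sue", 20);
        return index;
    }

    public static void main(String[] args) {
        // prints [Dave, John, Paul]
        System.out.println(sampleApplesIndex().rowIdsInRange(3, 17));
    }
}
```

Because the map is sorted by value, a range predicate becomes a contiguous scan, which is the property the index table relies on.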
Sample Inventory Application
        Foo Table
                    RowID                         contact: comments
                    John                          John likes apples.
                    Sue                            Sue likes pears.


  Sample Document-Partitioned Index Table
contact:comments Index

RowID      apples:John   john:John   likes:John   likes:Sue   pears:Sue   sue:Sue
0x178df    -             -           -
0x32da4                                           -           -           -
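
A rough sketch of how the document-partitioned layout above could be produced: each index row is a partition, and each column qualifier is a `term:rowId` pair. The tokenizer and all names are assumptions made for illustration. The slides do not say how Culvert derives partition IDs such as 0x178df, so the caller supplies them here.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a document-partitioned index; not Culvert code.
final class DocumentPartitionedIndexSketch {
    // partition row ID -> sorted "term:rowId" column qualifiers
    private final Map<String, TreeSet<String>> partitions = new TreeMap<>();

    // Tokenize `text` naively (lowercase, split on non-letters) and add one
    // "term:rowId" column per term to the given partition row.
    void index(String partition, String rowId, String text) {
        TreeSet<String> columns =
                partitions.computeIfAbsent(partition, p -> new TreeSet<>());
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) {
                columns.add(term + ":" + rowId);
            }
        }
    }

    // Column qualifiers stored in one partition row (empty if none).
    TreeSet<String> columnsOf(String partition) {
        return partitions.getOrDefault(partition, new TreeSet<>());
    }

    public static void main(String[] args) {
        DocumentPartitionedIndexSketch index = new DocumentPartitionedIndexSketch();
        index.index("0x178df", "John", "John likes apples.");
        index.index("0x32da4", "Sue", "Sue likes pears.");
        // prints [apples:John, john:John, likes:John]
        System.out.println(index.columnsOf("0x178df"));
    }
}
```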
We found ourselves implementing
these ideas over and over for clients.

        Why not make a library?
Solution: Culvert
Requirements
• Support secondary indexing
• Support an analyst query environment
• Database Extensibility
   – There are actually many BigTable implementations out
     there (HBase, Cassandra, proprietary)
• Internal Extensibility
   – There are many ways to index records
   – There are many ways to retrieve records
   – Separate retrieval operations from index
     implementation
What Culvert Does
• Indexing
• Interface for queries (Java and HIVE)
• Abstraction mechanism for multiple
  underlying databases
Culvert Design & Architecture
• Use sorted iterators to retrieve values
   – Many algorithms can be expressed as sorting (as is
     common in Map/Reduce)
   – Optional “dumping” feature can provide parallelism
• Decorator design pattern is intuitive to interact
  with
• Allows streaming of results as they become
  available
• Uses Coprocessors to implement parallel
  operations
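
To make the decorator and streaming points concrete, here is a minimal sketch of an iterator that wraps another iterator and passes through only matching items. It is not Culvert's Constraint API; `FilteringIterator` is a hypothetical name. It preserves the order of its source, so sorted input stays sorted, and it yields each result as soon as it is found.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical decorator over an iterator; illustrates the pattern only.
final class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T next;
    private boolean hasNext;

    FilteringIterator(Iterator<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
        advance();
    }

    // Pull from the wrapped iterator until one item passes the predicate.
    private void advance() {
        hasNext = false;
        while (source.hasNext()) {
            T candidate = source.next();
            if (keep.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    @Override
    public T next() {
        if (!hasNext) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> names = List.of("Dean", "George", "Karen", "Paul").iterator();
        // Decorators stack: each one wraps the iterator beneath it.
        Iterator<String> longNames = new FilteringIterator<>(names, s -> s.length() > 4);
        Iterator<String> result = new FilteringIterator<>(longNames, s -> !s.startsWith("K"));
        while (result.hasNext()) {
            System.out.println(result.next()); // prints George
        }
    }
}
```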
Architecture Diagram
  Top layer:    Java API | Hive
  Client side:  Culvert Client-Side Operation (TableAdapter, Constraint, Client)
  Region side:  Culvert Region-Side Operation (LocalTableAdapter, RemoteOp), shown twice in the diagram
Constraint Architecture
• Used to express query predicate operations
  – projection and selection (SELECT)
  – set operations (AND/OR)
  – joins
• Decoupled from Indices
  – Currently focused on term-partitioned indices
  – Future work includes expanding document-
    partitioned index functionality
Index Architecture
• Index is an abstract type
  – Defines how to store and use the index
• One index per column
  – Didn’t see a performance reason to index over
    multiple columns
  – Multi-column indices would complicate framework code
  – Map of “logical fields” was more easily maintained
    in the application
  – May evolve in the future
Index Architecture (cont.)
• One index table per index
  – Allows Index implementations to assume they
    don’t share the index table
  – Don’t need to worry about other Indices
    clobbering their table structure
  – Tables are assumed to be cheap
Table Adapters
• TableAdapter and LocalTableAdapter are
  abstraction mechanisms, roughly equivalent
  to HTable and HRegion
• RemoteOp is roughly equivalent to
  CoprocessorProtocol and is handled by
  TableAdapter and LocalTableAdapter
• Gives implementers fine-grained control over
  parallelism + table operations
Using Culvert With HIVE
• Why HIVE?
  – Already very popular
  – Take advantage of upstream advances
  – Good framework to “optimize later”
• Culvert implements a HIVE StorageHandler
  and PredicateHandler
• Facilitates analyst interaction with database
• Reduces the “SQL Gap”
HIVE Culvert Input Format
• Handles AND, >, < query predicates based on
  indices
• Each index can be broken up into fragments
  based on region start and end keys
  – We take the cross-product of each index's regions
    to create input splits for AND
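
The cross-product step can be sketched as follows. This is an assumption-laden illustration, not the Culvert input format: `KeyRange`, `splitsForAnd`, and the sample keys are all hypothetical, and a real Hive InputFormat would wrap each pair in an input split object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Culvert code: pairs every region of one index
// with every region of another to form the work units for an AND query.
final class IndexSplitSketch {
    // A region of an index table, bounded by its start and end keys.
    static final class KeyRange {
        final String startKey;
        final String endKey;

        KeyRange(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + ", " + endKey + ")";
        }
    }

    // Cross-product: one split per (left region, right region) pair.
    static List<KeyRange[]> splitsForAnd(List<KeyRange> left, List<KeyRange> right) {
        List<KeyRange[]> splits = new ArrayList<>();
        for (KeyRange l : left) {
            for (KeyRange r : right) {
                splits.add(new KeyRange[] {l, r});
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<KeyRange> apples = List.of(new KeyRange("0", "5"), new KeyRange("5", "9"));
        List<KeyRange> oranges = List.of(
                new KeyRange("0", "3"), new KeyRange("3", "6"), new KeyRange("6", "9"));
        // 2 regions x 3 regions = 6 splits
        for (KeyRange[] split : splitsForAnd(apples, oranges)) {
            System.out.println(split[0] + " x " + split[1]);
        }
    }
}
```

The number of splits grows as the product of the region counts, which is worth keeping in mind for indices with many regions.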
How It Works

Overview of Indexing Operations
Indexing
• Indices are built via insertion operations on
  the client (i.e. Client.put(…))
• Whether a field is indexed is controlled by a
  configuration file
• In the future, will support indexing of arbitrary
  columns via Map/Reduce
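
A minimal sketch of index maintenance on insertion, assuming in-memory maps in place of real tables. The class is hypothetical and is not Culvert's Client; the set of indexed columns stands in for the configuration file, and each indexed column gets its own index table. The sketch does not remove stale index entries when a value is overwritten.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical client that updates index tables as part of put().
final class IndexingClientSketch {
    // Stand-in for the configuration file: which columns are indexed.
    private final Set<String> indexedColumns;
    // Primary table: rowId -> column -> value.
    private final Map<String, Map<String, String>> primaryTable = new TreeMap<>();
    // One index table per indexed column: value -> row IDs.
    private final Map<String, TreeMap<String, TreeSet<String>>> indexTables = new TreeMap<>();

    IndexingClientSketch(Set<String> indexedColumns) {
        this.indexedColumns = indexedColumns;
    }

    // Write to the primary table and, if the column is indexed, to its index table.
    void put(String rowId, String column, String value) {
        primaryTable.computeIfAbsent(rowId, r -> new TreeMap<>()).put(column, value);
        if (indexedColumns.contains(column)) {
            indexTables.computeIfAbsent(column, c -> new TreeMap<>())
                    .computeIfAbsent(value, v -> new TreeSet<>())
                    .add(rowId);
        }
    }

    // Row IDs holding `value` in `column`, answered from the index table only.
    Set<String> lookup(String column, String value) {
        TreeMap<String, TreeSet<String>> index = indexTables.get(column);
        if (index == null || !index.containsKey(value)) {
            return new TreeSet<>();
        }
        return index.get(value);
    }

    boolean hasIndexTable(String column) {
        return indexTables.containsKey(column);
    }

    public static void main(String[] args) {
        IndexingClientSketch client = new IndexingClientSketch(Set.of("order:Apples"));
        client.put("John", "contact:city", "Springfield");
        client.put("John", "order:Apples", "3");
        client.put("Dave", "order:Apples", "3");
        System.out.println(client.lookup("order:Apples", "3")); // prints [Dave, John]
    }
}
```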
Retrieval
• Query API is exposed via HIVE and Java
  – HIVE API delegates to Java API
  – Java API is based on subclasses of Constraint
• Focused on providing parallel, real-time query
  execution
Walkthrough of Logical
Operations on Indices
Logical Operations on Indices
• Logical operations can be represented as a merge
  sort if we return the keys from the original table
  in sorted order
• Example: AND
orders:Apples Index             orders:Oranges Index
1 -> Dean                       4 -> Dean
3 -> Susan                      5 -> Susan
4 -> John                       5 -> Paul
8 -> Paul                       6 -> George
14 -> Renee                     12 -> Karen
33 -> Sheryl                    19 -> Tom
Apples <= 3 AND Oranges >= 5
• First query each index


orders:Apples Index          orders:Oranges Index
1 -> Dean                    4 -> Dean
3 -> Susan                   5 -> Susan
4 -> John                    5 -> Paul
8 -> Paul                    6 -> George
14 -> Renee                  12 -> Karen
33 -> Sheryl                 19 -> Tom
Apples <= 3 AND Oranges >= 5
• Then order results for each index
• Happens on the region servers


1 -> Dean
3 -> Susan                    5 -> Susan
                              5 -> Paul
                              6 -> George
                              12 -> Karen
                              19 -> Tom
Apples <= 3 AND Oranges >= 5
• Then order results for each index
• Happens on the region servers


Dean
Susan                         Susan
                              Paul
                              George
                              Karen
                              Tom
Apples <= 3 AND Oranges >= 5
• Then order results for each index
• Notice this happens on the region servers*
Done

Dean
Susan                        Susan
                             Paul
                             George
                             Karen
                             Tom
Apples <= 3 AND Oranges >= 5
• Then order results for each index
• Notice this happens on the region servers*
Done

Dean                         Done
Susan                        George
                             Karen
                             Paul
                             Susan
                             Tom
Apples <= 3 AND Oranges >= 5
• Then merge the sorted results on the client



Dean
Susan                         George
                              Karen
                              Paul
                              Susan
                              Tom
Apples <= 3 AND Oranges >= 5
• Dean is lowest, Dean is not on the head of all
  the queues, discard


Dean
Susan                         George
                              Karen
                              Paul
                              Susan
                              Tom
Apples <= 3 AND Oranges >= 5
• George is lowest, George is not on the head of
  all queues, discard


Dean
Susan                         George
                              Karen
                              Paul
                              Susan
                              Tom
Apples <= 3 AND Oranges >= 5
• Continue…



Dean
Susan                    George
                         Karen
                         Paul
                         Susan
                         Tom
Apples <= 3 AND Oranges >= 5
  • Susan is on the head of all the queues, return
    Susan


  Dean
✔ Susan                         George
                                Karen
                                Paul
                                Susan                ✔
                                Tom
Apples <= 3 AND Oranges >= 5
  • Tom is discarded, now we’re finished



  Dean
✔ Susan                        George
                               Karen
                               Paul
                               Susan       ✔
                               Tom
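
The walk-through above can be sketched as a merge over sorted queues of row IDs: take the lowest head, return it if it is at the head of every queue, otherwise discard it, and stop once any queue runs out. The class below is a hypothetical illustration of that procedure and is not Culvert's And constraint.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of AND as a merge over sorted row-ID queues.
final class SortedAndSketch {
    // Each input list must already be sorted in ascending order.
    static List<String> and(List<List<String>> sortedInputs) {
        List<Deque<String>> queues = new ArrayList<>();
        for (List<String> input : sortedInputs) {
            queues.add(new ArrayDeque<>(input));
        }
        List<String> matches = new ArrayList<>();
        if (queues.isEmpty()) {
            return matches;
        }
        while (noneEmpty(queues)) {
            // Find the lowest row ID among the queue heads.
            String lowest = queues.get(0).peekFirst();
            for (Deque<String> queue : queues) {
                if (queue.peekFirst().compareTo(lowest) < 0) {
                    lowest = queue.peekFirst();
                }
            }
            // It matches only if it is at the head of every queue.
            boolean onAllHeads = true;
            for (Deque<String> queue : queues) {
                if (!queue.peekFirst().equals(lowest)) {
                    onAllHeads = false;
                }
            }
            if (onAllHeads) {
                matches.add(lowest);
            }
            // Returned or discarded, remove it from every queue that holds it.
            for (Deque<String> queue : queues) {
                if (queue.peekFirst().equals(lowest)) {
                    queue.pollFirst();
                }
            }
        }
        return matches;
    }

    private static boolean noneEmpty(List<Deque<String>> queues) {
        for (Deque<String> queue : queues) {
            if (queue.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> apples = List.of("Dean", "Susan");
        List<String> oranges = List.of("George", "Karen", "Paul", "Susan", "Tom");
        System.out.println(and(List.of(apples, oranges))); // prints [Susan]
    }
}
```

Once the Apples queue is exhausted the loop ends, which is why Tom is discarded without being compared.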
Joins
• Numerous methods possible
• A few examples
  – Use sub-queries to fetch related records
  – Use merge sorting to simultaneously fetch records
    satisfying both sides of the join, filter those that
    don’t match
• Presently, Culvert has only one join (the sub-queries method)
Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
• User performs joins with a constraint (decorator design pattern)
  Diagram: JoinConstraint
Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
• Constraint receives row IDs from a left sub-constraint.
  Diagram: Left SubConstraint (…, John, …) feeds JoinConstraint
Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
• Constraint looks up field values for the left side (if not
  already present in the results)
  Diagram: Left SubConstraint (…, John, …) feeds JoinConstraint

  order:Apples
  …      …
  John   5
  …      …
Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
• For each record in the left result set, the constraint creates
  a new right-side constraint to fetch indexed items matching
  the right side of the constraint.
  Diagram: Left SubConstraint (…, John, …) feeds JoinConstraint

  order:Apples          order:Oranges
  …      …              …        …
  John   5              George   5
  …      …              Jane     5
                        …        …
Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
• Finally, the joined records are returned.
  Diagram: Left SubConstraint (…, John, …) feeds JoinConstraint

  order:Apples          order:Oranges          Joined records
  …      …              …        …             …      …   …
  John   5              George   5             John   5   George
  …      …              Jane     5             John   5   Jane
                        …        …             …      …   …
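
The sub-query join steps above can be sketched with in-memory maps in place of tables. This is an assumption-based illustration, not Culvert's JoinConstraint: the class and method names are hypothetical, and the right-side "constraint" is reduced to a point lookup in a sorted index map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the sub-query join method.
final class SubQueryJoinSketch {
    // For each left row ID, look up its left-side value, then query the
    // right-side index for rows holding the same value.
    static List<String[]> join(List<String> leftRowIds,
                               Map<String, Integer> leftValues,
                               TreeMap<Integer, List<String>> rightIndex) {
        List<String[]> joined = new ArrayList<>();
        for (String leftRow : leftRowIds) {
            Integer value = leftValues.get(leftRow);
            if (value == null) {
                continue; // no left-side value, nothing to join on
            }
            // One right-side lookup per left record.
            List<String> rightRows = rightIndex.getOrDefault(value, List.of());
            for (String rightRow : rightRows) {
                joined.add(new String[] {leftRow, String.valueOf(value), rightRow});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        TreeMap<Integer, List<String>> orangesIndex = new TreeMap<>();
        orangesIndex.put(5, List.of("George", "Jane"));
        List<String[]> rows = join(List.of("John"), Map.of("John", 5), orangesIndex);
        for (String[] row : rows) {
            // prints "John 5 George" then "John 5 Jane"
            System.out.println(String.join(" ", row));
        }
    }
}
```

Issuing one right-side lookup per left record is simple, but its cost grows with the size of the left result set. The merge-sort join listed on the Joins slide is the alternative.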
Culvert Java API Examples
• Goal: to be intuitive and easy to interact with
• Provide a simple relational API without forcing
  a developer to use SQL
Culvert API Example: Insertion
Configuration culvertConf = CConfiguration.getDefault();
// index definitions are loaded implicitly from the
// configuration
Client client = new Client(culvertConf);
List<CKeyValue> valuesToPut = Lists.newArrayList();
valuesToPut.add(new CKeyValue(
      "foo".getBytes(),
      "bar".getBytes(),
      "baz".getBytes()));
Put put = new Put(valuesToPut);
client.put("tableName", put);
Culvert API Example: Retrieval
Configuration culvertConf = CConfiguration.getDefault();
// index definitions are loaded implicitly from the configuration
Client client = new Client(culvertConf);
Index c1Index = client.getIndexByName("index1");
Constraint c1Constraint = new IndexRangeConstraint(
      c1Index, new CRange(
            "abba".getBytes(),
            "cadabra".getBytes()));
Index[] c2Indices = client.getIndicesForColumn(
      "rabbit".getBytes(),
      "hat".getBytes());
Constraint c2Constraint = new IndexRangeConstraint(
      c2Indices[0],
      new CRange("bar".getBytes(), "foo".getBytes()));
Constraint and = new And(c1Constraint, c2Constraint);
Iterator<Result> results = client.query("tablename", and);
Future Work
• (Re)Building Indices via Map/Reduce
• More index types
  – Document-partitioned
  – Others?
• More retrieval operations
• Profiling + tuning
• Storing configuration details in a table or in
  Zookeeper
Where to Get It*

http://github.com/booz-allen-hamilton/culvert


          Where to Tweet It

                  #culvert
                                       *Available 6/29/2011
Culvert Team
•   Ed Kohlwey (@ekohlwey)
•   Jesse Yates (@jesse_yates)
•   Jeremy Walsh
•   Tomer Kishoni (@tokbot)
•   Jason Trost (@jason_trost)
Questions?

Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

  • 1. Culvert A secondary indexing framework for BigTable-style databases with HIVE integration Ed Kohlwey Cloud Computing Team
  • 2. Session Agenda • Secondary Indexing • The Solution: Culvert • Culvert Design & Architecture • How It Works • API Examples • Where to Get It & Credits
  • 3. Secondary Indexing • General design pattern for inverted index – Maintain a map from value to location of records/documents that contain them • Lots of different variations – Term partitioned index – Document partitioned index • Solves problem of BigTable-style databases only having one primary key for records
  • 4. Sample Inventory Application
      Foo Table
      RowID   contact: city  contact: phone  inventory:count  order:Apples
      Apples                                 5
      John    Springfield    (999)-888-7777                   3
      Pears                                  10
      Sample Term-Partitioned Index Table (order:Apples Index)
      RowID
      3 -> Dave
      3 -> John
      17 -> Paul
      20 -> Sue
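The term-partitioned index on slide 4 can be sketched in plain Java as a sorted map from indexed value to the row IDs that hold it. This is a minimal illustration under our own assumptions, not Culvert's implementation: the class and method names (`TermPartitionedIndexSketch`, `put`, `rowsInRange`, `sample`) are hypothetical, and an in-memory `TreeMap` stands in for the index table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Culvert API.
class TermPartitionedIndexSketch {
    // Indexed value -> sorted row IDs of the records holding that value.
    private final NavigableMap<Integer, TreeSet<String>> index = new TreeMap<>();

    // Record that row `rowId` stores `value` in the indexed column.
    void put(int value, String rowId) {
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose indexed value lies in [low, high], in index order.
    List<String> rowsInRange(int low, int high) {
        List<String> rows = new ArrayList<>();
        for (TreeSet<String> ids : index.subMap(low, true, high, true).values()) {
            rows.addAll(ids);
        }
        return rows;
    }

    // The order:Apples index from the slide.
    static TermPartitionedIndexSketch sample() {
        TermPartitionedIndexSketch idx = new TermPartitionedIndexSketch();
        idx.put(3, "Dave");
        idx.put(3, "John");
        idx.put(17, "Paul");
        idx.put(20, "Sue");
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(sample().rowsInRange(3, 17)); // prints [Dave, John, Paul]
    }
}
```

Because the map is sorted by value, a range predicate becomes a contiguous scan, which is the property a BigTable-style index table provides through its sorted row keys.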
  • 5. Sample Inventory Application
      Foo Table
      RowID   contact: comments
      John    John likes apples.
      Sue     Sue likes pears.
      Sample Document-Partitioned Index Table (contact:comments Index)
      RowID     apples:john  john:John  likes:John  likes:Sue  pears:Sue  sue:Sue
      0x178df   -            -          -
      0x32da4                                       -          -          -
  • 6. We found ourselves implementing these ideas over and over for clients. Why not make a library?
  • 8. Requirements • Support secondary indexing • Support an analyst query environment • Database Extensibility – There’s actually a lot of BigTable implementations out there (HBase, Cassandra, proprietary) • Internal Extensibility – There’s lots of ways to index records – There’s lots of ways to retrieve records – Separate retrieval operations from index implementation
  • 9. What Culvert Does • Indexing • Interface for queries (Java and HIVE) • Abstraction mechanism for multiple underlying databases
  • 10. Culvert Design & Architecture • Use sorted iterators to retrieve values – Lots of algorithms can be expressed as sorting (like people tend to do in Map/Reduce) – Optional “dumping” feature can provide parallelism • Decorator design pattern is intuitive to interact with • Allows streaming of results as they become available • Uses Coprocessors to implement parallel operations
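The combination of iterators and the decorator pattern described on slide 10 can be illustrated with a small wrapper that streams results as they become available. This is a sketch under our own assumptions: `FilteringIterator` is a hypothetical name and is not claimed to exist in Culvert.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical decorator: wraps any Iterator and passes through only the
// elements accepted by the predicate, preserving the inner iterator's order.
class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;
    private final Predicate<T> keep;
    private T buffered;
    private boolean ready;

    FilteringIterator(Iterator<T> inner, Predicate<T> keep) {
        this.inner = inner;
        this.keep = keep;
    }

    @Override
    public boolean hasNext() {
        // Pull from the wrapped iterator lazily, one element at a time.
        while (!ready && inner.hasNext()) {
            T candidate = inner.next();
            if (keep.test(candidate)) {
                buffered = candidate;
                ready = true;
            }
        }
        return ready;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ready = false;
        T result = buffered;
        buffered = null;
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> sortedRowIds = Arrays.asList("Dean", "George", "Susan").iterator();
        Iterator<String> filtered =
            new FilteringIterator<String>(sortedRowIds, id -> !id.equals("George"));
        while (filtered.hasNext()) {
            System.out.println(filtered.next());
        }
    }
}
```

Since each decorator is itself an `Iterator`, decorators can be stacked, and a sorted input stays sorted after filtering, so later stages can still merge it with other sorted streams.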
  • 11. Architecture Diagram Java API Hive Culvert Client-Side Operation TableAdapter Constraint Client Culvert Region-Side Operation Culvert Region-Side Operation LocalTableAdapter RemoteOp LocalTableAdapter RemoteOp
  • 12. Constraint Architecture • Used to express query predicate operations – projection and selection (SELECT) – set operations (AND/OR) – joins • Decoupled from Indices – Currently focused on term-partitioned indices – Future work includes expanding document-partitioned index functionality
  • 13. Index Architecture • Index is an abstract type – Defines how to store and use the index • One index per column – Didn’t see a performance reason to index over multiple columns – Multiple indices complicate framework code – Map of “logical fields” was more easily maintained in the application – May evolve in the future
  • 14. Index Architecture (cont.) • One index table per index – Allows Index implementations to assume they don’t share the index table – Don’t need to worry about other Indices clobbering their table structure – Tables are assumed to be cheap
  • 15. Table Adapters • TableAdapter and LocalTableAdapter are abstraction mechanisms, roughly equivalent to HTable and HRegion • RemoteOp is roughly equivalent to CoprocessorProtocol and is handled by TableAdapter and LocalTableAdapter • Gives implementers fine-grained control over parallelism + table operations
  • 16. Using Culvert With HIVE • Why HIVE? – Already very popular – Take advantage of upstream advances – Good framework to “optimize later” • Culvert implements a HIVE StorageHandler and PredicateHandler • Facilitates analyst interaction with database • Reduces the “SQL Gap”
  • 17. HIVE Culvert Input Format • Handles AND, >, < query predicates based on indices • Each index can be broken up into fragments based on region start and end keys – We take the cross-product of each index's regions to create input splits for AND
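The cross-product of index regions mentioned on slide 17 can be sketched as a Cartesian product over per-index region lists. This is a hedged illustration only: `IndexSplitSketch` and `crossProduct` are hypothetical names, and regions are represented as plain string labels rather than real start and end keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not Culvert's input format code.
class IndexSplitSketch {
    // Produces one split per combination of one region taken from each index.
    static List<List<String>> crossProduct(List<List<String>> regionsPerIndex) {
        List<List<String>> splits = new ArrayList<>();
        splits.add(new ArrayList<>()); // start with a single empty combination
        for (List<String> regions : regionsPerIndex) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : splits) {
                for (String region : regions) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(region);
                    next.add(extended);
                }
            }
            splits = next;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed example: the first index has 2 regions, the second has 3.
        List<List<String>> splits = crossProduct(Arrays.asList(
            Arrays.asList("apples-r1", "apples-r2"),
            Arrays.asList("oranges-r1", "oranges-r2", "oranges-r3")));
        System.out.println(splits.size() + " splits");
    }
}
```

The number of splits is the product of the region counts, so an AND over two indices with 2 and 3 regions yields 6 splits, each of which can be processed independently.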
  • 18. How It Works Overview of Indexing Operations
  • 19. Indexing • Indices are built via insertion operations on the client (i.e. Client.put(…)) • Whether a field is indexed is controlled by a configuration file • In the future, will support indexing of arbitrary columns via Map/Reduce
  • 20. Retrieval • Query API is exposed via HIVE and Java – HIVE API delegates to Java API – Java API is based on subclasses of Constraint • Focused on providing parallel, real-time query execution
  • 22. Logical Operations on Indices • Logical operations can be represented as a merge sort if we return the keys from the original table in sorted order • Example: AND
      orders:Apples Index   orders:Oranges Index
      1 -> Dean             4 -> Dean
      3 -> Susan            5 -> Susan
      4 -> John             5 -> Paul
      8 -> Paul             6 -> George
      14 -> Renee           12 -> Karen
      33 -> Sheryl          19 -> Tom
  • 23. Apples < 3 AND Oranges > 5 • First query each index
      orders:Apples Index   orders:Oranges Index
      1 -> Dean             4 -> Dean
      3 -> Susan            5 -> Susan
      4 -> John             5 -> Paul
      8 -> Paul             6 -> George
      14 -> Renee           12 -> Karen
      33 -> Sheryl          19 -> Tom
  • 24. Apples < 3 AND Oranges > 5 • Then order results for each index • Happens on the region servers 1 -> Dean 3 -> Susan 5 -> Susan 5 -> Paul 6 -> George 12 -> Karen 19 -> Tom
  • 25. Apples < 3 AND Oranges > 5 • Then order results for each index • Happens on the region servers Dean Susan Susan Paul George Karen Tom
  • 26. Apples < 3 AND Oranges > 5 • Then order results for each index • Notice this happens on the region servers* Done Dean Susan Susan Paul George Karen Tom
  • 27. Apples < 3 AND Oranges > 5 • Then order results for each index • Notice this happens on the region servers* Done Dean Done Susan George Karen Paul Susan Tom
  • 28. Apples < 3 AND Oranges > 5 • Then merge the sorted results on the client Dean Susan George Karen Paul Susan Tom
  • 29. Apples < 3 AND Oranges > 5 • Dean is lowest, Dean is not on the head of all the queues, discard Dean Susan George Karen Paul Susan Tom
  • 30. Apples < 3 AND Oranges > 5 • George is lowest, George is not on the head of all queues, discard Dean Susan George Karen Paul Susan Tom
  • 31. Apples < 3 AND Oranges > 5 • Continue… Dean Susan George Karen Paul Susan Tom
  • 32. Apples < 3 AND Oranges > 5 • Susan is on the head of all the queues, return Susan Dean ✔ Susan George Karen Paul Susan ✔ Tom
  • 33. Apples < 3 AND Oranges > 5 • Tom is discarded, now we’re finished Dean ✔ Susan George Karen Paul Susan ✔ Tom
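The client-side merge walked through on slides 28 to 33 can be sketched as an intersection of sorted queues: repeatedly take the lowest head, return it only if it is at the head of every queue, and otherwise discard it. This is a minimal sketch under our own assumptions, not Culvert's code; `SortedAndSketch` and `and` are hypothetical names, and the queue contents in `main` are our reading of the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of AND as a merge over sorted row-ID queues.
class SortedAndSketch {
    // Returns the row IDs present in every input; each input must be sorted ascending.
    static List<String> and(List<List<String>> sortedRowIds) {
        List<Deque<String>> queues = new ArrayList<>();
        for (List<String> ids : sortedRowIds) {
            queues.add(new ArrayDeque<>(ids));
        }
        List<String> matches = new ArrayList<>();
        if (queues.isEmpty()) {
            return matches;
        }
        while (true) {
            String lowest = null;
            boolean onAllHeads = true;
            for (Deque<String> queue : queues) {
                if (queue.isEmpty()) {
                    return matches; // an exhausted queue rules out further matches
                }
                String head = queue.peekFirst();
                if (lowest == null) {
                    lowest = head;
                } else if (!head.equals(lowest)) {
                    onAllHeads = false;
                    if (head.compareTo(lowest) < 0) {
                        lowest = head;
                    }
                }
            }
            if (onAllHeads) {
                matches.add(lowest); // lowest is at the head of every queue: return it
            }
            for (Deque<String> queue : queues) {
                if (queue.peekFirst().equals(lowest)) {
                    queue.pollFirst(); // consume the lowest head, matched or discarded
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> result = and(Arrays.asList(
            Arrays.asList("Dean", "Susan"),
            Arrays.asList("George", "Karen", "Paul", "Susan", "Tom")));
        System.out.println(result); // prints [Susan]
    }
}
```

One difference from the slides: the sketch stops as soon as any queue is exhausted, so Tom is never examined rather than being explicitly discarded. The result is the same.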
  • 34. Joins • Numerous methods possible • A few examples – Use sub-queries to fetch related records – Use merge sorting to simultaneously fetch records satisfying both sides of the join, filter those that don’t match • Presently, Culvert has only one join (sub-queries method)
  • 35. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) User performs joins with a JoinConstraint constraint (decorator design pattern)
  • 36. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Constraint receives row IDs from a left sub-constraint. [Diagram: Left SubConstraint passes row IDs (…, John, …) to JoinConstraint]
  • 37. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Constraint looks up field values for the left side (if not already present in the results). [Diagram: Left SubConstraint passes row IDs (…, John, …) to JoinConstraint; order:Apples lookup: John -> 5]
  • 38. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • For each record in the left result set, the constraint creates a new right-side constraint to fetch indexed items matching the right side of the constraint. [Diagram: order:Apples: John -> 5; order:Oranges: George -> 5, Jane -> 5]
  • 39. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Finally, the joined records are returned. [Diagram: JoinConstraint output: (John, 5, George), (John, 5, Jane); order:Apples: John -> 5; order:Oranges: George -> 5, Jane -> 5]
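The sub-query join walked through on slides 35 to 39 can be sketched as a nested lookup: for each left row ID, fetch its left-side value, then query the right-side index for rows with the same value. This is a simplified sketch under our own assumptions; `SubQueryJoinSketch` and `join` are hypothetical names, and in-memory maps stand in for the table and the order:Oranges index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of a join implemented with per-record sub-queries.
class SubQueryJoinSketch {
    // Each output row is (left row ID, shared value, right row ID).
    static List<List<String>> join(List<String> leftRowIds,
                                   Map<String, Integer> leftValues,
                                   NavigableMap<Integer, TreeSet<String>> rightIndex) {
        List<List<String>> joined = new ArrayList<>();
        for (String leftRow : leftRowIds) {
            Integer value = leftValues.get(leftRow); // look up the left-side field value
            if (value == null) {
                continue; // the left row has no value for the join column
            }
            TreeSet<String> rightRows = rightIndex.get(value); // "sub-query" on the right index
            if (rightRows == null) {
                continue; // nothing on the right side matches
            }
            for (String rightRow : rightRows) {
                joined.add(Arrays.asList(leftRow, String.valueOf(value), rightRow));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Integer> apples = new HashMap<>();
        apples.put("John", 5);
        NavigableMap<Integer, TreeSet<String>> orangesIndex = new TreeMap<>();
        orangesIndex.put(5, new TreeSet<>(Arrays.asList("George", "Jane")));
        System.out.println(join(Arrays.asList("John"), apples, orangesIndex));
        // prints [[John, 5, George], [John, 5, Jane]]
    }
}
```

This approach issues one right-side lookup per left record, which is simple but can be costly for large left result sets; the merge-sort alternative listed on slide 34 avoids that at the price of requiring both sides in sorted order.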
  • 40. Culvert Java API Examples • Goal: to be intuitive and easy to interact with • Provide a simple relational API without forcing a developer to use SQL
  • 41. Culvert API Example: Insertion
      Configuration culvertConf = CConfiguration.getDefault();
      // index definitions are loaded implicitly from the configuration
      Client client = new Client(culvertConf);
      List<CKeyValue> valuesToPut = Lists.newArrayList();
      valuesToPut.add(new CKeyValue(
          "foo".getBytes(), "bar".getBytes(), "baz".getBytes()));
      Put put = new Put(valuesToPut);
      client.put("tableName", put);
  • 42. Culvert API Example: Retrieval
      Configuration culvertConf = CConfiguration.getDefault();
      // index definitions are loaded implicitly from the configuration
      Client client = new Client(culvertConf);
      Index c1Index = client.getIndexByName("index1");
      Constraint c1Constraint = new IndexRangeConstraint(
          c1Index, new CRange("abba".getBytes(), "cadabra".getBytes()));
      Index[] c2Indices = client.getIndicesForColumn(
          "rabbit".getBytes(), "hat".getBytes());
      Constraint c2Constraint = new IndexRangeConstraint(
          c2Indices[0], new CRange("bar".getBytes(), "foo".getBytes()));
      Constraint and = new And(c1Constraint, c2Constraint);
      Iterator<Result> results = client.query("tablename", and);
  • 43. Future Work • (Re)Building Indices via Map/Reduce • More index types – Document-partitioned – Others? • More retrieval operations • Profiling + tuning • Storing configuration details in a table or in Zookeeper
  • 44. Where to Get It* http://github.com/booz-allen-hamilton/culvert Where to Tweet It #culvert *Available 6/29/2011
  • 45. Culvert Team • Ed Kohlwey (@ekohlwey) • Jesse Yates (@jesse_yates) • Jeremy Walsh • Tomer Kishoni (@tokbot) • Jason Trost (@jason_trost)

Editor's Notes

  1. Just say the bullet points,