SlideShare a Scribd company logo
1 of 32
A Big Data Refinery Built on HBase

        Stanislav Barton
   Internet Memory Research
A Content-oriented platform
• Mignify = a content-oriented platform which
    – Continuously (almost) ingests Web documents
    – Stores and preserves these documents as such AND
    – Produces structured content extracted (from single
      documents) and aggregated (from groups of documents)
    – Organizes both raw documents, extracted and aggregated
      information in a consistent information space.
=> Original documents and extracted information are uniformly
   stored in HBase
A Service-oriented platform
• Mignify = a service-oriented platform
  – Physical storage layer on “custom hardware”
  – Crawl on-demand, with sophisticated navigation options
  – Able to host third-party extractors, aggregators and
    classifiers
  – Run new algorithms on existing collections as they arrive
  – Supports search, navigation, on-demand query and
    extraction
=> A « Web Observatory » built around Hadoop/Hbase.



   –
Customers/Users
• Web Archivists – store and organize, live
  search
• Search engineers – organize and refine, big
  throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine
Talk Outline
•   Architecture Overview
•   Use Cases/Scenarios
•   Data model
•   Queries/ Query language / Query processing
•   Usage Examples
•   Alternative HW platform
Overview of Mignify
Typical scenarios
• Full text indexers
   – Collect documents, with metadata and graph
     information
• Wrapper extraction
   – Get structured information from web sites
• Entity annotation
   – Annotate documents with entity references
• Classification
   – Aggregate subsets (e.g., domains); assign them to
     topics
Data in HBase
• HBase as a first choice data store:
  – Inherent versioning (timestamped values)
  – Real time access (Cache, index, key ordering)
  – Column oriented storage
  – Seamless integration with Hadoop
  – Big community
  – Production ready/mature implementation
Data model
• Data stored in one big table along with
  metadata and extraction results
• Though separated in column families
  – CFs as secondary indices
• Raw data stored in HBase ( < 10 MB, HDFS
  otherwise)
• Data stored as rows (versioned)
• Values are typed
Types and schema,
               where and why
• Initially, our collections consists of purely non-
  structured data
• All the point for Mignify is to produce a
  backbone of structured information,
• Done through an iterative process which
  progressively builds a rich set of interrelated
  annotations
• Typing = important to ensure the safety of the
  whole process, automation
Data model II
                                                                    A= <CF, Qualifier>:Type
                                                  Schema            Type:Writable or byte[]


          Resource               Version1:(A1.. Ak), Version2:(Ak+1, … Am),…
mignify




           Versions          <t’,{V1, … Vk}>, <t’’,{Vk+1, … Vm}>,…<t’’’, Vm+1,… Vl>


                            <(CF1,Qa,t’),v1>,<(CF1,Qb,t’’),v2>, … <(CFn,Qz,t’’’),vm>
           HTable                               v1,… vm:byte[]
HBase




           HFiles     CF1                CF2                               CFn
Extraction Platform
                   Main Principles
 A framework for specifying data extraction from very
  large datasets
    Easy integration and application of new extractors
 High level of genericity in terms of (i) data sources,
  (ii) extractors, and (iii) data sinks
    An extraction process specification combines these
     elements
 [currently] A single extractor engine
    Based on the specification, data extraction is processed by
     a single, generic MapReduce job.
Extraction Platform
                   Main Concepts
Important: typing (We care about types and schemas!)
 Input and Output Pipes
    Declare data source (e.g., a HBase collection) and data
     sinks (e.g, Hbase, HDFS, csv, …)
 Filters (Boolean operators that apply to input data)
 Extractors
    Takes an input Resource, produce Features
 Views
    Combination of input and output pipes, filters and
     extractors
Data Queries
• Various data sources (HTable, data files,…)
• Projections using column families and qualifiers
• Selections by HBase filters
FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
Filter f1 = new SingleColumnValueFilter(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"),
CompareFilter.CompareOp.EQUAL, Bytes.toBytes(“text/html”));
ret.addFilter(f1);

• Query results either flushed to files or back to
  HBase –> materialized views
• Views are defined per collection as a set of pairs
  of Extractors (user defined function) and Filters
Query Language (DML)
• For each employee with salary higher than 2,000,
  compute total costs
  – SELECT f(hire_date, salary) FROM employees WHERE
    salary >= 2000
  – f(hire_date, salary)= mon(today-hire_date)*salary
• For each web page detect mime type, language
  – For each RSS feed, get summary,
  – For each HTML page extract plain text
• Currently a wizzard producing a JSON doc
User functions
• List<byte[]> f(Row r)
• May calculate new attribute values, stored
  with Row and reused by other function
• Execution plan: order matters!
• Function associates description of input and
  output fields
  – Fields dependencies give order
Input Pipes
• Defines how to get the data
  – Archive files, Text files, HBase table
  – Format
  – The Mappers have always Resource on input,
    several custom InputFormats and RecordReaders
Output pipes
• Defines what to do with (where to store) the
  query result
  – File, Table
  – Format
  – What columns to materialize
• Most of the time PutSortReducer used so OP
  defines OutputFormat and what to emit
Query Processing
• With typed data, Resource wrapper and IPs and OPs – one
  universal MapReduce job to execute/process queries!!
• Most efficient (for data insertion): MapReduce job with
  Custom Mapper and PutSortReducer
• Job init: build combined filter, IP and OP to define input and
  output formats
• Mapper set-up: Init plan of user function applications, init
  of functions themselves
• Mapper map: apply functions on a row according to plan,
  use OP to emit values
• Not all combinations can leverage the PutSortReducer
  (writing to one table at a time, …)
Query Processing II
Archive
  File


HBase




                       Reduce
                                             HBase
             Map
 Data
 File                                         File




            Views               Co-scanner
Data Subscriptions / Views
• Data-in-a-View satisfaction can be checked at
  the ingestion time, before the data is inserted
• Mimicking client side co-processors – allowing
  the use of bulk loading (no coprocessors for
  bulk load in the moment)
• When new data arrives, user functions/actions
  are triggered
  – On demand crawls, focused crawls
Second Level Triggers
• User code ran in the reduce phase (when
  ingesting):
  – Put f(Put p, Row r)
  – Previous versions on the input of the code Can alter
    the processed Put object before final flush to HFile
• Co-scanner: User-defined scanner traversing the
  processed region aligned with the keys in the
  created HFile
• Example: Change detection on a re-crawled
  resource
Aggregation Queries
• Compute frequency of mime types in the
  collection
• For a web domain, compute spammicity using
  word distribution and out-links
• Based on a two-step aggregation process
  – Data is extracted in the Map(), emitted with the
    View signature
  – Data is collected in the Reduce(), grouped on the
    combination (view sig., aggr. key) and aggregated.
Aggregation Queries Processing
• Processed using MapReduce, multiple
  (compatible) aggregations at once (reading is
  the most expensive)
• Aggregation map phase: List<Pairs> map(Row
  r), Pair=<Agg_key, value>, Agg_key=<agg_id,
  agg_on_value,…>
• Aggregation reduce phase: reduce(Agg_key,
  values, context)
Aggregation Processing Numbers
Compute mime type distribution of web pages per PLD:
SELECT pld, mime, count(*) FROM web_collection
      GROUP BY extract_pld(url), mime
Data Ingestion
Our crawler asynchronously writes to Hbase:

Input: archive files (ARC, WARC) in HDFS
Output: Htable

SELECT *, f1(*),..Fn(*) FROM hdfs://path/*.warc.gz


1.    Pre-compute the split region boundaries on a data sample
     –    MapReduce on a data input sample
2.    Process batch (~0.5TB) MapReduce ingestion
3.    Split manually too big regions (or candidates)
4.    If there is still input go to 2.
Data Ingestion Numbers
• Store indexable web resources from WARC
  files to HBase, detect mime type, language,
  extract plain text and analyze RSS feeds
• Reaching steady 40MB/s including extraction
• Upper bound 170MB/s (distributed reading of
  archive files in HDFS)
• HBase is idle most of the time!
  – Allows compacting store files in the meanwhile
Web Archive Collection
• Column families (basic set):
  1. Raw content (payload)
  2. Meta data (mime, IP, …)
  3. Baseline analytics (plain text, detected mime, …)
• Usually one another CF per analytics result
• CFs as secondary indices:
  – All analyzed feeds at one place (no need for filter
    if I am interested in all such rows)
Web Archive Collection II
• More than 3,000 regions (in one collection)
• 12TB of compressed of indexable data (and
  counting)
• Crawl to store/process machine ratio is 1:1.2
• Storage scales out
HW Architecture
• Tens of small low-consumption nodes with a lot
  of disk space:
  – 15TB per node, 8GB RAM, dual core CPU
  – No enclosure -> no active cooling -> no expensive
    datacenter-ish environment needed
• Low per PB storage price (70 nodes/PB), car
  batteries as UPS, commodity (real low-cost) HW
  (esp. disks)
• Still some reasonable computational power
Conclusions
• Conclusions
  – Data refinery platform
  – Customizable, extensible
  – Large scale
• Future work
  – Incorporating external secondary indices to filter
    HBase rows/cells
     • Full text index filtering
     • Temporal filtering
  – Larger (100s TBs) scale deployment
Acknowledgments
• European Union projects:
  – LAWA: Longitudinal Analytics of Web Archive data
  – SCAPE - SCAlable Preservation Environments

More Related Content

What's hot

Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHBaseCon
 
Data Evolution in HBase
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBaseHBaseCon
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...Cloudera, Inc.
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_finalasterix_smartplatf
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseCloudera, Inc.
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
HBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicasenissoz
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseCloudera, Inc.
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
Apache Spark on Apache HBase: Current and Future
Apache Spark on Apache HBase: Current and Future Apache Spark on Apache HBase: Current and Future
Apache Spark on Apache HBase: Current and Future HBaseCon
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseCloudera, Inc.
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase Cloudera, Inc.
 

What's hot (20)

Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
 
Data Evolution in HBase
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBase
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
HBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBase
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
Apache Spark on Apache HBase: Current and Future
Apache Spark on Apache HBase: Current and Future Apache Spark on Apache HBase: Current and Future
Apache Spark on Apache HBase: Current and Future
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
 
Apache phoenix
Apache phoenixApache phoenix
Apache phoenix
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
 

Viewers also liked

HBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large MatricesHBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large MatricesCloudera, Inc.
 
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponHBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponCloudera, Inc.
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterCloudera, Inc.
 
HBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on FlashHBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on FlashCloudera, Inc.
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.Cloudera, Inc.
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...Cloudera, Inc.
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera FieldHBaseCon
 
HBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseCloudera, Inc.
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBaseHBaseCon
 
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARNHBaseCon
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseCloudera, Inc.
 
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!Cloudera, Inc.
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsCloudera, Inc.
 
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics Cloudera, Inc.
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...Cloudera, Inc.
 

Viewers also liked (20)

HBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large MatricesHBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large Matrices
 
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponHBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
 
HBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on FlashHBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on Flash
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBase
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
 
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
 
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three Acts
 
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
 

Similar to HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research

Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...HBaseCon
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...Chris Huang
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainKamal A
 
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...Yahoo Developer Network
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONUJerome Boulon
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 

Similar to HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research (20)

מיכאל
מיכאלמיכאל
מיכאל
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
HBase ArcheTypes
HBase ArcheTypesHBase ArcheTypes
HBase ArcheTypes
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domain
 
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONU
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 

Recently uploaded (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 

HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research

  • 1. A Big Data Refinery Built on HBase Stanislav Barton Internet Memory Research
  • 2. A Content-oriented platform • Mignify = a content-oriented platform which – Continuously (almost) ingests Web documents – Stores and preserves these documents as such AND – Produces structured content extracted (from single documents) and aggregated (from groups of documents) – Organizes both raw documents, extracted and aggregated information in a consistent information space. => Original documents and extracted information are uniformly stored in HBase
  • 3. A Service-oriented platform • Mignify = a service-oriented platform – Physical storage layer on “custom hardware” – Crawl on-demand, with sophisticated navigation options – Able to host third-party extractors, aggregators and classifiers – Run new algorithms on existing collections as they arrive – Supports search, navigation, on-demand query and extraction => A « Web Observatory » built around Hadoop/Hbase. –
  • 4. Customers/Users • Web Archivists – store and organize, live search • Search engineers – organize and refine, big throughput • Data miners – refine, secondary indices • Researchers – store, organize and/or refine
  • 5. Talk Outline • Architecture Overview • Use Cases/Scenarios • Data model • Queries/ Query language / Query processing • Usage Examples • Alternative HW platform
  • 7. Typical scenarios • Full text indexers – Collect documents, with metadata and graph information • Wrapper extraction – Get structured information from web sites • Entity annotation – Annotate documents with entity references • Classification – Aggregate subsets (e.g., domains); assign them to topics
  • 8. Data in HBase • HBase as a first choice data store: – Inherent versioning (timestamped values) – Real time access (Cache, index, key ordering) – Column oriented storage – Seamless integration with Hadoop – Big community – Production ready/mature implementation
  • 9. Data model • Data stored in one big table along with metadata and extraction results • Though separated in column families – CFs as secondary indices • Raw data stored in HBase ( < 10 MB, HDFS otherwise) • Data stored as rows (versioned) • Values are typed
  • 10. Types and schema, where and why • Initially, our collections consists of purely non- structured data • All the point for Mignify is to produce a backbone of structured information, • Done through an iterative process which progressively builds a rich set of interrelated annotations • Typing = important to ensure the safety of the whole process, automation
  • 11. Data model II A= <CF, Qualifier>:Type Schema Type:Writable or byte[] Resource Version1:(A1.. Ak), Version2:(Ak+1, … Am),… mignify Versions <t’,{V1, … Vk}>, <t’’,{Vk+1, … Vm}>,…<t’’’, Vm+1,… Vl> <(CF1,Qa,t’),v1>,<(CF1,Qb,t’’),v2>, … <(CFn,Qz,t’’’),vm> HTable v1,… vm:byte[] HBase HFiles CF1 CF2 CFn
  • 12. Extraction Platform Main Principles  A framework for specifying data extraction from very large datasets  Easy integration and application of new extractors  High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks  An extraction process specification combines these elements  [currently] A single extractor engine  Based on the specification, data extraction is processed by a single, generic MapReduce job.
  • 13. Extraction Platform Main Concepts Important: typing (We care about types and schemas!)  Input and Output Pipes  Declare data source (e.g., a HBase collection) and data sinks (e.g, Hbase, HDFS, csv, …)  Filters (Boolean operators that apply to input data)  Extractors  Takes an input Resource, produce Features  Views  Combination of input and output pipes, filters and extractors
  • 14. Data Queries • Various data sources (HTable, data files,…) • Projections using column families and qualifiers • Selections by HBase filters FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE); Filter f1 = new SingleColumnValueFilter(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(“text/html”)); ret.addFilter(f1); • Query results either flushed to files or back to HBase –> materialized views • Views are defined per collection as a set of pairs of Extractors (user defined function) and Filters
  • 15. Query Language (DML) • For each employee with salary higher than 2,000, compute total costs – SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000 – f(hire_date, salary)= mon(today-hire_date)*salary • For each web page detect mime type, language – For each RSS feed, get summary, – For each HTML page extract plain text • Currently a wizzard producing a JSON doc
  • 16. User functions • List<byte[]> f(Row r) • May calculate new attribute values, stored with Row and reused by other function • Execution plan: order matters! • Function associates description of input and output fields – Fields dependencies give order
  • 17. Input Pipes • Defines how to get the data – Archive files, Text files, HBase table – Format – The Mappers have always Resource on input, several custom InputFormats and RecordReaders
  • 18. Output pipes • Defines what to do with (where to store) the query result – File, Table – Format – What columns to materialize • Most of the time PutSortReducer used so OP defines OutputFormat and what to emit
  • 19. Query Processing • With typed data, Resource wrapper and IPs and OPs – one universal MapReduce job to execute/process queries!! • Most efficient (for data insertion): MapReduce job with Custom Mapper and PutSortReducer • Job init: build combined filter, IP and OP to define input and output formats • Mapper set-up: Init plan of user function applications, init of functions themselves • Mapper map: apply functions on a row according to plan, use OP to emit values • Not all combinations can leverage the PutSortReducer (writing to one table at a time, …)
  • 20. Query Processing II Archive File HBase Reduce HBase Map Data File File Views Co-scanner
  • 21. Data Subscriptions / Views • Data-in-a-View satisfaction can be checked at the ingestion time, before the data is inserted • Mimicking client side co-processors – allowing the use of bulk loading (no coprocessors for bulk load in the moment) • When new data arrives, user functions/actions are triggered – On demand crawls, focused crawls
  • 22. Second Level Triggers • User code ran in the reduce phase (when ingesting): – Put f(Put p, Row r) – Previous versions on the input of the code Can alter the processed Put object before final flush to HFile • Co-scanner: User-defined scanner traversing the processed region aligned with the keys in the created HFile • Example: Change detection on a re-crawled resource
  • 23. Aggregation Queries • Compute frequency of mime types in the collection • For a web domain, compute spammicity using word distribution and out-links • Based on a two-step aggregation process – Data is extracted in the Map(), emitted with the View signature – Data is collected in the Reduce(), grouped on the combination (view sig., aggr. key) and aggregated.
  • 24. Aggregation Queries Processing • Processed using MapReduce, multiple (compatible) aggregations at once (reading is the most expensive) • Aggregation map phase: List<Pairs> map(Row r), Pair=<Agg_key, value>, Agg_key=<agg_id, agg_on_value,…> • Aggregation reduce phase: reduce(Agg_key, values, context)
  • 25. Aggregation Processing Numbers Compute mime type distribution of web pages per PLD: SELECT pld, mime, count(*) FROM web_collection GROUP BY extract_pld(url), mime
  • 26. Data Ingestion Our crawler asynchronously writes to Hbase: Input: archive files (ARC, WARC) in HDFS Output: Htable SELECT *, f1(*),..Fn(*) FROM hdfs://path/*.warc.gz 1. Pre-compute the split region boundaries on a data sample – MapReduce on a data input sample 2. Process batch (~0.5TB) MapReduce ingestion 3. Split manually too big regions (or candidates) 4. If there is still input go to 2.
  • 27. Data Ingestion Numbers • Store indexable web resources from WARC files to HBase, detect mime type, language, extract plain text and analyze RSS feeds • Reaching steady 40MB/s including extraction • Upper bound 170MB/s (distributed reading of archive files in HDFS) • HBase is idle most of the time! – Allows compacting store files in the meanwhile
  • 28. Web Archive Collection • Column families (basic set): 1. Raw content (payload) 2. Meta data (mime, IP, …) 3. Baseline analytics (plain text, detected mime, …) • Usually one another CF per analytics result • CFs as secondary indices: – All analyzed feeds at one place (no need for filter if I am interested in all such rows)
  • 29. Web Archive Collection II • More than 3,000 regions (in one collection) • 12TB of compressed of indexable data (and counting) • Crawl to store/process machine ratio is 1:1.2 • Storage scales out
  • 30. HW Architecture • Tens of small low-consumption nodes with a lot of disk space: – 15TB per node, 8GB RAM, dual core CPU – No enclosure -> no active cooling -> no expensive datacenter-ish environment needed • Low per PB storage price (70 nodes/PB), car batteries as UPS, commodity (real low-cost) HW (esp. disks) • Still some reasonable computational power
  • 31. Conclusions • Conclusions – Data refinery platform – Customizable, extensible – Large scale • Future work – Incorporating external secondary indices to filter HBase rows/cells • Full text index filtering • Temporal filtering – Larger (100s TBs) scale deployment
  • 32. Acknowledgments • European Union projects: – LAWA: Longitudinal Analytics of Web Archive data – SCAPE - SCAlable Preservation Environments