A Big Data Refinery Built on HBase

        Stanislav Barton
   Internet Memory Research
A Content-oriented platform
• Mignify = a content-oriented platform that
    – (Almost) continuously ingests Web documents
    – Stores and preserves these documents as such AND
    – Produces structured content extracted (from single
      documents) and aggregated (from groups of documents)
    – Organizes raw documents and the extracted and aggregated
      information in a consistent information space
=> Original documents and extracted information are uniformly
   stored in HBase
A Service-oriented platform
• Mignify = a service-oriented platform
  – Physical storage layer on “custom hardware”
  – Crawl on-demand, with sophisticated navigation options
  – Able to host third-party extractors, aggregators and
    classifiers
  – Run new algorithms on existing collections as they arrive
  – Supports search, navigation, on-demand query and
    extraction
=> A « Web Observatory » built around Hadoop/HBase.



Customers/Users
• Web Archivists – store and organize, live
  search
• Search engineers – organize and refine, big
  throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine
Talk Outline
•   Architecture Overview
•   Use Cases/Scenarios
•   Data model
•   Queries/ Query language / Query processing
•   Usage Examples
•   Alternative HW platform
Overview of Mignify
Typical scenarios
• Full text indexers
   – Collect documents, with metadata and graph
     information
• Wrapper extraction
   – Get structured information from web sites
• Entity annotation
   – Annotate documents with entity references
• Classification
   – Aggregate subsets (e.g., domains); assign them to
     topics
Data in HBase
• HBase as a first choice data store:
  – Inherent versioning (timestamped values)
  – Real time access (Cache, index, key ordering)
  – Column oriented storage
  – Seamless integration with Hadoop
  – Big community
  – Production ready/mature implementation
Data model
• Data stored in one big table along with
  metadata and extraction results
• Though separated into column families
  – CFs act as secondary indices
• Raw data stored in HBase (< 10 MB; larger objects go
  to HDFS)
• Data stored as rows (versioned, see the read sketch below)
• Values are typed
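A minimal sketch (not from the original deck) of how such versioned rows are read with the plain HBase client API; the table name, row key and column names are illustrative placeholders:

// Versioned access to one resource row: each cell keeps its timestamp, so
// re-crawls of the same URL become additional versions rather than overwrites.
public static void readVersions(Configuration conf, byte[] rowKey) throws IOException {
  HTable table = new HTable(conf, "web_collection");
  Get get = new Get(rowKey);
  get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"));
  get.setMaxVersions(5);                          // fetch up to 5 timestamped versions
  Result result = table.get(get);
  for (KeyValue kv : result.getColumn(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"))) {
    long crawlTime = kv.getTimestamp();           // version timestamp = crawl time
    byte[] mime = kv.getValue();                  // value of this particular version
  }
  table.close();
}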
Types and schema,
               where and why
• Initially, our collections consist of purely
  unstructured data
• The whole point of Mignify is to produce a
  backbone of structured information
• Done through an iterative process that
  progressively builds a rich set of interrelated
  annotations
• Typing is important to ensure the safety of the
  whole process and its automation
Data model II
(Diagram: mapping of the Mignify data model onto HBase)
• Mignify layer:
  – Schema: A = <CF, Qualifier>:Type, with Type a Writable or byte[]
  – Resource: Version1:(A1 … Ak), Version2:(Ak+1 … Am), …
  – Versions: <t’, {V1 … Vk}>, <t’’, {Vk+1 … Vm}>, … <t’’’, {Vm+1 … Vl}>
• HBase layer:
  – HTable: <(CF1,Qa,t’),v1>, <(CF1,Qb,t’’),v2>, … <(CFn,Qz,t’’’),vm>, with v1 … vm : byte[]
  – HFiles: one store per column family (CF1, CF2, …, CFn)
Extraction Platform
                   Main Principles
• A framework for specifying data extraction from very
  large datasets
  – Easy integration and application of new extractors
• High level of genericity in terms of (i) data sources,
  (ii) extractors, and (iii) data sinks
  – An extraction process specification combines these
    elements
• [currently] A single extractor engine
  – Based on the specification, data extraction is processed by
    a single, generic MapReduce job.
Extraction Platform
                   Main Concepts
Important: typing (We care about types and schemas!)
• Input and Output Pipes
  – Declare the data source (e.g., an HBase collection) and data
    sinks (e.g., HBase, HDFS, CSV, …)
• Filters (Boolean operators that apply to input data)
• Extractors
  – Take an input Resource, produce Features
• Views
  – Combination of input and output pipes, filters and
    extractors
Data Queries
• Various data sources (HTable, data files,…)
• Projections using column families and qualifiers
• Selections by HBase filters
// OR-combined filter list: a row passes if any of its filters matches
FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
Filter f1 = new SingleColumnValueFilter(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html"));
ret.addFilter(f1);

• Query results are either flushed to files or back to
  HBase -> materialized views
• Views are defined per collection as a set of pairs
  of Extractors (user-defined functions) and Filters
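As a hedged illustration (not shown in the deck), the filter list built above can be attached to a client-side Scan; the table name and column family are placeholders:

// Projection (addFamily) plus selection (setFilter) over the collection table.
public static void scanHtmlResources(Configuration conf, Filter filter) throws IOException {
  HTable table = new HTable(conf, "web_collection");
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("meta"));          // projection on the metadata CF
  scan.setFilter(filter);                         // e.g. the FilterList built above
  ResultScanner scanner = table.getScanner(scan);
  for (Result row : scanner) {
    // each Result is a filtered, projected row of the collection
  }
  scanner.close();
  table.close();
}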
Query Language (DML)
• For each employee with a salary of at least 2,000,
  compute total costs
  – SELECT f(hire_date, salary) FROM employees WHERE
    salary >= 2000
  – f(hire_date, salary) = mon(today - hire_date) * salary
• For each web page, detect mime type and language
  – For each RSS feed, get a summary
  – For each HTML page, extract plain text
• Currently a wizard producing a JSON doc
User functions
• List<byte[]> f(Row r) (see the interface sketch below)
• May calculate new attribute values, stored
  with the Row and reused by other functions
• Execution plan: order matters!
• Each function carries a description of its input and
  output fields
  – Field dependencies determine the order
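A minimal sketch of what a user function could look like under the List<byte[]> f(Row r) signature; UserFunction, Field and Row are hypothetical placeholders, not the actual Mignify API:

public interface UserFunction {
  // Fields this function reads; dependencies between functions determine
  // the order of applications in the execution plan.
  List<Field> inputFields();

  // Typed fields this function produces, so that downstream functions and
  // output pipes can consume them safely.
  List<Field> outputFields();

  // Compute new attribute values for one row; the results are stored with
  // the Row and can be reused by later functions in the plan.
  List<byte[]> apply(Row r);
}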
Input Pipes
• Define how to get the data
  – Archive files, text files, HBase table
  – Format
  – Mappers always receive a Resource on input; several
    custom InputFormats and RecordReaders (sketch below)
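For the HBase-table case, an input pipe essentially boils down to TableInputFormat; a hedged sketch of the wiring (the table name and the ResourceMapper class are hypothetical):

// An HBase-backed input pipe configured through TableMapReduceUtil.
public static void configureHBaseInput(Job job, Scan scan) throws IOException {
  scan.setCaching(500);            // batch more rows per RPC for scan-heavy MR jobs
  scan.setCacheBlocks(false);      // keep MR scans out of the region server block cache
  TableMapReduceUtil.initTableMapperJob(
      "web_collection",            // source table (placeholder name)
      scan,                        // carries the projections and filters of the query
      ResourceMapper.class,        // hypothetical mapper wrapping each Result into a Resource
      ImmutableBytesWritable.class,
      Put.class,
      job);
}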
Output pipes
• Define what to do with (where to store) the
  query result
  – File, Table
  – Format
  – What columns to materialize
• Most of the time PutSortReducer is used, so the OP
  defines the OutputFormat and what to emit (sketch below)
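For a table output pipe fed by bulk loading, the standard helper wires in the PutSortReducer; a hedged sketch (the table name and HFile staging path are placeholders):

// HFileOutputFormat.configureIncrementalLoad() plugs in PutSortReducer and a
// partitioner aligned with the current region boundaries of the target table.
public static void configureBulkOutput(Job job, Configuration conf) throws IOException {
  HTable outputTable = new HTable(conf, "web_collection");
  job.setMapOutputKeyClass(ImmutableBytesWritable.class);
  job.setMapOutputValueClass(Put.class);           // Put values select the PutSortReducer
  HFileOutputFormat.configureIncrementalLoad(job, outputTable);
  FileOutputFormat.setOutputPath(job, new Path("/tmp/mignify-hfiles"));
}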
Query Processing
• With typed data, the Resource wrapper, and IPs and OPs, one
  universal MapReduce job executes/processes all queries!
• Most efficient (for data insertion): MapReduce job with a
  custom Mapper and PutSortReducer
• Job init: build the combined filter; IP and OP define the input
  and output formats
• Mapper set-up: initialize the plan of user-function applications
  and the functions themselves
• Mapper map: apply functions to a row according to the plan,
  use the OP to emit values (see the mapper sketch below)
• Not all combinations can leverage the PutSortReducer
  (writing to one table at a time, …)
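A hedged sketch of what the single generic mapper could look like; ExecutionPlan, OutputPipe, Row and UserFunction are hypothetical stand-ins for the Mignify internals described above:

public class ExtractionMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private ExecutionPlan plan;      // hypothetical: ordered user-function applications
  private OutputPipe outputPipe;   // hypothetical: decides what to materialize and emit

  @Override
  protected void setup(Context context) {
    // Mapper set-up: build the application plan and initialize the functions themselves
    plan = ExecutionPlan.fromConfiguration(context.getConfiguration());
    outputPipe = OutputPipe.fromConfiguration(context.getConfiguration());
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    Row row = Row.wrap(value);                      // typed wrapper over the raw HBase cells
    for (UserFunction f : plan.orderedFunctions()) {
      row.store(f.outputFields(), f.apply(row));    // results are reusable by later functions
    }
    outputPipe.emit(key, row, context);             // e.g. one Put per processed resource
  }
}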
Query Processing II
(Diagram: query processing data flow) Archive files, HBase tables and data
files feed the Map phase; the Reduce phase writes results to HBase and/or
to files. Views and the Co-scanner hook into this pipeline.
Data Subscriptions / Views
• Data-in-a-View satisfaction can be checked at
  ingestion time, before the data is inserted
• Mimicking client-side co-processors; this keeps bulk
  loading usable (coprocessors are currently not
  triggered by bulk loads)
• When new data arrives, user functions/actions
  are triggered
  – On-demand crawls, focused crawls
Second Level Triggers
• User code run in the reduce phase (when
  ingesting):
  – Put f(Put p, Row r)
  – Previous versions are available on input; the code can
    alter the processed Put object before the final flush to the HFile
• Co-scanner: a user-defined scanner traversing the
  processed region, aligned with the keys in the
  created HFile
• Example: change detection on a re-crawled
  resource (sketched below)
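A minimal sketch of the change-detection example under the Put f(Put p, Row r) contract; Row, latestValue() and payloadOf() are hypothetical helpers, while Put.add() and Bytes come from the standard client API:

public abstract class ChangeDetectionTrigger {

  // r carries the previously stored versions; p is the Put currently being ingested.
  public Put apply(Put p, Row r) {
    byte[] newPayload = payloadOf(p);                         // hypothetical: payload carried by the new Put
    byte[] oldPayload = r.latestValue("content", "payload");  // hypothetical: newest stored version
    boolean changed = !Arrays.equals(newPayload, oldPayload);
    // annotate the Put before it is finally flushed to the HFile
    p.add(Bytes.toBytes("analytics"), Bytes.toBytes("contentChanged"), Bytes.toBytes(changed));
    return p;
  }

  // Hypothetical helper: extract the raw payload bytes from the incoming Put.
  protected abstract byte[] payloadOf(Put p);
}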
Aggregation Queries
• Compute frequency of mime types in the
  collection
• For a web domain, compute spammicity using
  word distribution and out-links
• Based on a two-step aggregation process
  – Data is extracted in the Map(), emitted with the
    View signature
  – Data is collected in the Reduce(), grouped on the
    combination (view sig., aggr. key) and aggregated.
Aggregation Queries Processing
• Processed using MapReduce; multiple
  (compatible) aggregations run at once (reading is
  the most expensive part)
• Aggregation map phase: List<Pair> map(Row r), with
  Pair = <Agg_key, value> and Agg_key = <agg_id,
  agg_on_value, …>
• Aggregation reduce phase: reduce(Agg_key,
  values, context) (see the sketch below)
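A hedged sketch of one aggregation plugged into this two-step scheme, reusing the mime-per-PLD example; Pair, AggKey, Row and the accessor methods are hypothetical placeholders:

public abstract class MimePerPldAggregation {

  // Map side: one <Agg_key, value> pair per resource, tagged with the aggregation id.
  public List<Pair<AggKey, Long>> map(Row r) {
    String pld  = extractPld(r.stringValue("meta", "url"));   // hypothetical accessors
    String mime = r.stringValue("meta", "detectedMime");
    AggKey key = new AggKey("mime_per_pld", pld, mime);       // <agg_id, agg_on_value, …>
    return Collections.singletonList(new Pair<AggKey, Long>(key, 1L));
  }

  // Reduce side: values are grouped on (view signature, Agg_key); here a simple count.
  public long reduce(AggKey key, Iterable<Long> values) {
    long count = 0;
    for (Long v : values) count += v;
    return count;
  }

  // Hypothetical helper: pay-level domain of a URL (e.g. "example.co.uk").
  protected abstract String extractPld(String url);
}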
Aggregation Processing Numbers
Compute the mime type distribution of web pages per PLD:
SELECT extract_pld(url) AS pld, mime, count(*) FROM web_collection
      GROUP BY extract_pld(url), mime
Data Ingestion
Our crawler asynchronously writes to HBase:

Input: archive files (ARC, WARC) in HDFS
Output: HTable

SELECT *, f1(*), …, fn(*) FROM hdfs://path/*.warc.gz


1.    Pre-compute the region split boundaries on a data sample
     –    MapReduce on a sample of the input data (pre-splitting is sketched below)
2.    Process a batch (~0.5 TB) with a MapReduce ingestion job
3.    Manually split regions that are too big (or are candidates to become too big)
4.    If there is still input, go to 2.
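A hedged sketch of the pre-splitting in step 1: once the sampling MapReduce job has produced the boundary keys, the target table is created pre-split on them (the table and column family names and the splitKeys source are placeholders):

// Create the target table pre-split on boundaries sampled from the input, so
// that the bulk MapReduce ingestion spreads evenly over the region servers.
public static void createPreSplitTable(Configuration conf, byte[][] splitKeys) throws IOException {
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("web_collection");
  desc.addFamily(new HColumnDescriptor("content"));
  desc.addFamily(new HColumnDescriptor("meta"));
  admin.createTable(desc, splitKeys);    // one region per interval between boundary keys
  // step 3 can later use admin.split(regionName) on regions that grew too big
  admin.close();
}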
Data Ingestion Numbers
• Store indexable web resources from WARC
  files to HBase, detect mime type and language,
  extract plain text and analyze RSS feeds
• Reaching a steady 40 MB/s including extraction
• Upper bound 170 MB/s (distributed reading of
  archive files in HDFS)
• HBase is idle most of the time!
  – Allows compacting store files in the meantime
Web Archive Collection
• Column families (basic set):
  1. Raw content (payload)
  2. Meta data (mime, IP, …)
  3. Baseline analytics (plain text, detected mime, …)
• Usually one additional CF per analytics result
• CFs as secondary indices:
  – All analyzed feeds in one place (no filter needed
    when all such rows are of interest)
Web Archive Collection II
• More than 3,000 regions (in one collection)
• 12 TB of compressed indexable data (and
  counting)
• Crawl to store/process machine ratio is 1:1.2
• Storage scales out
HW Architecture
• Tens of small low-consumption nodes with a lot
  of disk space:
  – 15TB per node, 8GB RAM, dual core CPU
  – No enclosure -> no active cooling -> no expensive
    datacenter-ish environment needed
• Low per-PB storage price (70 nodes/PB), car
  batteries as UPS, commodity (truly low-cost) HW
  (esp. disks)
• Still some reasonable computational power
Conclusions
• Conclusions
  – Data refinery platform
  – Customizable, extensible
  – Large scale
• Future work
  – Incorporating external secondary indices to filter
    HBase rows/cells
     • Full text index filtering
     • Temporal filtering
– Larger-scale (100s of TBs) deployment
Acknowledgments
• European Union projects:
  – LAWA: Longitudinal Analytics of Web Archive data
– SCAPE: SCAlable Preservation Environments
