A Big Data Refinery Built on HBase

        Stanislav Barton
   Internet Memory Research
A Content-oriented platform
• Mignify = a content-oriented platform that
    – (Almost) continuously ingests Web documents
    – Stores and preserves these documents as such AND
    – Produces structured content extracted (from single
      documents) and aggregated (from groups of documents)
    – Organizes raw documents and the extracted and aggregated
      information in a consistent information space
=> Original documents and extracted information are uniformly
   stored in HBase
A Service-oriented platform
• Mignify = a service-oriented platform
  – Physical storage layer on “custom hardware”
  – Crawl on-demand, with sophisticated navigation options
  – Able to host third-party extractors, aggregators and
    classifiers
  – Run new algorithms on existing collections as they arrive
  – Supports search, navigation, on-demand query and
    extraction
=> A « Web Observatory » built around Hadoop/HBase.



Customers/Users
• Web Archivists – store and organize, live
  search
• Search engineers – organize and refine, big
  throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine
Talk Outline
•   Architecture Overview
•   Use Cases/Scenarios
•   Data model
•   Queries/ Query language / Query processing
•   Usage Examples
•   Alternative HW platform
Overview of Mignify
Typical scenarios
• Full text indexers
   – Collect documents, with metadata and graph
     information
• Wrapper extraction
   – Get structured information from web sites
• Entity annotation
   – Annotate documents with entity references
• Classification
   – Aggregate subsets (e.g., domains); assign them to
     topics
Data in HBase
• HBase as a first choice data store:
  – Inherent versioning (timestamped values)
  – Real time access (Cache, index, key ordering)
  – Column oriented storage
  – Seamless integration with Hadoop
  – Big community
  – Production ready/mature implementation
Data model
• Data stored in one big table along with
  metadata and extraction results
• Though separated into column families
  – CFs act as secondary indices
• Raw data stored in HBase (< 10 MB; larger objects go
  to HDFS)
• Data stored as rows (versioned, see the read sketch below)
• Values are typed
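A minimal sketch (not from the original deck) of how such versioned rows are read with the plain HBase client API; the table name, row key and column names are illustrative placeholders:

// Versioned access to one resource row: each cell keeps its timestamp, so
// re-crawls of the same URL become additional versions rather than overwrites.
public static void readVersions(Configuration conf, byte[] rowKey) throws IOException {
  HTable table = new HTable(conf, "web_collection");
  Get get = new Get(rowKey);
  get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"));
  get.setMaxVersions(5);                          // fetch up to 5 timestamped versions
  Result result = table.get(get);
  for (KeyValue kv : result.getColumn(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"))) {
    long crawlTime = kv.getTimestamp();           // version timestamp = crawl time
    byte[] mime = kv.getValue();                  // value of this particular version
  }
  table.close();
}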
Types and schema,
               where and why
• Initially, our collections consist of purely
  unstructured data
• The whole point of Mignify is to produce a
  backbone of structured information
• Done through an iterative process that
  progressively builds a rich set of interrelated
  annotations
• Typing is important to ensure the safety of the
  whole process and its automation
Data model II
(Diagram: mapping of the Mignify data model onto HBase)
• Mignify layer:
  – Schema: A = <CF, Qualifier>:Type, with Type a Writable or byte[]
  – Resource: Version1:(A1 … Ak), Version2:(Ak+1 … Am), …
  – Versions: <t’, {V1 … Vk}>, <t’’, {Vk+1 … Vm}>, … <t’’’, {Vm+1 … Vl}>
• HBase layer:
  – HTable: <(CF1,Qa,t’),v1>, <(CF1,Qb,t’’),v2>, … <(CFn,Qz,t’’’),vm>, with v1 … vm : byte[]
  – HFiles: one store per column family (CF1, CF2, …, CFn)
Extraction Platform
                   Main Principles
• A framework for specifying data extraction from very
  large datasets
  – Easy integration and application of new extractors
• High level of genericity in terms of (i) data sources,
  (ii) extractors, and (iii) data sinks
  – An extraction process specification combines these
    elements
• [currently] A single extractor engine
  – Based on the specification, data extraction is processed by
    a single, generic MapReduce job.
Extraction Platform
                   Main Concepts
Important: typing (We care about types and schemas!)
• Input and Output Pipes
  – Declare the data source (e.g., an HBase collection) and data
    sinks (e.g., HBase, HDFS, CSV, …)
• Filters (Boolean operators that apply to input data)
• Extractors
  – Take an input Resource, produce Features
• Views
  – Combination of input and output pipes, filters and
    extractors
Data Queries
• Various data sources (HTable, data files,…)
• Projections using column families and qualifiers
• Selections by HBase filters
// OR-combined filter list: a row passes if any of its filters matches
FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
Filter f1 = new SingleColumnValueFilter(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html"));
ret.addFilter(f1);

• Query results are either flushed to files or back to
  HBase -> materialized views
• Views are defined per collection as a set of pairs
  of Extractors (user-defined functions) and Filters
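As a hedged illustration (not shown in the deck), the filter list built above can be attached to a client-side Scan; the table name and column family are placeholders:

// Projection (addFamily) plus selection (setFilter) over the collection table.
public static void scanHtmlResources(Configuration conf, Filter filter) throws IOException {
  HTable table = new HTable(conf, "web_collection");
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("meta"));          // projection on the metadata CF
  scan.setFilter(filter);                         // e.g. the FilterList built above
  ResultScanner scanner = table.getScanner(scan);
  for (Result row : scanner) {
    // each Result is a filtered, projected row of the collection
  }
  scanner.close();
  table.close();
}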
Query Language (DML)
• For each employee with a salary of at least 2,000,
  compute total costs
  – SELECT f(hire_date, salary) FROM employees WHERE
    salary >= 2000
  – f(hire_date, salary) = mon(today - hire_date) * salary
• For each web page, detect mime type and language
  – For each RSS feed, get a summary
  – For each HTML page, extract plain text
• Currently a wizard producing a JSON doc
User functions
• List<byte[]> f(Row r) (see the interface sketch below)
• May calculate new attribute values, stored
  with the Row and reused by other functions
• Execution plan: order matters!
• Each function carries a description of its input and
  output fields
  – Field dependencies determine the order
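A minimal sketch of what a user function could look like under the List<byte[]> f(Row r) signature; UserFunction, Field and Row are hypothetical placeholders, not the actual Mignify API:

public interface UserFunction {
  // Fields this function reads; dependencies between functions determine
  // the order of applications in the execution plan.
  List<Field> inputFields();

  // Typed fields this function produces, so that downstream functions and
  // output pipes can consume them safely.
  List<Field> outputFields();

  // Compute new attribute values for one row; the results are stored with
  // the Row and can be reused by later functions in the plan.
  List<byte[]> apply(Row r);
}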
Input Pipes
• Define how to get the data
  – Archive files, text files, HBase table
  – Format
  – Mappers always receive a Resource on input; several
    custom InputFormats and RecordReaders (sketch below)
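For the HBase-table case, an input pipe essentially boils down to TableInputFormat; a hedged sketch of the wiring (the table name and the ResourceMapper class are hypothetical):

// An HBase-backed input pipe configured through TableMapReduceUtil.
public static void configureHBaseInput(Job job, Scan scan) throws IOException {
  scan.setCaching(500);            // batch more rows per RPC for scan-heavy MR jobs
  scan.setCacheBlocks(false);      // keep MR scans out of the region server block cache
  TableMapReduceUtil.initTableMapperJob(
      "web_collection",            // source table (placeholder name)
      scan,                        // carries the projections and filters of the query
      ResourceMapper.class,        // hypothetical mapper wrapping each Result into a Resource
      ImmutableBytesWritable.class,
      Put.class,
      job);
}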
Output pipes
• Define what to do with (where to store) the
  query result
  – File, Table
  – Format
  – What columns to materialize
• Most of the time PutSortReducer is used, so the OP
  defines the OutputFormat and what to emit (sketch below)
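For a table output pipe fed by bulk loading, the standard helper wires in the PutSortReducer; a hedged sketch (the table name and HFile staging path are placeholders):

// HFileOutputFormat.configureIncrementalLoad() plugs in PutSortReducer and a
// partitioner aligned with the current region boundaries of the target table.
public static void configureBulkOutput(Job job, Configuration conf) throws IOException {
  HTable outputTable = new HTable(conf, "web_collection");
  job.setMapOutputKeyClass(ImmutableBytesWritable.class);
  job.setMapOutputValueClass(Put.class);           // Put values select the PutSortReducer
  HFileOutputFormat.configureIncrementalLoad(job, outputTable);
  FileOutputFormat.setOutputPath(job, new Path("/tmp/mignify-hfiles"));
}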
Query Processing
• With typed data, the Resource wrapper, and IPs and OPs, one
  universal MapReduce job executes/processes all queries!
• Most efficient (for data insertion): MapReduce job with a
  custom Mapper and PutSortReducer
• Job init: build the combined filter; IP and OP define the input
  and output formats
• Mapper set-up: initialize the plan of user-function applications
  and the functions themselves
• Mapper map: apply functions to a row according to the plan,
  use the OP to emit values (see the mapper sketch below)
• Not all combinations can leverage the PutSortReducer
  (writing to one table at a time, …)
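A hedged sketch of what the single generic mapper could look like; ExecutionPlan, OutputPipe, Row and UserFunction are hypothetical stand-ins for the Mignify internals described above:

public class ExtractionMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private ExecutionPlan plan;      // hypothetical: ordered user-function applications
  private OutputPipe outputPipe;   // hypothetical: decides what to materialize and emit

  @Override
  protected void setup(Context context) {
    // Mapper set-up: build the application plan and initialize the functions themselves
    plan = ExecutionPlan.fromConfiguration(context.getConfiguration());
    outputPipe = OutputPipe.fromConfiguration(context.getConfiguration());
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    Row row = Row.wrap(value);                      // typed wrapper over the raw HBase cells
    for (UserFunction f : plan.orderedFunctions()) {
      row.store(f.outputFields(), f.apply(row));    // results are reusable by later functions
    }
    outputPipe.emit(key, row, context);             // e.g. one Put per processed resource
  }
}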
Query Processing II
(Diagram: query processing data flow) Archive files, HBase tables and data
files feed the Map phase; the Reduce phase writes results to HBase and/or
to files. Views and the Co-scanner hook into this pipeline.
Data Subscriptions / Views
• Data-in-a-View satisfaction can be checked at
  ingestion time, before the data is inserted
• Mimicking client-side co-processors; this keeps bulk
  loading usable (coprocessors are currently not
  triggered by bulk loads)
• When new data arrives, user functions/actions
  are triggered
  – On-demand crawls, focused crawls
Second Level Triggers
• User code run in the reduce phase (when
  ingesting):
  – Put f(Put p, Row r)
  – Previous versions are available on input; the code can
    alter the processed Put object before the final flush to the HFile
• Co-scanner: a user-defined scanner traversing the
  processed region, aligned with the keys in the
  created HFile
• Example: change detection on a re-crawled
  resource (sketched below)
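A minimal sketch of the change-detection example under the Put f(Put p, Row r) contract; Row, latestValue() and payloadOf() are hypothetical helpers, while Put.add() and Bytes come from the standard client API:

public abstract class ChangeDetectionTrigger {

  // r carries the previously stored versions; p is the Put currently being ingested.
  public Put apply(Put p, Row r) {
    byte[] newPayload = payloadOf(p);                         // hypothetical: payload carried by the new Put
    byte[] oldPayload = r.latestValue("content", "payload");  // hypothetical: newest stored version
    boolean changed = !Arrays.equals(newPayload, oldPayload);
    // annotate the Put before it is finally flushed to the HFile
    p.add(Bytes.toBytes("analytics"), Bytes.toBytes("contentChanged"), Bytes.toBytes(changed));
    return p;
  }

  // Hypothetical helper: extract the raw payload bytes from the incoming Put.
  protected abstract byte[] payloadOf(Put p);
}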
Aggregation Queries
• Compute frequency of mime types in the
  collection
• For a web domain, compute spammicity using
  word distribution and out-links
• Based on a two-step aggregation process
  – Data is extracted in the Map(), emitted with the
    View signature
  – Data is collected in the Reduce(), grouped on the
    combination (view sig., aggr. key) and aggregated.
Aggregation Queries Processing
• Processed using MapReduce; multiple
  (compatible) aggregations run at once (reading is
  the most expensive part)
• Aggregation map phase: List<Pair> map(Row r), with
  Pair = <Agg_key, value> and Agg_key = <agg_id,
  agg_on_value, …>
• Aggregation reduce phase: reduce(Agg_key,
  values, context) (see the sketch below)
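A hedged sketch of one aggregation plugged into this two-step scheme, reusing the mime-per-PLD example; Pair, AggKey, Row and the accessor methods are hypothetical placeholders:

public abstract class MimePerPldAggregation {

  // Map side: one <Agg_key, value> pair per resource, tagged with the aggregation id.
  public List<Pair<AggKey, Long>> map(Row r) {
    String pld  = extractPld(r.stringValue("meta", "url"));   // hypothetical accessors
    String mime = r.stringValue("meta", "detectedMime");
    AggKey key = new AggKey("mime_per_pld", pld, mime);       // <agg_id, agg_on_value, …>
    return Collections.singletonList(new Pair<AggKey, Long>(key, 1L));
  }

  // Reduce side: values are grouped on (view signature, Agg_key); here a simple count.
  public long reduce(AggKey key, Iterable<Long> values) {
    long count = 0;
    for (Long v : values) count += v;
    return count;
  }

  // Hypothetical helper: pay-level domain of a URL (e.g. "example.co.uk").
  protected abstract String extractPld(String url);
}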
Aggregation Processing Numbers
Compute the mime type distribution of web pages per PLD:
SELECT extract_pld(url) AS pld, mime, count(*) FROM web_collection
      GROUP BY extract_pld(url), mime
Data Ingestion
Our crawler asynchronously writes to HBase:

Input: archive files (ARC, WARC) in HDFS
Output: HTable

SELECT *, f1(*), …, fn(*) FROM hdfs://path/*.warc.gz


1.    Pre-compute the region split boundaries on a data sample
     –    MapReduce on a sample of the input data (pre-splitting is sketched below)
2.    Process a batch (~0.5 TB) with a MapReduce ingestion job
3.    Manually split regions that are too big (or are candidates to become too big)
4.    If there is still input, go to 2.
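A hedged sketch of the pre-splitting in step 1: once the sampling MapReduce job has produced the boundary keys, the target table is created pre-split on them (the table and column family names and the splitKeys source are placeholders):

// Create the target table pre-split on boundaries sampled from the input, so
// that the bulk MapReduce ingestion spreads evenly over the region servers.
public static void createPreSplitTable(Configuration conf, byte[][] splitKeys) throws IOException {
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("web_collection");
  desc.addFamily(new HColumnDescriptor("content"));
  desc.addFamily(new HColumnDescriptor("meta"));
  admin.createTable(desc, splitKeys);    // one region per interval between boundary keys
  // step 3 can later use admin.split(regionName) on regions that grew too big
  admin.close();
}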
Data Ingestion Numbers
• Store indexable web resources from WARC
  files to HBase, detect mime type and language,
  extract plain text and analyze RSS feeds
• Reaching a steady 40 MB/s including extraction
• Upper bound 170 MB/s (distributed reading of
  archive files in HDFS)
• HBase is idle most of the time!
  – Allows compacting store files in the meantime
Web Archive Collection
• Column families (basic set):
  1. Raw content (payload)
  2. Meta data (mime, IP, …)
  3. Baseline analytics (plain text, detected mime, …)
• Usually one additional CF per analytics result
• CFs as secondary indices:
  – All analyzed feeds in one place (no filter needed
    when all such rows are of interest)
Web Archive Collection II
• More than 3,000 regions (in one collection)
• 12 TB of compressed indexable data (and
  counting)
• Crawl to store/process machine ratio is 1:1.2
• Storage scales out
HW Architecture
• Tens of small low-consumption nodes with a lot
  of disk space:
  – 15TB per node, 8GB RAM, dual core CPU
  – No enclosure -> no active cooling -> no expensive
    datacenter-ish environment needed
• Low per-PB storage price (70 nodes/PB), car
  batteries as UPS, commodity (truly low-cost) HW
  (esp. disks)
• Still some reasonable computational power
Conclusions
• Conclusions
  – Data refinery platform
  – Customizable, extensible
  – Large scale
• Future work
  – Incorporating external secondary indices to filter
    HBase rows/cells
     • Full text index filtering
     • Temporal filtering
– Larger-scale (100s of TBs) deployment
Acknowledgments
• European Union projects:
  – LAWA: Longitudinal Analytics of Web Archive data
– SCAPE: SCAlable Preservation Environments
