HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research

  1. A Big Data Refinery Built on HBase Stanislav Barton Internet Memory Research
  2. A Content-oriented platform • Mignify = a content-oriented platform which – Continuously (almost) ingests Web documents – Stores and preserves these documents as such AND – Produces structured content extracted (from single documents) and aggregated (from groups of documents) – Organizes both raw documents, extracted and aggregated information in a consistent information space. => Original documents and extracted information are uniformly stored in HBase
  3. A Service-oriented platform • Mignify = a service-oriented platform – Physical storage layer on “custom hardware” – Crawl on-demand, with sophisticated navigation options – Able to host third-party extractors, aggregators and classifiers – Run new algorithms on existing collections as they arrive – Supports search, navigation, on-demand query and extraction => A « Web Observatory » built around Hadoop/HBase.
  4. Customers/Users • Web Archivists – store and organize, live search • Search engineers – organize and refine, big throughput • Data miners – refine, secondary indices • Researchers – store, organize and/or refine
  5. Talk Outline • Architecture Overview • Use Cases/Scenarios • Data model • Queries / Query language / Query processing • Usage Examples • Alternative HW platform
  6. Overview of Mignify
  7. Typical scenarios • Full text indexers – Collect documents, with metadata and graph information • Wrapper extraction – Get structured information from web sites • Entity annotation – Annotate documents with entity references • Classification – Aggregate subsets (e.g., domains); assign them to topics
  8. Data in HBase • HBase as a first choice data store: – Inherent versioning (timestamped values) – Real time access (Cache, index, key ordering) – Column oriented storage – Seamless integration with Hadoop – Big community – Production ready/mature implementation
  9. Data model • Data stored in one big table along with metadata and extraction results • Though separated in column families – CFs as secondary indices • Raw data stored in HBase ( < 10 MB, HDFS otherwise) • Data stored as rows (versioned) • Values are typed
  10. Types and schema, where and why • Initially, our collections consist of purely unstructured data • The whole point of Mignify is to produce a backbone of structured information • Done through an iterative process which progressively builds a rich set of interrelated annotations • Typing = important to ensure the safety of the whole process, automation
  11. Data model II • Schema (Mignify): A = <CF, Qualifier>:Type, where Type is Writable or byte[] • Resource (Mignify): Version1:(A1, …, Ak), Version2:(Ak+1, …, Am), … • Versions (Mignify): <t’, {V1, …, Vk}>, <t’’, {Vk+1, …, Vm}>, …, <t’’’, {Vm+1, …, Vl}> • HTable (HBase): <(CF1,Qa,t’),v1>, <(CF1,Qb,t’’),v2>, …, <(CFn,Qz,t’’’),vm>, with v1, …, vm: byte[] • HFiles (HBase): CF1, CF2, …, CFn
  12. Extraction Platform Main Principles • A framework for specifying data extraction from very large datasets • Easy integration and application of new extractors • High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks • An extraction process specification combines these elements • [currently] A single extractor engine • Based on the specification, data extraction is processed by a single, generic MapReduce job.
  13. Extraction Platform Main Concepts Important: typing (We care about types and schemas!) • Input and Output Pipes – Declare data sources (e.g., an HBase collection) and data sinks (e.g., HBase, HDFS, csv, …) • Filters (Boolean operators that apply to input data) • Extractors – Take an input Resource, produce Features • Views – Combination of input and output pipes, filters and extractors
  14. Data Queries • Various data sources (HTable, data files, …) • Projections using column families and qualifiers • Selections by HBase filters FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE); Filter f1 = new SingleColumnValueFilter(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html")); ret.addFilter(f1); • Query results either flushed to files or back to HBase -> materialized views • Views are defined per collection as a set of pairs of Extractors (user-defined functions) and Filters
  15. Query Language (DML) • For each employee with a salary of at least 2,000, compute total costs – SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000 – f(hire_date, salary) = mon(today-hire_date)*salary • For each web page detect mime type, language – For each RSS feed, get summary – For each HTML page extract plain text • Currently a wizard producing a JSON doc
  16. User functions • List<byte[]> f(Row r) • May calculate new attribute values, stored with the Row and reused by other functions • Execution plan: order matters! • Each function declares its input and output fields – Field dependencies give the order
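The ordering rule on this slide (field dependencies give the execution order) can be read as a topological sort over declared input and output fields. The following is a minimal sketch under that assumption; `FunctionPlanner`, `FunctionSpec`, and the field names are hypothetical and not part of Mignify's API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: derives an execution order for user functions
// from the fields each one declares as input and output.
final class FunctionPlanner {

    /** A user function described only by the fields it reads and writes. */
    static final class FunctionSpec {
        final String name;
        final List<String> inputs;
        final List<String> outputs;

        FunctionSpec(String name, List<String> inputs, List<String> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    /** Returns function names ordered so that producers run before consumers. */
    static List<String> plan(List<FunctionSpec> specs) {
        // Which function produces each field.
        Map<String, String> producerOf = new HashMap<>();
        for (FunctionSpec s : specs) {
            for (String out : s.outputs) {
                producerOf.put(out, s.name);
            }
        }

        // Dependency edges (producer -> consumers) and unmet-dependency counts.
        Map<String, Set<String>> consumers = new LinkedHashMap<>();
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (FunctionSpec s : specs) {
            consumers.put(s.name, new LinkedHashSet<>());
            pending.put(s.name, 0);
        }
        for (FunctionSpec s : specs) {
            for (String in : s.inputs) {
                String producer = producerOf.get(in);
                // Fields without a producer are assumed to be stored in the row already.
                if (producer != null && !producer.equals(s.name)
                        && consumers.get(producer).add(s.name)) {
                    pending.merge(s.name, 1, Integer::sum);
                }
            }
        }

        // Kahn's algorithm: repeatedly schedule functions with no unmet dependency.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String current = ready.poll();
            order.add(current);
            for (String consumer : consumers.get(current)) {
                if (pending.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        if (order.size() != specs.size()) {
            throw new IllegalStateException("Cyclic field dependencies between user functions");
        }
        return order;
    }
}
```

With this sketch, a language detector that reads a `plainText` field is scheduled after the text extractor that writes it, regardless of the order in which the functions were registered.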
  17. Input Pipes • Defines how to get the data – Archive files, Text files, HBase table – Format – The Mappers always receive a Resource as input, through several custom InputFormats and RecordReaders
  18. Output pipes • Defines what to do with (where to store) the query result – File, Table – Format – What columns to materialize • Most of the time PutSortReducer is used, so the OP defines the OutputFormat and what to emit
  19. Query Processing • With typed data, Resource wrapper and IPs and OPs – one universal MapReduce job to execute/process queries!! • Most efficient (for data insertion): MapReduce job with Custom Mapper and PutSortReducer • Job init: build combined filter, IP and OP to define input and output formats • Mapper set-up: Init plan of user function applications, init of functions themselves • Mapper map: apply functions on a row according to plan, use OP to emit values • Not all combinations can leverage the PutSortReducer (writing to one table at a time, …)
  20. Query Processing II [Diagram; labels only: Archive File, Data File, HBase, Map, Views, Reduce, Co-scanner, File, HBase]
  21. Data Subscriptions / Views • Data-in-a-View satisfaction can be checked at ingestion time, before the data is inserted • Mimicking client-side coprocessors – allowing the use of bulk loading (no coprocessors for bulk load at the moment) • When new data arrives, user functions/actions are triggered – On-demand crawls, focused crawls
  22. Second Level Triggers • User code run in the reduce phase (when ingesting): – Put f(Put p, Row r) – Previous versions are available as input to the code – Can alter the processed Put object before the final flush to HFile • Co-scanner: User-defined scanner traversing the processed region aligned with the keys in the created HFile • Example: Change detection on a re-crawled resource
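The change-detection example can be pictured as a trigger with the shape `Put f(Put p, Row r)` that compares the incoming content with what was stored for the previous version. The sketch below is self-contained and does not use HBase classes: `PendingPut` is a hypothetical stand-in for the `Put` being flushed, the stored digest stands in for the previous `Row`, and the column names `content`, `digest`, and `changed` are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a second-level trigger for change detection.
final class ChangeDetectionTrigger {

    /** Minimal stand-in for a Put: qualifier -> value for a single row. */
    static final class PendingPut {
        final Map<String, byte[]> cells = new LinkedHashMap<>();
    }

    /** SHA-256 digest of the payload. */
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Annotates the pending Put with a digest and a "changed" flag.
     *
     * @param put            the Put about to be flushed (must contain "content")
     * @param previousDigest digest stored with the previous version, or null on first crawl
     */
    static PendingPut apply(PendingPut put, byte[] previousDigest) {
        byte[] digest = sha256(put.cells.get("content"));
        boolean changed = previousDigest == null || !Arrays.equals(digest, previousDigest);
        put.cells.put("digest", digest);
        put.cells.put("changed", String.valueOf(changed).getBytes(StandardCharsets.UTF_8));
        return put;
    }
}
```

Storing the digest alongside the flag is a design choice of this sketch: the next re-crawl then only needs the small digest cell of the previous version rather than its full payload.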
  23. Aggregation Queries • Compute frequency of mime types in the collection • For a web domain, compute spammicity using word distribution and out-links • Based on a two-step aggregation process – Data is extracted in the Map(), emitted with the View signature – Data is collected in the Reduce(), grouped on the combination (view sig., aggr. key) and aggregated.
  24. Aggregation Queries Processing • Processed using MapReduce, multiple (compatible) aggregations at once (reading is the most expensive) • Aggregation map phase: List<Pairs> map(Row r), Pair=<Agg_key, value>, Agg_key=<agg_id, agg_on_value,…> • Aggregation reduce phase: reduce(Agg_key, values, context)
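To make the two phases concrete, here is a small in-memory sketch that runs two count aggregations in a single pass over the rows, which is the point of batching compatible aggregations when reading dominates the cost. It does not use the Hadoop API: rows are plain maps, the shuffle is a hash map, and the aggregation ids `mime` and `pld_mime` as well as the column names are assumptions for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of the two-step aggregation.
final class AggregationSketch {

    /** Builds Agg_key = <agg_id, agg_on_value, ...> as a list, which has value equality. */
    static List<String> key(String aggId, String... aggOnValues) {
        List<String> k = new ArrayList<>();
        k.add(aggId);
        k.addAll(Arrays.asList(aggOnValues));
        return k;
    }

    /** Map phase: one row emits a pair for every aggregation that is running. */
    static List<Map.Entry<List<String>, Long>> map(Map<String, String> row) {
        List<Map.Entry<List<String>, Long>> out = new ArrayList<>();
        out.add(new AbstractMap.SimpleEntry<>(key("mime", row.get("mime")), 1L));
        out.add(new AbstractMap.SimpleEntry<>(key("pld_mime", row.get("pld"), row.get("mime")), 1L));
        return out;
    }

    /** Reduce phase: here every aggregation is a count, so values are summed. */
    static long reduce(List<String> aggKey, List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, an in-memory shuffle grouped by key, and reduce over all rows. */
    static Map<List<String>, Long> run(List<Map<String, String>> rows) {
        Map<List<String>, List<Long>> grouped = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            for (Map.Entry<List<String>, Long> pair : map(row)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<List<String>, Long> result = new LinkedHashMap<>();
        for (Map.Entry<List<String>, List<Long>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

Because the aggregation id is the first component of the key, results of different aggregations never mix in the reduce step even though they share one job.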
  25. Aggregation Processing Numbers Compute mime type distribution of web pages per PLD (pay-level domain): SELECT extract_pld(url) AS pld, mime, count(*) FROM web_collection GROUP BY extract_pld(url), mime
  26. Data Ingestion Our crawler asynchronously writes to HBase: Input: archive files (ARC, WARC) in HDFS Output: HTable SELECT *, f1(*), ..., fn(*) FROM hdfs://path/*.warc.gz 1. Pre-compute the split region boundaries on a data sample – MapReduce on a data input sample 2. Process a batch (~0.5TB) with MapReduce ingestion 3. Manually split regions that are too big (or candidates) 4. If there is still input, go to 2.
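Step 1 can be illustrated with a simple quantile computation: sort the sampled row keys and pick evenly spaced keys as region boundaries. This is a single-machine sketch of the idea, not the MapReduce job the slide refers to; `SplitPlanner` is a hypothetical name, and it assumes ASCII row keys so that Java string order matches HBase's byte-wise key order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: derive region split points from a sample of row keys.
final class SplitPlanner {

    /**
     * Picks up to numRegions - 1 boundary keys at evenly spaced quantiles
     * of the de-duplicated, sorted sample.
     */
    static List<String> splitPoints(Collection<String> sampleKeys, int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be at least 1");
        }
        // TreeSet sorts and removes duplicate keys from the sample.
        List<String> sorted = new ArrayList<>(new TreeSet<>(sampleKeys));
        List<String> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < numRegions; i++) {
            int index = (int) ((long) i * sorted.size() / numRegions);
            String candidate = sorted.get(index);
            // A small sample can map several quantiles to the same key; keep it once.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

In an HBase deployment the resulting keys would typically be converted to `byte[][]` and supplied when the table is created, so that bulk-loaded HFiles line up with region boundaries from the first batch.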
  27. Data Ingestion Numbers • Store indexable web resources from WARC files to HBase, detect mime type, language, extract plain text and analyze RSS feeds • Reaching steady 40MB/s including extraction • Upper bound 170MB/s (distributed reading of archive files in HDFS) • HBase is idle most of the time! – Allows compacting store files in the meantime
  28. Web Archive Collection • Column families (basic set): 1. Raw content (payload) 2. Meta data (mime, IP, …) 3. Baseline analytics (plain text, detected mime, …) • Usually one additional CF per analytics result • CFs as secondary indices: – All analyzed feeds in one place (no need for a filter if I am interested in all such rows)
  29. Web Archive Collection II • More than 3,000 regions (in one collection) • 12TB of compressed indexable data (and counting) • Crawl to store/process machine ratio is 1:1.2 • Storage scales out
  30. HW Architecture • Tens of small low-consumption nodes with a lot of disk space: – 15TB per node, 8GB RAM, dual core CPU – No enclosure -> no active cooling -> no expensive datacenter-ish environment needed • Low per PB storage price (70 nodes/PB), car batteries as UPS, commodity (real low-cost) HW (esp. disks) • Still some reasonable computational power
  31. Conclusions • Conclusions – Data refinery platform – Customizable, extensible – Large scale • Future work – Incorporating external secondary indices to filter HBase rows/cells • Full text index filtering • Temporal filtering – Larger (100s TBs) scale deployment
  32. Acknowledgments • European Union projects: – LAWA: Longitudinal Analytics of Web Archive data – SCAPE - SCAlable Preservation Environments