
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research

Mignify is a platform for collecting, storing, and analyzing Big Data harvested from the web. It aims to provide easy access to focused, structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage layer based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a high number of small (low-consumption) nodes. This talk covers the decisions made during the design and development of the platform, from both a technical and a functional perspective. It introduces the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism for analytics based on a declarative filter/extraction specification. The design choices are illustrated with a pilot application targeting daily Web monitoring in the context of a national domain.



  1. A Big Data Refinery Built on HBase (Stanislav Barton, Internet Memory Research)
  2. A Content-oriented Platform
     • Mignify = a content-oriented platform which:
       – Continuously (almost) ingests Web documents
       – Stores and preserves these documents as such, AND
       – Produces structured content, extracted (from single documents) and aggregated (from groups of documents)
       – Organizes raw documents, extracted information, and aggregated information in a consistent information space
     => Original documents and extracted information are uniformly stored in HBase
  3. A Service-oriented Platform
     • Mignify = a service-oriented platform which:
       – Runs its physical storage layer on "custom hardware"
       – Crawls on demand, with sophisticated navigation options
       – Is able to host third-party extractors, aggregators, and classifiers
       – Runs new algorithms on existing collections as they arrive
       – Supports search, navigation, on-demand query and extraction
     => A "Web Observatory" built around Hadoop/HBase
  4. Customers/Users
     • Web archivists – store and organize, live search
     • Search engineers – organize and refine, big throughput
     • Data miners – refine, secondary indices
     • Researchers – store, organize, and/or refine
  5. Talk Outline
     • Architecture overview
     • Use cases/scenarios
     • Data model
     • Queries / query language / query processing
     • Usage examples
     • Alternative HW platform
  6. Overview of Mignify
  7. Typical Scenarios
     • Full-text indexers – collect documents, with metadata and graph information
     • Wrapper extraction – get structured information from web sites
     • Entity annotation – annotate documents with entity references
     • Classification – aggregate subsets (e.g., domains); assign them to topics
  8. Data in HBase
     • HBase as the first-choice data store:
       – Inherent versioning (timestamped values)
       – Real-time access (cache, index, key ordering)
       – Column-oriented storage
       – Seamless integration with Hadoop
       – Big community
       – Production-ready, mature implementation
  9. Data Model
     • Data stored in one big table along with metadata and extraction results
     • Separated into column families – CFs act as secondary indices
     • Raw data under 10 MB stored in HBase; larger objects go to HDFS
     • Data stored as versioned rows
     • Values are typed
  10. Types and Schema: Where and Why
     • Initially, our collections consist of purely unstructured data
     • The whole point of Mignify is to produce a backbone of structured information
     • Done through an iterative process that progressively builds a rich set of interrelated annotations
     • Typing is important to ensure the safety and automation of the whole process
  11. Data Model II
     [Diagram: layered data model. An attribute A = <CF, Qualifier>:Type, with schema types Writable or byte[]. At the Mignify layer, a Resource groups typed attributes into versions, Version1:(A1..Ak), Version2:(Ak+1..Am), ..., each version tagged with a timestamp t. At the HTable layer these flatten into cells <(CF, Qualifier, t), value>, with values stored as byte[]. At the HBase layer, cells are persisted per column family (CF1..CFn) in HFiles.]
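The layered model above can be sketched in plain Java: a row is a sorted map from (column family, qualifier, timestamp) to a raw byte[] value, with the schema layer assigning types on top. The class and method names below are illustrative, not Mignify's actual API.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of HBase's logical cell layout as described on the slide:
// a resource row maps (column family, qualifier, timestamp) to an
// untyped byte[] value; the schema layer on top assigns types.
public class VersionedRow {
    // Keys sort by family, then qualifier, then newest timestamp first,
    // mirroring how HBase orders KeyValues within a row.
    private final TreeMap<String, byte[]> cells = new TreeMap<>();

    private static String key(String family, String qualifier, long ts) {
        // Long.MAX_VALUE - ts yields descending timestamp order per column.
        return family + "/" + qualifier + "/" + String.format("%019d", Long.MAX_VALUE - ts);
    }

    public void put(String family, String qualifier, long ts, byte[] value) {
        cells.put(key(family, qualifier, ts), value);
    }

    // Latest version of a column = first entry in its (family, qualifier) range.
    public byte[] getLatest(String family, String qualifier) {
        String prefix = family + "/" + qualifier + "/";
        Map.Entry<String, byte[]> e = cells.ceilingEntry(prefix);
        if (e == null || !e.getKey().startsWith(prefix)) return null;
        return e.getValue();
    }
}
```

A read of the latest version then only touches the first cell in the column's key range, which is why key ordering matters for real-time access.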
  12. Extraction Platform: Main Principles
     • A framework for specifying data extraction from very large datasets
       – Easy integration and application of new extractors
     • High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks
       – An extraction process specification combines these elements
     • [Currently] a single extractor engine
       – Based on the specification, data extraction is processed by a single, generic MapReduce job
  13. Extraction Platform: Main Concepts
     • Important: typing (we care about types and schemas!)
     • Input and output pipes
       – Declare data sources (e.g., an HBase collection) and data sinks (e.g., HBase, HDFS, CSV, ...)
     • Filters (Boolean operators that apply to input data)
     • Extractors
       – Take an input Resource, produce Features
     • Views
       – Combinations of input and output pipes, filters, and extractors
  14. Data Queries
     • Various data sources (HTable, data files, ...)
     • Projections using column families and qualifiers
     • Selections by HBase filters:

         FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
         Filter f1 = new SingleColumnValueFilter(
             Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"),
             CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html"));
         ret.addFilter(f1);

     • Query results are either flushed to files or back to HBase -> materialized views
     • Views are defined per collection as a set of pairs of extractors (user-defined functions) and filters
  15. Query Language (DML)
     • For each employee with a salary higher than 2,000, compute total costs:
       – SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000
       – f(hire_date, salary) = mon(today - hire_date) * salary
     • For each web page, detect mime type and language
       – For each RSS feed, get a summary
       – For each HTML page, extract plain text
     • Currently a wizard producing a JSON document
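The employee example above can be made concrete in plain Java. Reading mon() as "whole months elapsed since hire" is an assumption on our part, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class TotalCosts {
    // f(hire_date, salary) = mon(today - hire_date) * salary,
    // where mon() is taken to mean whole months elapsed (an assumption).
    static long cost(LocalDate hireDate, LocalDate today, long salary) {
        return ChronoUnit.MONTHS.between(hireDate, today) * salary;
    }

    // SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000
    static long totalCosts(LocalDate today, LocalDate[] hireDates, long[] salaries) {
        long total = 0;
        for (int i = 0; i < salaries.length; i++) {
            if (salaries[i] >= 2000) {                        // WHERE clause
                total += cost(hireDates[i], today, salaries[i]); // SELECT f(...)
            }
        }
        return total;
    }
}
```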
  16. User Functions
     • List<byte[]> f(Row r)
     • May calculate new attribute values, stored with the Row and reused by other functions
     • Execution plan: order matters!
     • Each function carries a description of its input and output fields
       – Field dependencies determine the order
  17. Input Pipes
     • Define how to get the data
       – Archive files, text files, HBase tables
       – Format
       – Mappers always receive a Resource on input; several custom InputFormats and RecordReaders
  18. Output Pipes
     • Define what to do with (where to store) the query result
       – File, table
       – Format
       – Which columns to materialize
     • Most of the time PutSortReducer is used, so the OP defines the OutputFormat and what to emit
  19. Query Processing
     • With typed data, the Resource wrapper, and IPs and OPs: one universal MapReduce job to process queries!
     • Most efficient (for data insertion): a MapReduce job with a custom Mapper and PutSortReducer
     • Job init: build the combined filter; IP and OP define the input and output formats
     • Mapper setup: initialize the plan of user function applications and the functions themselves
     • Mapper map: apply functions to a row according to the plan; use the OP to emit values
     • Not all combinations can leverage the PutSortReducer (writing to one table at a time, ...)
  20. Query Processing II
     [Diagram: query processing data flow. Input sources (archive files, HBase, data files) feed the Map phase; the Reduce phase writes to HBase, files, and views, with a co-scanner alongside.]
  21. Data Subscriptions / Views
     • Whether data satisfies a view can be checked at ingestion time, before the data is inserted
     • Mimics client-side coprocessors – allowing the use of bulk loading (no coprocessors for bulk loads at the moment)
     • When new data arrives, user functions/actions are triggered
       – On-demand crawls, focused crawls
  22. Second-Level Triggers
     • User code run in the reduce phase (when ingesting):
       – Put f(Put p, Row r)
       – Previous versions are available on the input of the code
       – Can alter the processed Put object before the final flush to an HFile
     • Co-scanner: a user-defined scanner traversing the processed region, aligned with the keys in the created HFile
     • Example: change detection on a re-crawled resource
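The change-detection example can be sketched as follows: the trigger sees the incoming value plus the previous versions of the same row and compares content digests. This is a standalone illustration with hypothetical names, not Mignify's actual trigger API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Sketch of a second-level trigger: given the re-crawled payload and the
// payloads of previous versions of the same row, decide whether the
// content changed before the record is flushed to an HFile.
public class ChangeDetector {
    static String digest(byte[] payload) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(payload))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // True when the new payload differs from every prior version.
    static boolean changed(byte[] newPayload, List<byte[]> previousVersions) {
        String d = digest(newPayload);
        for (byte[] old : previousVersions)
            if (d.equals(digest(old))) return false;
        return true;
    }
}
```

In the real pipeline the trigger could then annotate the Put (e.g., set a "changed" qualifier) before the final flush, which is the alteration the slide refers to.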
  23. Aggregation Queries
     • Compute the frequency of mime types in the collection
     • For a web domain, compute spammicity using word distribution and out-links
     • Based on a two-step aggregation process:
       – Data is extracted in Map(), emitted with the view signature
       – Data is collected in Reduce(), grouped on the combination (view sig., aggr. key) and aggregated
  24. Aggregation Query Processing
     • Processed using MapReduce; multiple (compatible) aggregations at once (reading is the most expensive part)
     • Aggregation map phase: List<Pair> map(Row r), Pair = <Agg_key, value>, Agg_key = <agg_id, agg_on_value, ...>
     • Aggregation reduce phase: reduce(Agg_key, values, context)
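The two-step aggregation can be simulated locally: the map side emits <Agg_key, value> pairs and the reduce side groups on the key and folds. The sketch below counts mime types per pay-level domain, as in the next slide's query; the class name and key encoding are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local simulation of the two-step aggregation: the "map" step emits an
// (aggregation key, value) pair per record; the "reduce" step groups on
// the key and folds. Here: count(*) of mime types per pay-level domain.
public class MimeAggregation {
    // Agg_key = <agg_id, agg_on_value, ...>; simplified here to "pld|mime".
    static Map<String, Integer> countPerPldAndMime(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {                // map phase: one emit per row
            String aggKey = row[0] + "|" + row[1]; // row = {pld, mime}
            counts.merge(aggKey, 1, Integer::sum); // reduce phase: fold counts
        }
        return counts;
    }
}
```

In the actual MapReduce job the grouping is done by the shuffle on the composite key, so several compatible aggregations can share one pass over the data.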
  25. Aggregation Processing Numbers
     • Compute the mime type distribution of web pages per PLD:
       – SELECT pld, mime, count(*) FROM web_collection GROUP BY extract_pld(url), mime
  26. Data Ingestion
     • Our crawler asynchronously writes to HBase
       – Input: archive files (ARC, WARC) in HDFS
       – Output: HTable
       – SELECT *, f1(*), ..., fn(*) FROM hdfs://path/*.warc.gz
     1. Pre-compute the region split boundaries on a data sample (MapReduce on an input sample)
     2. Process a batch (~0.5 TB) with a MapReduce ingestion job
     3. Manually split regions that are too big (or candidates)
     4. If there is still input, go to 2
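Step 1 above (pre-computing region split boundaries on a sample) amounts to picking evenly spaced quantiles from the sorted sample of row keys, so each region receives roughly the same share of the batch. A minimal sketch, with assumed names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of pre-computing region split boundaries from a sampled set of
// row keys: sort the sample and pick evenly spaced quantiles, so that
// each of numRegions regions gets roughly the same share of the batch.
public class SplitPlanner {
    static List<String> splitPoints(List<String> sampledKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // numRegions regions need numRegions - 1 boundary keys.
            splits.add(sorted.get(i * sorted.size() / numRegions));
        }
        return splits;
    }
}
```

Pre-splitting this way lets the subsequent bulk-load batch write HFiles per region without hot-spotting a single region server.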
  27. Data Ingestion Numbers
     • Store indexable web resources from WARC files to HBase; detect mime type and language, extract plain text, and analyze RSS feeds
     • Reaching a steady 40 MB/s including extraction
     • Upper bound 170 MB/s (distributed reading of archive files in HDFS)
     • HBase is idle most of the time!
       – Allows compacting store files in the meanwhile
  28. Web Archive Collection
     • Column families (basic set):
       1. Raw content (payload)
       2. Metadata (mime, IP, ...)
       3. Baseline analytics (plain text, detected mime, ...)
     • Usually one additional CF per analytics result
     • CFs as secondary indices:
       – All analyzed feeds in one place (no need for a filter if I am interested in all such rows)
  29. Web Archive Collection II
     • More than 3,000 regions (in one collection)
     • 12 TB of compressed indexable data (and counting)
     • Crawl to store/process machine ratio is 1:1.2
     • Storage scales out
  30. HW Architecture
     • Tens of small low-consumption nodes with a lot of disk space:
       – 15 TB per node, 8 GB RAM, dual-core CPU
       – No enclosure -> no active cooling -> no expensive datacenter-ish environment needed
     • Low per-PB storage price (70 nodes/PB), car batteries as UPS, commodity (really low-cost) HW (esp. disks)
     • Still some reasonable computational power
  31. Conclusions
     • Conclusions
       – Data refinery platform
       – Customizable, extensible
       – Large scale
     • Future work
       – Incorporating external secondary indices to filter HBase rows/cells
         • Full-text index filtering
         • Temporal filtering
       – Larger-scale (100s of TBs) deployment
  32. Acknowledgments
     • European Union projects:
       – LAWA: Longitudinal Analytics of Web Archive data
       – SCAPE: SCAlable Preservation Environments