
Xadoop - new approaches to data analytics


Overview of our data analytics work, given to the Microsoft SQL Server team during their visit to the Systems Group, ETH Zurich.



  1. Systems Group, Dept. of Computer Science, ETH Zurich, Switzerland
     Xadoop – new approaches to data analytics
     Lukas Blunschi, Maxim Grinev, Maria Grineva, Donald Kossmann, Georg Polzer, Kurt Stockinger (Credit Suisse)
  2. Credit Suisse Project
     - Task: analyze Oracle query logs for audit purposes
       - Log size: 6 TB of new data every 6 months
       - Typical query: who queried column A in table B in the second quarter of 2009?
       - A few queries like this, twice a year
     - Issues:
       - Storing logs in Oracle tables is slow => storing them in XML files instead
       - Queries are scan-intensive because of complex log processing (SQL parsing)
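     To make the audit query concrete, here is a minimal Python sketch of checking one XML log record. The record layout and field names (db_user, sql_text, extended_timestamp) are assumptions borrowed from the audit schema used in the Pig example later in the deck, and the substring match is a deliberate simplification of the SQL parsing the slide mentions.

```python
# Sketch: does one XML audit record query a given table in Q2 2009?
# Hypothetical record format; real processing needs actual SQL parsing,
# not a substring match.
import xml.etree.ElementTree as ET
from datetime import datetime

RECORD = """
<record>
  <db_user>alice</db_user>
  <sql_text>SELECT a FROM b</sql_text>
  <extended_timestamp>2009-05-12T09:30:00</extended_timestamp>
</record>
"""

def touches_table_in_q2_2009(xml_text, table):
    """True if this log entry mentions `table` and falls in Q2 2009."""
    rec = ET.fromstring(xml_text)
    ts = datetime.fromisoformat(rec.findtext("extended_timestamp"))
    in_q2 = ts.year == 2009 and 4 <= ts.month <= 6
    return in_q2 and table in rec.findtext("sql_text")

print(touches_table_in_q2_2009(RECORD, "b"))   # True
```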
  3. Possible Solutions
     - Build a warehouse
       - Not cost-effective for a few queries twice a year
     - Use Hadoop
       - Open-source but proven software
       - Logs are already in files
       - Easy to implement the queries and to deploy
  4. Hadoop Solution 1: Using Pig
     - Pig: a high-level data processing language compiled to MapReduce
     - Advantages:
       - It is easy to develop in Pig
       - Extensible via User Defined Functions in Java
       - Widely used by Web companies (Twitter, etc.)
     - Disadvantages:
       - Have to write a format-specific data loader to parse XML
       - Restricted support for nested queries
  5. Hadoop Solution 1: Pig Example
     - Get the users who queried table "LOGON_INFO" after a given date, sorted by number of requests:

       register ./pigxml.jar
       define DATECOMP ch.ethz.xadoop.udf.DATECOMP();
       define XMLLoader ch.ethz.xadoop.loader.XMLLoader();
       A = load 'audit.xml' using XMLLoader() as (action, audit_type, comment_text, db_user, entry_id, instance_number, object_name, object_schema, os_process, os_user, return_code, scn, session_id, sql_bind, sql_text, terminal, user_host, extended_timestamp);
       B = filter A by sql_text matches '.*LOGON_INFO.*' and DATECOMP((chararray)extended_timestamp, '2010-03-04T10:00:43.775225') > 0;
       B1 = group B by db_user;
       B2 = foreach B1 generate group, COUNT(B.sql_text) as num_of_queries;
       B3 = order B2 by num_of_queries desc;
       dump B3;
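     To make the dataflow easier to follow, here is the same pipeline rendered in plain Python over a few hand-made records: filter by table name and cutoff date, group by user, count, and order descending. The records are invented for illustration, and the distributed MapReduce execution that Pig compiles to is not modelled here.

```python
# Plain-Python rendering of the Pig dataflow above: filter -> group -> count -> order desc.
from collections import Counter
from datetime import datetime

CUTOFF = datetime.fromisoformat("2010-03-04T10:00:43")

# (db_user, sql_text, extended_timestamp) -- a tiny stand-in for audit.xml
records = [
    ("alice", "SELECT * FROM LOGON_INFO",  "2010-03-05T08:00:00"),
    ("alice", "SELECT id FROM LOGON_INFO", "2010-03-06T09:00:00"),
    ("bob",   "SELECT * FROM LOGON_INFO",  "2010-03-07T10:00:00"),
    ("bob",   "SELECT * FROM ORDERS",      "2010-03-07T11:00:00"),  # filtered: no LOGON_INFO
    ("carol", "SELECT * FROM LOGON_INFO",  "2010-03-01T12:00:00"),  # filtered: before cutoff
]

# B = filter A by sql_text matches '.*LOGON_INFO.*' and DATECOMP(...) > 0
b = [(user, sql, ts) for user, sql, ts in records
     if "LOGON_INFO" in sql and datetime.fromisoformat(ts) > CUTOFF]

# B1/B2 = group by db_user, count queries per group
counts = Counter(user for user, _, _ in b)

# B3 = order by num_of_queries desc
result = counts.most_common()
print(result)   # [('alice', 2), ('bob', 1)]
```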
  6. Hadoop Solution 1: Experiments

                   30 GB     60 GB     90 GB
     3 workers     19m 00s   40m 30s   59m 20s
     5 workers     11m 05s   26m 20s   38m 10s
  7. Hadoop Solution 2: Using XQuery
     - Xadoop is an integration of XQuery (Zorba) and Hadoop:
       - Map and Reduce are implemented in XQuery
     - Advantages:
       - No need to write a loader for XML input
       - XQuery is a powerful data processing and transformation language with support for UDFs
     - Disadvantages:
       - You have to think in terms of two programming models, MapReduce and XQuery – though in practice this turns out to be quite natural and useful
  8. Hadoop Solution 2: Using XQuery

       declare function xadoop:map($record) {
         for $r in $record
         where fn:contains($r/sql_text, "LOGON_INFO")
           and xs:date($r/extended_timestamp) > xs:date("2000-03-04")
         return (<key>{$r/db_user}</key>, <value>1</value>)
       };

       declare function xadoop:reduce($key, $num) {
         ($key, <value>{fn:count($num/value)}</value>)
       };
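     The same map/reduce pair can be sketched in Python to make the two-phase model explicit: map filters records and emits (user, 1) pairs, a simulated shuffle sorts and groups them by key, and reduce counts each group. The dict-shaped records stand in for the XML nodes and are invented for illustration.

```python
# Python sketch of the XQuery map/reduce pair: map emits (key, 1) pairs,
# a simulated shuffle groups by key, reduce counts the values per key.
from itertools import groupby
from operator import itemgetter
from datetime import date, datetime

CUTOFF = date(2000, 3, 4)

def xadoop_map(record):
    """Emit (db_user, 1) for each matching log entry."""
    ts = datetime.fromisoformat(record["extended_timestamp"]).date()
    if "LOGON_INFO" in record["sql_text"] and ts > CUTOFF:
        yield (record["db_user"], 1)

def xadoop_reduce(key, values):
    """Count the 1s emitted for this key."""
    return (key, sum(values))

records = [
    {"db_user": "alice", "sql_text": "SELECT * FROM LOGON_INFO",
     "extended_timestamp": "2010-03-05T08:00:00"},
    {"db_user": "alice", "sql_text": "SELECT * FROM LOGON_INFO",
     "extended_timestamp": "2010-03-06T08:00:00"},
    {"db_user": "bob", "sql_text": "SELECT * FROM ORDERS",
     "extended_timestamp": "2010-03-05T08:00:00"},
]

# shuffle: sort emitted pairs by key, then group -- what Hadoop does between the phases
pairs = sorted((kv for r in records for kv in xadoop_map(r)), key=itemgetter(0))
result = [xadoop_reduce(k, (v for _, v in g)) for k, g in groupby(pairs, itemgetter(0))]
print(result)   # [('alice', 2)]
```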
  9. Future Work: Vision
     - You cannot merge traditional OLAP and OLTP systems:
       - OLAP: pre-aggregated data with redundancy
       - OLTP: tends to be normalized
     - There are two trends on the Web:
       - Hadoop is often used for analytic processing instead of warehouses
       - Key-value stores are used for OLTP
     - MapReduce and key-value stores are a good match:
       - MapReduce takes raw operational data and does aggregation on the fly
       - A key-value store is a natural input for MapReduce
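     A minimal sketch of this match, using a plain Python dict as a stand-in for the key-value store: every stored row feeds map() directly, and the aggregate is computed on the fly from raw operational data, with no pre-aggregated OLAP tables. The store contents and schema are invented for illustration.

```python
# Sketch: MapReduce over a key-value store. A dict stands in for
# Cassandra/BigTable; rows are raw operational data, aggregated on the fly.
from collections import defaultdict

# operational store: order_id -> row (OLTP-style, no pre-aggregation)
kv_store = {
    "order:1": {"customer": "alice", "amount": 30},
    "order:2": {"customer": "bob",   "amount": 20},
    "order:3": {"customer": "alice", "amount": 50},
}

def map_fn(key, row):
    """Emit (customer, amount) for each stored row."""
    yield (row["customer"], row["amount"])

def reduce_fn(key, values):
    """Aggregate: total amount per customer."""
    return (key, sum(values))

groups = defaultdict(list)
for key, row in kv_store.items():          # every KV row is a map input record
    for k, v in map_fn(key, row):
        groups[k].append(v)

totals = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(totals)   # {'alice': 80, 'bob': 20}
```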
  10. Future Work: Issues
      - Running Hadoop MapReduce over the Cassandra key-value store:
        - An "SQL/XQuery" over the Cassandra/BigTable data model, compiled to MapReduce
        - How to share resources (CPU, I/O) to support both transactional and analytical workloads over the same store
      - Real-time analytics:
        - From pull (batch) to push (online) processing models
        - Hadoop is slow but can be optimized (e.g. checkpointing into the main memory of another cloud machine instead of to local disk)