Data Warehouse Offload

Presented at BigData.SG, October 2013

Data Warehouse Offload Presentation Transcript

  • Slide 1 (©MapR Technologies - Confidential): Data Warehouse Offload (ETL and ELT and Preprocessing, Oh My!)
  • Slide 2: Introduce Myself. John Berns, Solutions Architect, APAC for MapR. I've been involved in Big Data for three years, using Hadoop for two. (I go waaaaay back!) I'm also co-founder of BigData.SG (http://bigdata.sg) and Hadoop.SG (http://hadoop.sg). I'm a Hadoop nerd, and proud of it.
  • Slide 3: Traditional Data Warehouse
  • Slide 4: Arrival of Big Data impacts DW. Big Data brings Volume (prohibitively expensive storage costs), Variety (inability to process unstructured formats) and Velocity (faster arrival and processing needs). The DW needs to accommodate Big Data.
  • Slide 5: Scaling the Data Warehouse: MPP Databases
  • Slide 6: But There Are Some Problems Scaling. Cost: a Data Warehouse costs $$$,000s per terabyte. It works only on relational data and doesn't like unstructured data. Fixed schema: you can only query the data in ways that are predefined by the existing schema.
  • Slide 7: Accommodating Big Data. Sources: RDBMS, sensor data, web logs. FROM: the DW does ETL + long-term storage as well as query + present. TO: Hadoop does ETL + long-term storage; the DW does query + present. RDBMS: only structured data; $50K-100K per TB; limited analytics. Hadoop: both structured and unstructured data; 50x-100x cost savings at $1K per TB; expanded analytics with MapReduce, NoSQL etc.
  • Slide 8: Data Warehouse Meets Big Data. Use ELT to handle semi-structured (or even unstructured) data. ELT applies structure after the data is loaded. Use compute power to do the transformation; it can be done in parallel, and that's what Hadoop is good for! ELT for ETL: process semi-structured data and save structured data. Connect via ODBC or JDBC and execute queries on the fly.
  • Slide 9: ELT: Applying Schema on Load
    CREATE TABLE apachelog (
      host STRING, identity STRING, user STRING, time STRING,
      request STRING, status STRING, size STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
      "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
    )
    STORED AS TEXTFILE;
  • Slide 10: Read Semi-Structured Data & Create Structure
    Input line: 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
    Resulting fields: host = 127.0.0.1; identity = 1001; user = frank; time = 10/Oct/2000:13:55:36 -0700; request = GET /apache_pb.gif HTTP/1.0; status = 200; size = 2326
  • Slide 11: Accommodating Big Data. Sources: RDBMS, sensor data, web logs. FROM: the DW does ETL + long-term storage as well as query + present. TO: Hadoop does ETL + long-term storage; the DW does query + present. RDBMS: only structured data; $50K-100K per TB; limited analytics. Hadoop: both structured and unstructured data; 50x-100x cost savings at $1K per TB; expanded analytics with MapReduce, NoSQL etc.
  • Slide 12: MapR Strengths for DW Offload. Best ROI: 2x performance; no custom connectors; unlimited scale. Easiest Integration: works with existing tools; streaming ingestion and extraction. Enterprise Grade Platform: 99.999% HA; full data protection; disaster recovery.
  • Slide 13: MapR Customer Case Study. Large telecom company: deployed billing applications using Teradata; hundreds of users and applications across the enterprise. OLD (Teradata): all ETL steps done in Teradata; cost-prohibitive scaling; data warehouse team not able to handle new data formats. NEW (Hadoop and Teradata): replaced 5 out of 7 ETL steps; only hot data is stored in the EDW; existing applications not affected; extensively leverage NFS to directly ingest data into Teradata.
  • Slide 14: Data Shape (Telecom). Lots of data; lots of scans across large sets; throughput important.
  • Slide 15: Original Flow (ELTL). Diagram labels: CDR billing records; ETL; Data Warehouse; billing reports; customer bills.
  • Slide 16: With ETL Offload. Diagram labels: CDR billing records; ETL; Data Warehouse; billing reports; customer billing.
  • Slide 17: Price Performance. EDW strategy: 1.5x performance for $30 million. MapR strategy: 3x performance for $3 million. That is a 20x cost/performance advantage for the MapR strategy.
  • Slide 18: MapR Customer Case Study continued. Business impact: saved $30M in 5-year TCO; able to store all data and have a scalable architecture for the future; do not have to maintain any special connectors; a happy Ops team enhancing services for its internal customers with MapReduce; implemented the change without impacting internal users.
  • Slide 19: Wrapping It Up… My contact info: jberns@maprtech.com, http://www.linkedin.com/in/jfxberns. Find the slides at: http://www.slideshare.net. Whitepaper with more details on Data Warehouse Offload: http://www.mapr.com/solutions/data-warehouse-offload
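
The idea on slides 9 and 10, applying structure to a raw web log line only when it is read, can be tried outside Hive. What follows is a minimal Python sketch, not the presenter's code. It assumes the common seven-group Apache access-log pattern, and the names LOG_PATTERN, COLUMNS and parse_log_line are hypothetical. It returns the identity field exactly as it appears in the raw line, and it strips the brackets and quotes that the pattern captures around the time and request fields.

```python
import re

# Seven capture groups, one per column of the apachelog table on slide 9.
# Assumption: this is the usual Apache access-log pattern, written as a
# Python raw string rather than a HiveQL string literal.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) '
    r'([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)
COLUMNS = ("host", "identity", "user", "time", "request", "status", "size")


def parse_log_line(line):
    """Return a dict mapping column name to string, or None if no match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    row = dict(zip(COLUMNS, match.groups()))
    # The pattern captures the surrounding [] and "" delimiters; drop them.
    row["time"] = row["time"].strip("[]")
    row["request"] = row["request"].strip('"')
    return row


sample = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_log_line(sample)
for column in COLUMNS:
    print(column, record[column])
```

Every value stays a string, mirroring the all-STRING columns in the table definition; casting status or size to integers would be a later transformation step.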
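
The 20x figure on slide 17 and the 50x-100x figure on slides 7 and 11 can be checked with simple ratios. This is a small Python sketch of that arithmetic, assuming "cost/performance advantage" means performance gained per dollar spent; the function name perf_per_million is hypothetical, and the inputs are the numbers quoted on the slides.

```python
def perf_per_million(perf_multiplier, cost_millions):
    """Performance multiplier delivered per million dollars spent."""
    return perf_multiplier / cost_millions


# Slide 17: EDW strategy is 1.5x for $30M, MapR strategy is 3x for $3M.
edw = perf_per_million(1.5, 30)
mapr = perf_per_million(3, 3)
advantage = mapr / edw

# Slides 7 and 11: $50K-100K per TB on the RDBMS side versus $1K per TB.
storage_savings_low = 50_000 / 1_000
storage_savings_high = 100_000 / 1_000

print(f"cost/performance advantage: {advantage:.0f}x")
print(f"storage savings: {storage_savings_low:.0f}x to {storage_savings_high:.0f}x")
```

Under that reading, both ratios agree with the slides: the MapR strategy delivers twice the performance at one tenth of the cost.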