Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson


Published in: Technology
  • MapR’s innovations have also expanded the use cases that are possible with Hadoop. In addition to supporting the full Hadoop API set, MapR supports NFS, so any file-based application can access the cluster with no changes or rewrites. MapR provides ODBC support, so any database application or SQL-based tool can access and manipulate data in a MapR cluster. MapR also supports real-time streaming access, which moves Hadoop applications beyond the batch limitation. Finally, MapR’s full HA, DR, and data protection capabilities allow mission-critical apps to be deployed safely and allow administrators to meet stringent SLA targets.
  • Because only MapR can reliably run both operational and analytical applications on one platform/cluster, MapR enables a faster closed-loop process between operational applications and analytics. This means:
      – Interactive marketers and algorithms can update the rules engines more quickly and provide more real-time targeting of offers and relevant content to consumers.
      – Fraud models are kept more up to date with the latest patterns, to better detect anomalies and act more quickly on bad actors.

    1. © 2014 MapR Technologies
    2. Agenda
        • Data Warehouse Offload Use Case
        • How Is This Achieved?
        • Please Do More!
    3. BIG DATA
    4. Your Enterprise Data Warehouse (in reality)
        • Diagram labels: Analytics, ETL
    5. Data Warehouse Optimization
        • Source data: billing systems
        • Current ETL pipeline (staging, data warehouse): Extract, Clean, Conform, Transform, Normalize, Present, Access
        • Proposed hybrid solution pipeline (Hadoop, data warehouse): Extract, Clean, Conform, Transform, Normalize, Present, Access
    6. Leveraging Big Data with Hadoop
        • FROM: RDBMS
            – Only structured data
            – $10K to $60K per TB
            – Limited analytics
            – 70% of cycles for ETL
        • TO: Hadoop + RDBMS (sources now include sensor data and web logs alongside the DW)
            – Both structured and unstructured data
            – 50x-100x cost savings: ~$333 per TB
            – Claim 20-30% of your data warehouse space back
            – Expanded analytics with MapReduce, NoSQL, etc.
        • Division of work: ETL + long-term storage; DW query + present
        • Hadoop: ETL + long-term storage
            – No SPOF
            – Fully protected
            – Mirrored
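The per-terabyte figures on this slide can be compared with a few lines of arithmetic. This is only a sketch using the numbers quoted above; the constant and function names are illustrative and do not come from the deck.

```python
# Per-terabyte costs quoted on the slide, in US dollars.
RDBMS_COST_PER_TB = (10_000, 60_000)   # low and high end of the quoted range
HADOOP_COST_PER_TB = 333               # the "~$333 per TB" figure


def cost_ratio(rdbms_cost: float, hadoop_cost: float = HADOOP_COST_PER_TB) -> float:
    """Return how many times more expensive the RDBMS terabyte is."""
    return rdbms_cost / hadoop_cost


low_ratio, high_ratio = (cost_ratio(cost) for cost in RDBMS_COST_PER_TB)
print(f"RDBMS costs {low_ratio:.0f}x to {high_ratio:.0f}x more per TB")
```

With the quoted inputs the ratio works out to roughly 30x to 180x, and the slide's 50x-100x claim sits inside that range.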
    7. Results of TCO Evaluation
        • CapEx: cost avoidance for annual data warehouse adds
        • Storage: 20x storage, good for next 5 years
        • Cost: 100x cost reduction
        • Scale-out architecture: new nodes can be added on the fly
        • No disruption: hybrid solution ensures no change to upstream/downstream business systems
        One-time Hadoop investment of ~$6.5M provides $33.9M cost savings.
        Solution technology (5-year contract):
        • Existing data warehouse: $67M
        • New hybrid (data warehouse + Hadoop): $33M
        • Total cost savings: $34M
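The TCO figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted on the slide, in millions of US dollars. The variable names are illustrative, and whether the $33M hybrid figure already includes the ~$6.5M Hadoop investment is not stated in the deck.

```python
# Five-year contract figures from the slide, in millions of US dollars.
EXISTING_DW_5YR = 67.0      # existing data warehouse
HYBRID_5YR = 33.0           # new hybrid: data warehouse + Hadoop
HADOOP_INVESTMENT = 6.5     # quoted one-time Hadoop investment
QUOTED_SAVINGS = 33.9       # quoted cost savings

savings = EXISTING_DW_5YR - HYBRID_5YR                 # difference of the two table rows
savings_pct = savings / EXISTING_DW_5YR * 100          # savings as a share of the existing cost
return_multiple = QUOTED_SAVINGS / HADOOP_INVESTMENT   # savings per dollar invested

print(f"Savings: ${savings:.1f}M ({savings_pct:.1f}% of the existing contract)")
print(f"Savings per dollar of Hadoop investment: {return_multiple:.1f}x")
```

The table rows give a $34M difference, which matches the quoted $33.9M after rounding.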
    8. How is this Achieved?
    9. Step 1: Admit You Have A Problem
        EVERYTHING IS AWESOME!
    10. Start Playing Around
        • Dump some of your raw data into Hadoop
            – Just use ‘cp’
        • Convert your ETL SQL to HiveQL
            – 90% unchanged
            – 5% HiveQL semantics
            – 5% Optimization
        • Bulk Load Cleansed Data into EDW
            – Use existing bulk loaders
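The first step above ("just use cp") can be sketched in Python, assuming the cluster is exposed as an ordinary directory through the NFS access described elsewhere in the deck. The function name `land_raw_files` and every path here are hypothetical. Temporary directories stand in for the real export folder and mount point so the sketch runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def land_raw_files(source_dir: Path, cluster_dir: Path, pattern: str = "*.csv") -> List[Path]:
    """Copy raw extract files into a directory on the (assumed) NFS-mounted cluster."""
    cluster_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source_dir.glob(pattern)):
        # Plain file copy: the programmatic equivalent of `cp src cluster_dir/`.
        copied.append(Path(shutil.copy2(src, cluster_dir / src.name)))
    return copied


# Demo with temporary directories standing in for real locations.
with tempfile.TemporaryDirectory() as tmp:
    exports = Path(tmp) / "exports"        # hypothetical source-system export folder
    mount = Path(tmp) / "cluster" / "raw"  # hypothetical NFS mount of the cluster
    exports.mkdir()
    (exports / "trades.csv").write_text("ITT,11/01/2011,08:46:01.827,17.44,200\n")
    landed = land_raw_files(exports, mount)
    print([path.name for path in landed])
```

No Hadoop-specific client is involved in this step, which is the point the slide makes. The HiveQL conversion and the bulk load into the EDW are separate steps and are not shown here.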
    11. What Changed?
        • Traditional Architecture: data blocks on SAN/NAS; function in the RDBMS and in the apps
        • Distributed Computing: each node pairs data with function
    12. Business Reasons
        • ETL Window
            – 60 hours of load time… every day
            – Embarrassingly Parallel
        • Cost of EDW
            – 20x, 50x, 100x reduction
        • Complex analytics
            – Compute is Essentially Free
            – Some models / algorithms / queries don’t fit relational models
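"Embarrassingly parallel" means each partition of the load can be cleansed without looking at any other partition, so the work spreads across workers with no coordination. A minimal sketch follows, with Python threads standing in for cluster nodes. The cleansing rule, function names, and worker counts are all illustrative assumptions, not the deck's pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def clean_record(line: str) -> str:
    """Trim whitespace around each comma-separated field (a stand-in cleansing rule)."""
    return ",".join(field.strip() for field in line.split(","))


def clean_partition(lines: List[str]) -> List[str]:
    """Cleanse one partition; it needs nothing from any other partition."""
    return [clean_record(line) for line in lines]


def parallel_clean(partitions: List[List[str]], workers: int = 4) -> List[List[str]]:
    """Cleanse all partitions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_partition, partitions))


def ideal_window_hours(serial_hours: float, workers: int) -> float:
    """Best-case load window if the work divides evenly with no overhead."""
    return serial_hours / workers


print(parallel_clean([[" ITT , 17.44 , 200 "], ["ITT,17.29 ,250"]]))
print(ideal_window_hours(60, 20))
```

The second function gives only the ideal case: 60 hours of serial load spread over 20 workers would take 3 hours if nothing else limited throughput. Real clusters lose some of that to data skew, I/O, and scheduling.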
    13. Easy Integration with the Enterprise
        • Real-time applications
        • NFS for file-based applications
        • Hadoop APIs for Hadoop applications
        • ODBC & JDBC for SQL-based applications
        • Mission critical and SLA dependent applications
    14. Interactive SQL-on-Hadoop: You Have Options!
        Values are listed in this engine order: Drill 1.0 | Hive 0.13 with Tez | Impala 1.x | Presto 0.56 | Shark 0.8 | Vertica
        • Latency: Low | Medium | Low | Low | Medium | Low
        • Files: Yes (all Hive file formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (RC, Sequence, Text) | Yes (all Hive file formats) | Yes (all Hive file formats)
        • HBase/M7: Yes | Yes | Various issues | No | Yes | No
        • Schema: Hive or schema-less | Hive | Hive | Hive | Hive | Proprietary or Hive
        • SQL support: ANSI SQL | HiveQL | HiveQL (subset) | ANSI SQL | HiveQL | ANSI SQL + advanced analytics
        • Client support: ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, …
        • Large joins: Yes | Yes | No | No | No | Yes
        • Nested data: Yes | Limited | No | Limited | Limited | Limited
        • Hive UDFs: Yes | Yes | Limited | No | Yes | No
        • Transactions: No | No | No | No | No | Yes
        • Optimizer: Limited | Limited | Limited | Limited | Limited | Yes
        • Concurrency: Limited | Limited | Limited | Limited | Limited | Yes
    15. Structured and Semi-structured - JOIN
        trades.csv
            ITT,11/01/2011,08:46:01.827,17.44,200,P,T,00,2323,N,C,,,
            ITT,11/01/2011,09:04:01.185,17.29,250,P,T,00,2804,N,C,,,
            ITT,11/01/2011,09:08:08.997,16.97,200,T,FT,00,2950,N,C,,,
            ITT,11/01/2011,09:30:00.375,17.02,700,T,O X,00,5216,N,C,,,
            ITT,11/01/2011,09:30:00.375,17.02,700,T,Q,00,5217,N,C,,X,
            ITT,11/01/2011,09:30:30.160,16.95,100,P,F,00,9247,N,C,,,
            ITT,11/01/2011,09:30:33.362,16.95,200,P,@,00,9590,N,C,,,
            ITT,11/01/2011,09:30:33.362,16.98,400,P,@,00,9591,N,C,,,
            ITT,11/01/2011,09:30:33.362,16.99,100,P,@,00,9592,N,C,,,
            ITT,11/01/2011,09:30:33.366,16.99,800,P,@,00,9594,N,C,,,
        equities.json
            { "symbol" : "ITT", "exchange" : "NYSE", "company" : { "name" : "ITT Corporation", "country" : "United States" } }
    16. Structured and Semi-structured - JOIN
        ADD JAR /home/ec2-user/brad/csv-serde-1.1.2-0.11.0-all.jar;
        ADD JAR /home/ec2-user/brad/json-serde-1.1.7.jar;
        SELECT, sum(t.volume) as total_volume
        FROM trades t INNER JOIN equities e ON t.symbol=e.symbol
        GROUP BY ;
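The shape of this join can be tried locally without a cluster. The sketch below uses Python's built-in `sqlite3` in place of Hive and loads the first three sample trades plus the sample equity record. Several things are assumptions: the transcript does not name the CSV columns, so the fifth field is treated as the trade volume, and the grouping column is missing from the transcribed query, so the company name is used as a plausible stand-in. The function name is illustrative.

```python
import csv
import io
import json
import sqlite3

# First three rows of the sample trades.csv and the sample equities.json record.
TRADES_CSV = """\
ITT,11/01/2011,08:46:01.827,17.44,200,P,T,00,2323,N,C,,,
ITT,11/01/2011,09:04:01.185,17.29,250,P,T,00,2804,N,C,,,
ITT,11/01/2011,09:08:08.997,16.97,200,T,FT,00,2950,N,C,,,
"""
EQUITIES_JSON = ('{ "symbol" : "ITT", "exchange" : "NYSE", "company" : '
                 '{ "name" : "ITT Corporation", "country" : "United States" } }')


def total_volume_by_company(trades_csv, equities_json_lines):
    """Join flat CSV trades to nested JSON equities and sum volume per company."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, volume INTEGER)")
    conn.execute("CREATE TABLE equities (symbol TEXT, company_name TEXT)")
    # Assumption: field 0 is the symbol and field 4 is the volume.
    trades = [(row[0], int(row[4])) for row in csv.reader(io.StringIO(trades_csv)) if row]
    conn.executemany("INSERT INTO trades VALUES (?, ?)", trades)
    # Flatten the nested JSON field company.name into a plain column.
    equities = [json.loads(line) for line in equities_json_lines if line.strip()]
    conn.executemany("INSERT INTO equities VALUES (?, ?)",
                     [(e["symbol"], e["company"]["name"]) for e in equities])
    rows = conn.execute(
        "SELECT e.company_name, SUM(t.volume) AS total_volume "
        "FROM trades t INNER JOIN equities e ON t.symbol = e.symbol "
        "GROUP BY e.company_name"
    ).fetchall()
    conn.close()
    return rows


print(total_volume_by_company(TRADES_CSV, [EQUITIES_JSON]))
```

In Hive the two SerDe JARs do the parsing that `csv.reader` and `json.loads` do here, which lets the query treat both files as tables directly.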
    17. Please Do More.
    18. Analytics + Operational Apps
        • Operational applications: real-time ad targeting, web application server, mobile application server
        • Real-time and actionable analytics: Customer 360 dashboard, data exploration (SQL), real-time churn prevention, product/service optimization and personalization
        • Data:
            – User profiles and state
            – User interactions
            – Real-time location data
            – Web and mobile session state
            – Comments/rankings
        • Diagram labels: Cloud services, Hadoop (MapR), Real-time
    19. Financial Services
        • Real-time operational applications: fraud detection, personalized offers
        • Analytics: fraud investigation tool, clickstream analysis
        • People: fraud investigator, interactive marketer
        • Data: fraud model, recommendations table, online transactions
        • Platform: MapR Distribution for Hadoop
    20. Waste & Recycling Leader: Architecture
        • Sources: trucks sending geolocation data to MapR
        • Real-time stream processing (Apache Storm): online alerts
        • Batch processing (MapReduce): tax reduction reporting
        • Shortest path graph algorithm (Titan): route optimization
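The slide names a shortest-path graph algorithm for route optimization but does not say which one or show how it runs on Titan. As a generic illustration only, here is Dijkstra's algorithm in plain Python over a small, invented road graph. The stop names and travel costs are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def shortest_path(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm: return (total_cost, path), or (inf, []) if unreachable."""
    queue = [(0.0, start, [start])]   # entries are (cost so far, node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue                  # a cheaper route to this node was already expanded
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical road network: edge weights are travel costs between stops.
ROADS: Graph = {
    "depot": {"stop_a": 4.0, "stop_b": 1.0},
    "stop_b": {"stop_a": 2.0, "landfill": 7.0},
    "stop_a": {"landfill": 3.0},
    "landfill": {},
}

print(shortest_path(ROADS, "depot", "landfill"))
```

In this toy graph the cheapest route runs depot, stop_b, stop_a, landfill at a cost of 6.0, beating the direct depot to stop_a leg. A graph database would run the same kind of traversal over a much larger road network.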
    22. Please do more!
        Q&A
        @mapr, maprtech, MapR, maprtech, mapr-technologies