How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million

  • 3,428 views
A Fortune 100 company recently introduced Hadoop into their data warehouse environment and ETL workflow to save $30 Million. This session examines the specific use case to illustrate the design considerations, as well as the economics behind ETL offload with Hadoop. Additional information about how the Hadoop platform was leveraged to support extended analytics will also be referenced.

  • MapR combines the best of the open source technology with our own deep innovations to provide the most advanced distribution for Apache Hadoop. MapR’s team has a deep bench of enterprise software experience with proven success across storage, networking, virtualization, analytics, and open source technologies. Our CEO has driven multiple companies to successful outcomes in the analytic, storage, and virtualization spaces. Our CTO and co-founder M.C. Srivas was most recently at Google on BigTable; he understands the challenges of MapReduce at huge scale. Srivas was also the chief software architect at Spinnaker Networks, which came out of stealth with the fastest NAS storage on the market and was quickly acquired by NetApp. The team includes experience with enterprise storage at Cisco, VMware, IBM, and EMC. Our VP of Engineering was the senior vice president at Informatica, where he built and managed a large R&D team of 250 that spanned four geographies, with annual revenues of $300M. We also have experience at business intelligence and analytics companies, and open source committers in Hadoop, ZooKeeper, and Mahout, including PMC members. MapR is proven technology, with installs at leading Hadoop installations across industries and OEM agreements with EMC and Cisco.
  • Need a platform that serves the broadest set of use cases…
  • MapReduce is a paradigm shift: it moves the processing to the data. Apache Hadoop is a software framework that supports data-intensive distributed applications. Hadoop was inspired by a published Google MapReduce whitepaper. Apache Hadoop provides a new platform to analyze and process Big Data. With data growth exploding and new unstructured sources of data expanding, a new approach is required to handle the volume, variety, and velocity of this growing data. Hadoop clustering exploits commodity servers and increasingly less expensive compute, network, and storage. Google is the poster child for the power of MapReduce. They were the 19th search engine to enter the market; there were 18 more successful companies, and within 2 years Google was the dominant player. That’s the power of the MapReduce framework.
---------------------------
Long version: A poster child for this is Google. We now take Google’s dominance for granted, but when Google launched their beta in 1998 they were late — at least the 19th search engine on the market. Yahoo was dominant; there were Infoseek, Excite, Lycos, Ask Jeeves, and AltaVista (which had the technical cred). It wasn’t until Google published a paper in 2003 that we got a glimpse at their back-end architecture. Google was able to reach dominance because they recognized the paradigm shift early on: they were able to index more data, get better results, and do it much more efficiently and cost-effectively than their competitors. They went from 19th to first in a few short years because of MapReduce. A Yahoo engineer by the name of Doug Cutting read that same paper in 2003 and developed a Java implementation of MapReduce, named after his son’s stuffed elephant, that became the basis for the open source Hadoop project. Now when we say Hadoop we’re talking about a robust ecosystem: there are multiple commercial versions of Hadoop, and a complete stack that includes job management, development tools, schedulers, machine learning libraries, etc. MapR’s co-founder and CTO was at Google, where he was in charge of the BigTable group and understands MapReduce at scale. Our charter was to fix the underlying flaws of the Hadoop implementation to make it appropriate for a broader set of applications and work for most organizations.
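The MapReduce model described in these notes (move the computation to the data, then run parallel map and reduce phases over key/value pairs) can be illustrated with a minimal word-count sketch in plain Python. The function names and the in-memory shuffle here are illustrative stand-ins for what a Hadoop cluster distributes across many machines, not MapR or Hadoop API calls:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because each mapper works only on its local slice of the input and reducers work independently per key, the same three functions scale out across a cluster without changing the algorithm — that is the paradigm shift the notes describe.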
  • Let’s start with this chart, to reinforce that you’re in the right room and picked the right session. Hadoop is not only the fastest growing Big Data technology — it is one of the fastest growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
  • Databases and data warehouses are growing and exceeding capacity too quickly. Inactive data is consuming storage and degrading performance. Low-density, low-priority data is disproportionately consuming storage and processing capacity. Batch windows are hitting their limits, putting SLAs at risk. Extracts put too much load on source systems, adding to expense. And not all of the required data is in the data warehouse.
  • With MapR, Hadoop is lights-out data center ready. MapR provides five nines (99.999%) of availability, including support for rolling upgrades, self-healing, and automated stateful failover; MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business continuity features. MapR provides point-in-time recovery to protect against application and user errors. There is end-to-end checksumming, so data corruption is automatically detected and corrected with MapR’s self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations: every two weeks an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.

Transcript

  • 1. 1©MapR Technologies. All rights reserved. How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million Rob Rosen Sr. Director, Americas Systems Engineering MapR Technologies
  • 2. 2©MapR Technologies. All rights reserved. MapR Overview  Enterprise-grade platform for Hadoop  Deployed at thousands of companies – Including 12 of the Fortune 100  MapR is the preferred analytics platform – Hundreds of billions of events daily – 90% of the world’s Internet population monthly – $1 trillion in retail purchases annually
  • 3. 3©MapR Technologies. All rights reserved. Arrival of Big Data Impacts Data Warehouse Data Warehouse Volume Variety Velocity Prohibitively expensive storage costs Inability to process unstructured formats Faster arrival and processing needs
  • 4. 4©MapR Technologies. All rights reserved. Top Concern for Big Data Multiple data sources Multiple technologies Multiple copies of data “Too many different types, sources, and formats of critical data”
  • 5. 5©MapR Technologies. All rights reserved. The Hadoop Advantage  Fueling an industry revolution by providing infinite capability to store and process Big Data  Expanding analytics across data types  Compelling economics – 20 to 100X more cost effective than alternatives Pioneered at
  • 6. 6©MapR Technologies. All rights reserved. Important Drivers for Hadoop  Data on compute drives efficiencies and better analytics  With Hadoop you don’t need to know what questions to ask beforehand  Simple algorithms on Big Data outperform complex models  Powerful ability to analyze unstructured data
  • 7. 7©MapR Technologies. All rights reserved. Hadoop is the Technology of Choice for Big Data
  • 8. 8©MapR Technologies. All rights reserved. Source Data Social Media, Web Logs Machine Device, Scientific Documents and Emails Batch ETL Transactions, OLTP, OLAP Enterprise Data Warehouse Raw data or infrequently used data consuming capacity Batch windows hitting their limits putting SLAs at risk Databases and data warehouses are exceeding their capacity too quickly How Do You Lower and Control Data Warehouse Costs? Datamarts ODS Traditional Targets
  • 9. 9©MapR Technologies. All rights reserved. Source Data Traditional Targets Social Media, Web Logs Machine Device, Scientific Documents and Emails Transactions, OLTP, OLAP Enterprise Data Warehouse Lower Data Management Costs RDBMS MDM
  • 10. 10©MapR Technologies. All rights reserved. Bottom-Line Impact Sensor Data Web Logs Hadoop RDBMS DW ETL + Long Term Storage Query + Present Benefits:  Both structured and unstructured data  Expanded analytics with MapReduce, NoSQL, etc.

        Solution                       Cost / Terabyte   Hadoop Advantage
        Hadoop                         $333              (baseline)
        Teradata Warehouse Appliance   $16,500           50x savings
        Oracle Exadata                 $14,000           42x savings
        IBM Netezza                    $10,000           30x savings
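The savings multipliers on this slide follow directly from the per-terabyte figures, with Hadoop as the baseline. A quick sketch (costs copied from the slide's table) reproduces them:

```python
# Per-terabyte storage costs from the slide; Hadoop is the baseline.
costs = {
    "Hadoop": 333,
    "Teradata Warehouse Appliance": 16_500,
    "Oracle Exadata": 14_000,
    "IBM Netezza": 10_000,
}

baseline = costs["Hadoop"]
# Cost ratio of each alternative relative to Hadoop, rounded as the slide does.
advantage = {
    name: round(cost / baseline)
    for name, cost in costs.items()
    if name != "Hadoop"
}
print(advantage)
# {'Teradata Warehouse Appliance': 50, 'Oracle Exadata': 42, 'IBM Netezza': 30}
```

The rounded ratios (50x, 42x, 30x) match the "Hadoop Advantage" column, which is how the slide arrives at its 30-50x per-terabyte claims.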
  • 11. 11©MapR Technologies. All rights reserved. What is the Best Way to Deploy Hadoop? vs. • Highly available and fully protected data • Works with existing tools • Real-time ingestion and extraction • Archive data from data warehouse Transitory Data Store • No long-term scale advantages • Unprotected data • ETL Tool focus Permanent Data Store Enterprise Data Hub
  • 12. 12©MapR Technologies. All rights reserved. An Enterprise Data Hub  Combine different data sources  Minimize data movement  One platform for analytics Sales SCM CRM Public Web Logs Production Data Sensor DataClick Streams Location Social Media Billing Enterprise Data Hub
  • 13. 13©MapR Technologies. All rights reserved. Key Elements of Enterprise Data Hub 99.999% HA Data Protection Disaster Recovery Scalability & Performance Enterprise Integration Multi- tenancy Enterprise-grade platform for the long term • Reliability to support stringent SLAs • Protection from data loss and user or application errors • Support business continuity and meet recovery objectives
  • 14. 14©MapR Technologies. All rights reserved. High Availability and Dependability Reliable Compute  Automated stateful failover  Automated re-replication  Self-healing from HW and SW failures  Load balancing  Rolling upgrades  No lost jobs or data  99.999% uptime Dependable Storage • Business continuity with snapshots and mirrors • Recover to a point in time • End-to-end checksumming • Strong consistency • Data safe • Mirror across sites to meet Recovery Time Objectives
  • 15. 15©MapR Technologies. All rights reserved. Enterprise Data Hub Supports a Range of Applications 99.999% HA Data Protection Disaster Recovery Scalability & Performance Enterprise Integration Multi- tenancy Batch Interactive Real-time Self-healing Instant recovery Snapshots for point in time recovery from user or application errors Unlimited files & tables Record setting performance Direct data ingestion and access Fully compliant ODBC access and SQL-92 support Mirroring across clusters and the WAN Secure access to multiple users and groups
  • 16. 16©MapR Technologies. All rights reserved. Business Impact  Saved millions in TCO  10x faster, 100x cheaper  Maintain the same SLAs  Implemented the change without impacting users Summary
  • 17. 17©MapR Technologies. All rights reserved. Q & A Engage with us! @mapr mapr- technologies maprtech MapR maprtech rrosen@maprtech.com