Offload the Data Warehouse in the Age of Hadoop


The data warehouse is likely your largest CAPEX and OPEX item -- and if you haven't checked your warehouse capacity & utilization lately, spare capacity is likely running low.

Thanks to Big Data & the advent of Hadoop, it no longer makes economic sense to process bulk data transformations (often called ELT -- Extract, Load & Transform) using data warehouse compute.

Join others who have already offloaded storage & processing from Teradata, Oracle, Netezza & DB2 onto Hadoop to save millions by avoiding upgrades!

Offloading makes your data warehouse run faster for critical end-user queries & frees up storage for Big Data -- but how do you make the jump? What transformations are costing you the most? What data in your warehouse are you not using?
Learn how you can:

Find dormant data. Up to 50% of the data in your data warehouse and data marts is never queried by business users -- but you need the right tools to find it.
Identify transformations to offload. Quickly find out which ELT transformations you should shift to Hadoop.
Manage data movement & processing to Hadoop. Easily collect, process & distribute data in Hadoop with an intuitive graphical user interface. No coding or scripting required.
Deliver faster Hadoop performance per node. Find out how capabilities in the Apache core can help you accelerate batch Hadoop processing by up to 30% on existing hardware with no code changes, & without risk.
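
Finding dormant data comes down to comparing what gets loaded with what actually gets queried. The following is a minimal Python sketch of that idea, not a description of how any vendor tool works. The record layout, the table names, and the `find_dormant_tables` helper are all hypothetical, and a real analysis would have to parse the warehouse's own query logs.

```python
from datetime import date

# Hypothetical activity records: (table, operation, day).
# In practice these would be derived from the warehouse's query log.
ACTIVITY = [
    ("sales_fact", "LOAD", date(2014, 3, 1)),
    ("sales_fact", "SELECT", date(2014, 3, 2)),
    ("clickstream_raw", "LOAD", date(2014, 3, 1)),
    ("clickstream_raw", "LOAD", date(2014, 3, 2)),
    ("loan_history", "LOAD", date(2014, 3, 2)),
    ("loan_history", "SELECT", date(2013, 6, 15)),  # queried only before the window
]


def find_dormant_tables(activity, since):
    """Return tables that were loaded on or after `since` but never queried in that window."""
    loaded, queried = set(), set()
    for table, operation, day in activity:
        if day < since:
            continue  # ignore activity outside the observation window
        if operation == "LOAD":
            loaded.add(table)
        elif operation == "SELECT":
            queried.add(table)
    return sorted(loaded - queried)


if __name__ == "__main__":
    for table in find_dormant_tables(ACTIVITY, since=date(2014, 1, 1)):
        print(table)
```

Tables flagged this way are candidates either for dropping their batch loads or for archiving on Hadoop, depending on whether the history must be retained.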







  • Jennifer to address this slide: announce the session, introduce the speakers, and explain the Q&A format.
  • Back when I started my career in data warehousing in the ’90s, this is what the business was promised. An enterprise data warehouse (EDW) would bring together data from every source system across an organization to create a single, trusted source of information. Data would be extracted, transformed and loaded into the warehouse using ETL tools. These would be used instead of hand-coded SQL, COBOL or other scripts because they would provide a graphical user interface that allowed anyone to develop flows, with no need for rocket scientists; scalability to handle the growing data volumes; metadata to enable re-use and sharing; and connectivity to the different sources and targets. ETL would then be used to move data from the EDW to the marts, and from there it would be delivered to reporting tools.
  • So here’s the reality of data warehouses today. As one customer recently described it to me, their data warehouse has become like a huge oil tanker: slow moving and incredibly difficult to turn. Because of data volume growth, the majority of ETL tools, commercial and open source, were unable to handle the processing within the batch windows. As a result, the only engines capable of handling the data volumes were the database engines, thanks to their optimizers. So transformation was pushed into the source, target and especially the enterprise data warehouse databases as hand-coded SQL or BTEQ. This so-called ELT meant that many ETL tools became little more than expensive schedulers. The use of ELT resulted in a spaghetti-like architecture, and end users see the effect directly: a request to the warehouse team for a new report or a new column involves, on average, a 6-month delay. With so much hand-coded SQL, adding a new column becomes incredibly complex. It has to be added to the enterprise data model and to the warehouse schema, all the existing ELT scripts need to be modified, and SLAs get abandoned.
  • As you can see in the chart, as ELT has grown, end-user reporting and analytics have had to compete with it for database storage and capacity. Databases are great for the classic use of SQL: big input, big input, small result set. That is exactly what you want when creating an aggregated view in a reporting tool, but SQL is not ideal for ETL, where it’s typically big input, big input, even bigger output. At first there was little contention, because the analysts and warehouse business users ran queries during the day and ELT could run at night in the overnight batch window. But as data volumes increased, the batch runs started running into the day, and then during the day. Today many companies have more ELT to do than can fit into their overnight batch window, so they are always trying to catch up, and if a load fails it can literally be months before they recover. It also creates a death spiral: you move your best resources to tuning the ELT SQL to improve performance, so your less skilled resources hand code the new ELT, which then needs to be tuned by your best resources. Every step of this hinders agility and increases cost...
  • Steve, you certainly bring up excellent points on how ELT processes are driving up data warehousing costs. Our experience in analyzing data usage at large organizations shows that a significant amount of data is not being used, yet is still loaded on a daily basis. Dormant data takes up storage capacity, but the bigger impact is the processing capacity, in terms of CPU and I/O, that is wasted running ELT on the data warehouse to load data that the business does not actively use. Admittedly, in many situations organizations are required for regulatory reasons to maintain a history of data even if it is not being used. So the best approach to significantly cutting data warehousing costs is this: eliminate batch loads for data that is neither used nor needed. More importantly, offload the ELT processes for unused data that needs to be maintained: run them entirely on Hadoop, and actively archive that unused data on Hadoop. This way you can recover all the wasted capacity from your expensive data warehouse systems.
  • Thanks, Santosh. So just to summarize, there are 4 dimensions to this problem. First, you’ll see that you’re missing SLAs, as ELT competes with end-user queries and analytics in the warehouse. Next, you’ll see that the warehouse team implements a data retention window. This is because there’s not enough space and it’s not cost effective to store all the data people want, so instead of the entire history you keep a rolling retention window, sometimes as small as a few days or weeks. Third, on average today it takes 6 months to add a new report or a column to the warehouse. Customers describe this as the onion effect: each layer gets added because nobody wants to change the layer beneath, but when you have to, it makes everyone involved cry. Then finally you have the constant upgrade cycle. Because of data growth, the second you’ve completed an upgrade you’re already planning for your next one. The tough thing is selling this to your CFO. If you have to explain that you need to spend another $3 million on the warehouse, and he asks why, and you have to explain it’s so the same report that ran yesterday will still run tomorrow, that’s not a good business case.
  • So as we’ve discussed, there’s the reality of what happens today in most data warehouses: the “before” seen here, where ETL and ELT in the database are the norm. But as Teradata’s CEO Mike Koehler remarked on a recent earnings call, they have found that ETL consumes about 20% to 40% of the workload of their Teradata data warehouses, with some outliers below and above that range. Teradata thinks that 20% of that 20% to 40% of ETL workload being done on Teradata is a good candidate for moving to Hadoop. Now, I personally have been involved with ETL my entire career, over 15 years now, and in my experience the ELT workload of most data warehouse databases is at least double that, so between 40% and 60%. Many of the customers we’re working with aren’t looking to move 20% but rather 100% of that ELT into Hadoop. But even if you could free up only 20% of your capacity, that still means you could postpone any major multi-million-dollar upgrades of Teradata, DB2, Oracle etc. for a long time. So we’re seeing more and more customers adopt an architecture where the staging area gets migrated to an enterprise data hub in Hadoop. The staging area is the dirty secret of every data warehouse: it is where the drops from data sources get stored and where a lot of the heavy lifting occurs as ELT. The result is then moved to the existing data warehouse, which now has more capacity, or direct to reporting tools.
  • Now what’s really interesting about ETL and ELT is that the workload tends to be very transformation intensive: sorts, joins, merges, aggregations, partitioning, compression etc. But the 80/20 rule applies: 20% of your ETL and ELT is what consumes 80% of your batch window, resources, tuning etc. The screenshot on the right is from a real customer. This diagram, which they called the “Battlestar Galactica Cylon mothership” diagram because of the way it looks from a distance, is actually their nightly batch run sequence, and every box on the diagram is a Teradata ELT SQL script of several thousand lines of code. They actually found that 10% of their flows consumed 90% of their batch window. So it’s not that you have to migrate everything: you just start with the 20% and you’ll see a huge amount of benefit immediately.
  • At the end of the day, time and resources consumed by inefficient processes have significant tangible costs. But Hadoop is quickly becoming a disruptive technology that presents a tremendous opportunity for enterprises. The economics of Hadoop compared to the enterprise data warehouse are quite remarkable. Today the fully burdened cost of a terabyte on the data warehouse can vary from $15k on the low end to more than $80k. Enterprises are finding that the cost on Hadoop can be 10 times lower than on the data warehouse. So the question is: how can you take advantage of this opportunity, and where and how do you begin this process? Appfluent and Syncsort together have the complete solution you need.
  • Before we discuss and demonstrate the solution, let us briefly introduce Appfluent and Syncsort. Appfluent is a software company whose mission is to transform the economics of Big Data and Hadoop. Appfluent is the only company that can completely analyze how data is used, and it enables large enterprises across various vertical industries to reduce costs and optimize performance.
  • The Appfluent Visibility product gives you the ability to assess and analyze expensive transformations and workloads, as well as to identify unused data. That analysis can serve as the blueprint for beginning the process of offloading your data warehouse to Hadoop. The product non-intrusively monitors and correlates users’ application activity and ELT processes with data usage and the associated resource consumption. The solution provides this visibility across multiple platforms including Teradata, Oracle/Exadata, DB2/Netezza and Hadoop.
  • So by now some of you may be wondering: who is Syncsort? We are a leading Big Data company, dedicated to helping our customers collect, process and distribute extreme data volumes. We provide the fastest sort technology and the fastest data processing engine on the market, and most recently we released the first truly integrated approach to extract, transform and load data with Hadoop, and even in the cloud. Now, if you have a mainframe in your organization, then you probably know Syncsort, because we run on nearly 50% of the world’s mainframes; we’re the most trusted third-party software for mainframes. But our customers have also been using us for over 10 years to accelerate ETL and ELT processing. Our product has a unique optimizer (similar to a database SQL optimizer) designed specifically to accelerate ETL and ELT processing. Our customers are companies who deal with some of the largest and most sophisticated data volumes. That’s the reason they’ve come to us: because we solve data problems that no one else can.
  • Every organization is trying to build, technically and yet economically, the infrastructure to keep up with modern data by storing and managing it in a single system, regardless of its format. The name people are giving to this is an enterprise data hub, and in most cases it’s based on Hadoop. But to deliver on the business requirements for data, an enterprise data hub requires components to access, offload and accelerate data. It also needs to provide extract, transform and load (ETL)-like functionality, user productivity that doesn’t require a rocket scientist to do simple tasks, and complete enterprise-level security. Syncsort enables all this whether you’re running on Hadoop, cloud, mainframes, Unix, Windows or Linux, and thanks to its unique transformation optimizer it can scale with no manual tuning.
  • Now that you know a little about Appfluent and Syncsort, let’s look at the process for offloading the data warehouse. You begin by using Appfluent to identify expensive transformations as well as dormant data that is loaded unnecessarily into the warehouse. Keep in mind the 80/20 rule here: focus your efforts on identifying the 20% of processing and data that is causing 80% of your capacity constraints. Once you have identified what can be offloaded, you can use Syncsort to rewrite the expensive transformations in DMX-h on Hadoop before loading the data into the data warehouse. You can also move the dormant data to Hadoop and use DMX-h for transforming and loading this data, if you need to keep updating the unused data. This way you can eliminate all of the ELT related to unused data from the data warehouse, run it on Hadoop, and store that data on Hadoop. Finally, this is typically not a one-time event. You can view Hadoop as an extension of your data warehouse, and they will co-exist for the foreseeable future. You can repeat this process continually to maximize performance and minimize the costs of your overall infrastructure.
  • Before we go into a demonstration of the solution, let’s take a look at some of the features that Appfluent provides to get started. Appfluent’s software parses all the activity on your data warehouse at very granular levels of detail. This enables you to obtain actionable information using the Appfluent Visibility web application. First, you can identify the most expensive ELT processes on your system that can be offloaded. Second, since all the SQL activity is parsed, you can identify unused data at a table and column level of granularity over specified time periods. Appfluent also parses the date functions being used to query data, so you can assess the amount of history being queried by users to guide your data retention policies. And finally, in addition to expensive ELT transformations, you can also identify end-user workloads and associated data sets that can be run just as well on Hadoop, freeing up capacity on your data warehouse.
  • Let’s take a look at some real-world examples. In this example, Appfluent was used to identify expensive data extracts being performed by users running SAS on a high-end data warehouse system. As you can see, the Appfluent Visibility web app was used to select applications that have the name ‘sas’ and to focus on workloads that had no ‘constraints’, meaning only data extracts. What we found was that the SAS activity was generated from 5 servers, and just 42 unique SQL statements were consuming over 300 hours of server time. You can then use Appfluent to easily drill down on this information and find details such as what data sets were involved and which users were associated with this activity. What we found was that this activity was related to just 7 tables, accessed by a handful of SAS users whom Appfluent identified. In this way you can identify data sets to offload to Hadoop and redirect the application activity to Hadoop, enabling you to recover wasted data warehouse capacity.
  • The next example shows expensive ELT transformations. In this case, the ELT processes constituted less than 2% of the query workload but were consuming over 60% of CPU and I/O capacity. Think about this skew for a moment! Appfluent can identify the most expensive ELT by both resource consumption and complexity (for example by number of joins, subqueries and other inefficiencies) and provide details about the ELT to enable you to begin the offloading process.
  • Finally, here is an example of identifying unused or dormant data. You can identify unused databases, schemas, tables and even specific fields within tables, over time periods that are relevant for you. In this case, large tables were not only unused, but more data was continuing to be loaded into these tables on a daily basis, wasting ELT processing capacity and taking up storage unnecessarily. These 3 examples hopefully gave you a brief glimpse of how Appfluent provides the first step: exposing relevant information that can be used as a blueprint to begin offloading your data warehouse. Syncsort will now discuss the next two steps in this process.
  • Thanks, Santosh. The second stage in the framework for offloading data and workloads into Hadoop is Access & Move. Once you’ve identified the data, you then have to move it. While Hadoop provides a number of different utilities to move data, the reality is that you will need to use multiple different tools, and they don’t have a graphical user interface, so you’ll end up manually coding all the scripts. And for many critical sources, e.g. mainframe, Hadoop offers no connectivity. Syncsort provides one solution that can access data regardless of where it resides. For example, we have native high-performance connectors to Teradata, DB2, Oracle, IBM mainframes, cloud, Salesforce etc. These connectors allow you to extract data and load it natively into the Hadoop cluster on each node, or to load the data warehouse or marts directly in parallel from Hadoop. We also see a lot of customers pre-processing and compressing the data before loading it into Hadoop. One customer, comScore, who loads 1.5 trillion events (that’s about 40% of internet page views) through our product DMX-h into Hadoop and Greenplum, literally saves terabytes of storage every month just by sorting the data prior to compression.
  • Once the data is in Hadoop, you will need a way to easily replicate the same workloads that previously ran in the data warehouse (typically sorts, joins, CDC, aggregations), but now in Hadoop. Now, sure, you can manually write tons of scripts with HiveQL, Pig and Java, but that means you will have to retrain a lot of your staff to scale the development process. A steep learning curve awaits you, so getting productive will take some time. Besides, why reinvent the wheel when you can easily leverage your existing staff and skills? Syncsort helps you get results quickly and with minimum effort, with an intuitive graphical user interface where you can create sophisticated data flows without writing a single line of code. You can even develop and test locally in Windows before deploying into Hadoop. In addition, we provide a set of Use Case Accelerators for common ETL use cases such as CDC, connectivity, aggregations and more. Finally, once you offload from expensive legacy systems and data warehouses, you need enterprise-grade tools to manage, secure and operationalize the enterprise data hub. With Syncsort you have file-based metadata, which means you can build once and reuse many times. We also provide full integration with management tools such as Cloudera Manager and the Hadoop JobTracker to easily deploy, monitor and administer your Hadoop cluster. And of course, ironclad security with leading support for Kerberos. Putting all these pieces together is what really makes this solution enterprise-ready!
  • Now Santosh and Jeff from Syncsort will do a quick demo of the combined solution.
  • Now that you have seen a brief demo of how you can use Appfluent and Syncsort to offload your data warehouse, let’s talk about some customers who have done this successfully in production systems. A large financial organization we worked with found that their data growth and business needs had begun to grow at a rate that made it economically unsustainable to continue adding more capacity to their enterprise data warehouse. Once they determined that managing data on Hadoop would be more than 5 times cheaper than what it cost them on their data warehouse, they decided to cap the existing capacity on the data warehouse and implemented a strategy to deploy Hadoop to extend it. They started a data warehouse modernization project, systematically analyzed and identified data sets and expensive transformations using Appfluent, and offloaded them to Cloudera. The result was that they successfully capped the existing capacity on the data warehouse. They estimated that if they had not done so, they would have had to spend in excess of $15 million on additional capacity over an 18-month period. Instead, the Hadoop environment, which is now an extension of their data warehouse, costs 6-8 times less in total cost of ownership per terabyte.
  • This is another financial institution, one of the largest in the world. The bank had a significant amount of data hosted and batch processed on Teradata. But for them, like many Teradata customers, the cost was becoming unsustainable and they were faced with yet another multi-million-dollar upgrade. So having heard about Hadoop and the significantly lower cost per GB of data, they decided to migrate a loan marketing application to Cloudera’s distribution of Hadoop. While it proved the viability and massive cost savings of the Hadoop platform, they have hundreds more applications that need to be migrated. The loan application they moved across initially used Hive and HiveQL. It met the SLA but had much slower performance than Teradata and raised many maintainability concerns. The bank sought tools that could leverage existing staff skills (ETL) to facilitate the migration of the remaining applications and avoid the need to add significant staff with new skills (MapReduce). The results were striking. Significantly less development time was required for the DMX-h implementation of the loan project: 12 man-weeks for the HiveQL implementation versus 4 for DMX-h. The process was simplified, with over 140 HiveQL scripts replaced by twelve graphical DMX-h jobs. Most importantly, they reduced the processing time from 6 hours to 15 minutes.
  • So there are three key takeaways. First, you should be aware of the warehouse cost and capacity impacts of ELT and dormant data, and the way they affect your end users. Second, offloading ELT and unused data from your EDW to Hadoop has been proven as the lowest-risk, highest-return first project for your new Hadoop cluster, and the cost savings can justify further Hadoop investment and more moon-shot-like projects. Third, it’s 3 simple steps: Identify, Access and Deploy.
  • By following these simple steps, you can use an enterprise data hub based on Hadoop together with your enterprise data warehouse, Syncsort and Appfluent to deliver something even better than the original vision of the enterprise data warehouse.
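
The Identify step leans on the 80/20 observation from the notes above: a small share of ELT jobs tends to consume most of the batch window. As a rough illustration only, the Python sketch below ranks jobs by runtime and returns the smallest set that accounts for a chosen share of the total. The job names, the runtimes, and the `heaviest_jobs` helper are hypothetical; real figures would come from your scheduler or warehouse logs.

```python
# Hypothetical nightly batch runtimes in minutes, keyed by job name.
RUNTIMES = {
    "load_orders": 240,
    "rebuild_customer_dim": 180,
    "agg_daily_sales": 30,
    "load_returns": 20,
    "refresh_lookup": 10,
    "load_currency_rates": 10,
    "purge_staging": 5,
    "update_calendar": 5,
}


def heaviest_jobs(runtimes, share=0.8):
    """Return the fewest jobs, largest first, whose combined runtime reaches `share` of the total."""
    target = share * sum(runtimes.values())
    picked, running_total = [], 0.0
    # Walk the jobs from most to least expensive until the target is covered.
    for name, minutes in sorted(runtimes.items(), key=lambda item: item[1], reverse=True):
        if running_total >= target:
            break
        picked.append(name)
        running_total += minutes
    return picked


if __name__ == "__main__":
    candidates = heaviest_jobs(RUNTIMES)
    print(f"{len(candidates)} of {len(RUNTIMES)} jobs cover 80% of the batch window:")
    for name in candidates:
        print(" ", name)
```

In this made-up data set, two of the eight jobs account for more than 80% of the total runtime, so they would be the first candidates to offload.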