Rethink data management and learn how to break down barriers to Big Data insight with Cloudera's enterprise data hub (EDH), Syncsort offload solutions, and Tableau Software visualization and analytics.
STEVE: So what is this? If you are thinking it’s a 3.5 inch floppy disk and it stored 1.44 MB of your data, you were born before 1998. In 1998 the iMac was launched, and it was the first home computer not to have one of these as standard, just a CD drive. To anyone born after then, this is the save button in most applications, and you have no idea why it’s the save button and certainly would not call it a floppy. So over Christmas my mum was sitting with my 4 year old nephew using his iPad, and there’s clearly some sort of confusion. I see the two of them sitting there trying to figure out which slot on the iPad mum can insert a floppy disk with her Christmas pudding recipe into. I can tell you that getting data from a floppy disk onto an iPad is not fun at all, and my mum is not sure this whole computer thing is really working out for her son or grandchild, because we were largely useless. What’s funny is that I guess the equivalent today of a floppy disk is a memory stick. Today it can store a lot more data, but if I personally want to get a large file from one machine to another, like a Mac, I use Dropbox or Box and it happens instantly and it’s constantly kept in sync. Technology evolution has completely changed our approach to solving a problem, and that’s an important theme.
Steve + Paul: Back when I started my career in data warehousing in the 90’s, this is what the business was promised. An enterprise data warehouse would bring together data from every different source system across an organization to create a single trusted source of information. Data would be extracted, transformed and loaded into the warehouse using ETL tools. These would be used instead of hand coding SQL or COBOL or other scripts because they would provide:
* a graphical user interface that allowed anyone, even a graduate that just joined your team, to develop flows, with no rocket scientists required
* scalability to handle the growing data volumes
* metadata to enable re-use, sharing and governance
* transparent connectivity to the different sources and targets, including mainframe
ETL would then be used to move data from the EDW to marts and deliver it to reporting tools.
Steve + Paul: This is the reality of most data warehouses today. A spaghetti-like architecture has evolved because the market-leading ETL tools couldn’t cope with the data volumes on core operations like sort, join, merge and aggregation, so that workload was pushed into the only place that could handle it: the databases with their optimizer. But that meant ELT, with hand-coded or generated SQL that became impossible to maintain. A customer told me they called this the onion effect, because their staging had become layers of SQL that nobody wanted to touch, so they just added another layer on top. And if you ever really had to take the onion apart, it would make everyone cry. TDWI estimates it takes upwards of 8 weeks to add a column to a table, and in my experience that’s low; most times you have to wait a couple of months before they get to your request and start making the change because of the backlog. Today the average cost of an integration project runs between $250K and $1M, according to Gartner.
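As a rough illustration of the core operations mentioned above (sort, join, aggregation), here is a minimal Python sketch on tiny in-memory data. The table names, field names and helper functions are hypothetical and chosen only for this example; it is not how any particular ETL tool or database optimizer implements them.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical sample data; table and field names are illustrative only.
customers = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
]
orders = [
    {"order_id": 10, "customer_id": 2, "amount": 40.0},
    {"order_id": 11, "customer_id": 1, "amount": 25.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]


def sort_rows(rows, key):
    """Sort: order rows by a single field (stable sort)."""
    return sorted(rows, key=itemgetter(key))


def join_rows(left, right, key):
    """Join: inner hash join of two row lists on a shared field."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Emit one merged row per matching (left, right) pair.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(rows, group_key, value_key):
    """Aggregate: sum a numeric field per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


sorted_orders = sort_rows(orders, "customer_id")
joined = join_rows(sorted_orders, customers, "customer_id")
revenue_by_region = aggregate(joined, "region", "amount")
print(revenue_by_region)
```

Each step is trivial at this size. The point in the notes above is that the same three steps, run over warehouse-scale volumes, are the workload that ended up being pushed into the database as ELT.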
So there’s a massive disconnect between the original vision of the warehouse and the reality. It’s important to note that business users are getting great information from warehouses, but they still want fresher data, longer history, faster analytics and more sources, all at a lower cost. Meanwhile they are seeing longer batch windows; many companies have people sitting around drinking coffee in the morning until the warehouse is available. And they have only a small subset of a customer’s lifecycle.
So the first thing we all need to recognize is that mainframes today play a very important role in many organizations. The top telcos, retailers, insurance, healthcare and financial organizations of the world still rely on mainframes for their most critical applications. When talking to these organizations, it’s not unusual to hear that up to 80% of their corporate data originates on the mainframe. Now, that is some serious Big Data, and organizations cannot afford to neglect it. But can you afford to analyze it? Mainframes today cost an average of $16M a year for the typical $10B organization! That’s why many of these organizations are now looking at Hadoop and making mainframes a core piece of their Big Data strategy. Just imagine for a second the kind of insights that you could get by combining detailed transactional data from mainframes with clickstream data, web logs, and sentiment analysis…
Today we're in the middle of a shift in how businesses use information. In the past, you'd define a set of business processes, build applications around each of them, and then go about gathering, conforming, and merging the necessary data sets to support those applications. From an infrastructure perspective, you'd be bringing the data over to the compute, often in relational databases. But you'd be leaving quite a lot on the table. The modern realities of business demand a new approach. Today companies need, more than ever, to become information-driven, but given the amount and diversity of information available, and the rate of change in business, it's simply unsustainable to keep moving around and transforming huge volumes of data.
The foundational platform that's addressing this wide range of problems today is Apache Hadoop, an open source platform for scalable, fault-tolerant data storage and processing that runs on a cluster of industry-standard servers. But Hadoop, in the beginning, wasn't capable of solving these problems. Originally, Hadoop was just a scalable distributed system for storing and processing large amounts of data. You could bring workloads to an effectively limitless amount and variety of data, provided the only kind of work you wanted to do was batch processing by writing Java code, and provided you liked hiring highly-skilled computer scientists to operate it.
Cloudera solved the latter problem with Cloudera Manager, the leading system management application for Apache Hadoop. Customers love Cloudera Manager because it makes the complex simple. Hadoop is more than a dozen services running across many machines, with limitless configuration permutations. With Cloudera Manager, customers can centrally manage and monitor their clusters from a single tool. It provides automated installation and configuration of your cluster. Cloudera Manager is really our many years of Hadoop experience realized in software, and helps you get up and running quickly.
Our customers liked the scalability, flexibility, and economic properties of the platform, but, for example, didn't like that they had to move data out to other MPP analytic databases just to run fast SQL queries, so we built Impala, the world's first open source MPP analytic SQL query engine expressly designed for Hadoop. With Impala, you now have a viable open source alternative to proprietary MPP analytic databases, one that also delivers the core scalability, flexibility, and economic benefits of Hadoop. Now, over the past year we've continued to add to the platform, with Search, and Spark for interactive iterative analytics and stream processing. You also get HBase, the online key-value store, to enable real-time applications on the platform. With this range of diverse ways to access your data in Hadoop, far beyond just Java and MapReduce, you can now bring your existing tools and skill sets to the platform. What's even more exciting is that we've recently made it possible for our partners and other 3rd parties to deploy, manage, and monitor their apps in the platform, again leveraging your existing investments while letting you access an even greater breadth and depth of data, all in one place.
Of course, none of this would matter if the platform weren't reliable, secure, and manageable.
* Hadoop today is highly available, and Cloudera provides extensions for automated backup and disaster recovery.
* Hadoop has had perimeter security for some time, but there was a significant gap in the area of fine-grained role-based access controls, the kind you'd expect from a DBMS. That's why, together with the community, we built and contributed the Apache Sentry project, which delivers this security for Hive and Impala today, and why we developed Cloudera Navigator to support metadata management, including things like rights auditing, data lineage, and data discovery native to Hadoop.
* And all this is in addition to the industry-leading system management and customer support you expect from Cloudera.
So you can see a lot has happened in just a few short years. Ultimately what you have here is an enterprise data hub, which has four necessary attributes:
* It's Secure and Compliant. In addition to perimeter security and encryption, an EDH offers fine-grained (row and column-level) role-based access controls over data, just like your data warehouse.
* It's Governed. You need to understand what data is in your EDH and how it’s used, so an EDH must offer data discovery, data auditing, and data lineage.
* It's Unified and Manageable. You need to be able to trust that your data is safe, so an EDH must provide not only native high-availability, fault-tolerance and self-healing storage, but also automated replication and disaster recovery. It must also provide advanced system management to enable distributed multi-tenant performance.
* And it's Open. As an EDH makes it possible to cost-effectively retain data for decades, you need to ensure that the foundational infrastructure is based on open source software and an open platform for 3rd parties. Open source ensures that you are not locked in to any particular vendor’s license agreement; nobody can hold your data or applications hostage. An open platform ensures that you’re not locked into a particular vendor’s stack and that you have a choice of what tools to use with the EDH; for example, over 200 ISV products, in particular Syncsort and Tableau, work with Cloudera today.
With an enterprise data hub, our customers are able to store, and drive real business impact from, more data than they'd ever thought possible.
And beyond just the technology, Cloudera provides everything you need to be successful with Hadoop in the enterprise, including training, professional services, the backing of the industry’s only predictive and proactive global support team, and partnership with the experts who actually build Hadoop. So where do you begin? An enterprise data hub offers the utmost flexibility to start small while thinking big. Many organizations start by using an EDH for storage or active archiving, or to accelerate ETL by offloading that processing from their data warehouse or mainframe environment. Others use an EDH to enable rapid exploration of new and interesting data sets that don’t fit well into relational systems. The best part of an EDH is that, regardless of where you start, the flexibility of the platform allows you to evolve it over time and move from one use case to another, so in the end you have transformed your data management infrastructure to enable your enterprise to become information-driven. You can get started for free today by visiting cloudera.com.
So this is the “Before” BI architecture. Data sources feed into a staging layer that has ETL and ELT, but that ELT is using up valuable database resources needed for delivering data out to BI tools. And business users experience the long wait, with an average of 8 weeks to add a single column.
ELT consumes capacity
* Slow response times
* Up to 80% of capacity used for ELT; fewer resources and less storage available for end user reports
Only freshest data is stored “on-line”
* Historical data archived (as low as 3 months)
* Granularity is lost
* Hot / Warm / Cold / Dead
Lack of agility
* 6 months (average) to add a new data source / column & generate a new report
* Best resources on SQL tuning, not new SQL creation
Constant upgrades
* Data volume growth absorbs all resources to keep existing analysis running / perform upgrades
* Exploration of data a wish list item
Data warehousing as a practice has no linkage to a particular technology.
Tableau’s mission is to help people see and understand their data. We have had this mission for over 10 years, and remain completely committed to helping business users discover new insights.
The volume of data is a challenge that faces all customers today. Too much data, too many people needing it. We can see from this chart produced by IDC that the growth of data is going to continue to skyrocket in the coming years.
Next is the issue of the diversity of data. It’s tough when there are so many sources.
Finally, even if you have your big data under control and know how it belongs together, you’re dealing with old school software – hard to use, heavy, complex.
That’s what sparked “The Tableau Revolution”: a new type of business intelligence platform, one that was built from the ground up by people focused on making data easier to make sense of. We started by making it intuitive. We wanted you to be able to mash up any type of data. Slice it, filter it, scan it, select it, parse it. We wanted it fast. And more than anything we wanted you to leverage the data from its source. This meant you’d no longer need silos, armies of engineers, high priests, lots of time, software customizations or stale reports.
We made it flexible. First we give you the option to connect to any kind of data, whether that is in spreadsheets and files, databases and cubes, or in a data warehouse. We also give you the option to connect to your data live or to pull it in memory. If you have data that updates a lot, you’ll want to always have the freshest data. Use a live connection. Or maybe your company has invested a ton of dollars in a fast, state-of-the-art database. You’ll want to leverage that. You can choose either, or you can even toggle between the two, switching between live and extracts as you go. Tableau is flexible and allows you to work with any data in the way that makes sense for your environment.
We made it for everyone. We made it easy so that anyone would want to adopt it.
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight