Today we're in the middle of a shift in how businesses use information. In the past, you'd define a set of business processes, build applications around each of them, and then go about gathering, conforming, and merging the necessary data sets to support those applications. From an infrastructure perspective, you'd be bringing the data over to the compute, often in relational databases. But you'd be leaving quite a lot on the table. The modern realities of business demand a new approach. Today companies need, more than ever, to become information-driven, but given the amount and diversity of information available, and the rate of change in business, it's simply unsustainable to keep moving around and transforming huge volumes of data.
The foundational platform that's addressing this wide range of problems today is Apache Hadoop, an open source platform for scalable, fault-tolerant data storage and processing that runs on a cluster of industry-standard servers. But Hadoop, in the beginning, wasn't capable of solving these problems. Originally, Hadoop was just a scalable distributed system for storing and processing large amounts of data. You could bring workloads to an effectively limitless amount and variety of data, provided the only kind of work you wanted to do was batch processing by writing Java code, and provided you liked hiring highly skilled computer scientists to operate it.
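To make that original programming model concrete, here's a minimal sketch of the kind of Java MapReduce job early Hadoop required for even a simple batch aggregation (a word count; the input and output HDFS paths come from the command line and are purely illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A classic word-count batch job: roughly the only way to process data on early Hadoop.
public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizeMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are HDFS paths supplied on the command line (illustrative).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Even this trivial count requires a full compiled job plus a cluster submission step, which is a big part of why early Hadoop demanded specialist engineers.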
Cloudera solved the latter problem with Cloudera Manager, the leading system management application for Apache Hadoop. Customers love Cloudera Manager because it makes the complex simple. Hadoop is more than a dozen services running across many machines, with limitless configuration permutations. With Cloudera Manager, customers can centrally manage and monitor their clusters from a single tool, and it provides automated installation and configuration of the cluster. Cloudera Manager is really our many years of Hadoop experience realized in software, and it helps you get up and running quickly.
Our customers liked the scalability, flexibility, and economic properties of the platform, but, for example, didn't like that they had to move data out to other MPP analytic databases just to run fast SQL queries, so we built Impala, the world's first open source MPP analytic SQL query engine expressly designed for Hadoop. With Impala, you now have a viable open source alternative to proprietary MPP analytic databases, one that also delivers the core scalability, flexibility, and economic benefits of Hadoop. Now, over the past year we've continued to add to the platform, with Search, and with Spark for interactive, iterative analytics and stream processing. You also get HBase, the online key-value store, to enable real-time applications on the platform. With this range of diverse ways to access your data in Hadoop, far beyond just Java and MapReduce, you can now bring your existing tools and skill sets to the platform. What's even more exciting is that we've recently made it possible for our partners and other third parties to deploy, manage, and monitor their apps in the platform, again leveraging your existing investments while letting you access an even greater breadth and depth of data, all in one place.
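As a rough illustration of bringing existing SQL tools and skills to the platform, the sketch below runs an analytic query against Impala over JDBC. The host name, port, table, and connection options are assumptions for an unsecured cluster; check your own cluster's settings and use the appropriate Impala or Hive JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: running an analytic SQL query against Impala over JDBC.
// Host, port, database, and table names are illustrative assumptions.
public class ImpalaQueryExample {
  public static void main(String[] args) throws Exception {
    // Explicitly load the Hive JDBC driver (it can talk to Impala); newer driver jars self-register.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // 21050 is Impala's usual JDBC port; auth=noSasl assumes an unsecured cluster.
    String url = "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT region, SUM(amount) AS total "
                 + "FROM sales GROUP BY region ORDER BY total DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.printf("%s\t%.2f%n",
            rs.getString("region"), rs.getDouble("total"));
      }
    }
  }
}
```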
Of course, none of this would matter if the platform weren't reliable, secure, and manageable.
* Hadoop today is highly available, and Cloudera provides extensions for automated backup and disaster recovery.
* Hadoop has had perimeter security for some time, but there was a significant gap in the area of fine-grained role-based access controls, the kind you'd expect from a DBMS. That's why, together with the community, we built and contributed the Apache Sentry project, which delivers this security for Hive and Impala today (a short sketch of what those grants look like follows below), and why we developed Cloudera Navigator to support metadata management native to Hadoop, including things like rights auditing, data lineage, and data discovery.
* And all of this comes in addition to the industry-leading system management and customer support you expect from Cloudera.
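To give a feel for the DBMS-style controls Sentry adds, here is a hedged sketch of role-based grants issued through Impala's SQL interface. The role, group, and database names are invented for illustration, the connection URL assumes an unsecured cluster, and the statements assume the connected user has Sentry administrative rights.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: defining Sentry role-based access controls through Impala SQL.
// Role, group, database, and host names are illustrative assumptions.
public class SentryGrantExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement()) {
      // Create a role and map it to an OS/LDAP group of analysts.
      stmt.execute("CREATE ROLE analyst_role");
      stmt.execute("GRANT ROLE analyst_role TO GROUP analysts");
      // Analysts may read the sales database, and nothing else.
      stmt.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst_role");
    }
  }
}
```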
So you can see a lot has happened in just a few short years. Ultimately what you have here is an enterprise data hub, which has four necessary attributes:
* It's Secure and Compliant. In addition to perimeter security and encryption, an EDH offers fine-grained (row- and column-level) role-based access controls over data, just like your data warehouse.
* It's Governed. You need to understand what data is in your EDH and how it's used, so an EDH must offer data discovery, data auditing, and data lineage.
* It's Unified and Manageable. You need to be able to trust that your data is safe, so an EDH must provide not only native high availability, fault tolerance, and self-healing storage, but also automated replication and disaster recovery. It must also provide advanced system management to enable distributed multi-tenant performance.
* And it's Open. Because an EDH makes it possible to cost-effectively retain data for decades, you need to ensure that the foundational infrastructure is based on open source software and an open platform for third parties. Open source ensures that you are not locked into any particular vendor's license agreement; nobody can hold your data or applications hostage. An open platform ensures that you're not locked into a particular vendor's stack and that you have a choice of what tools to use with the EDH; over 200 ISV products, such as Tableau Software, work with Cloudera today.
With an enterprise data hub, our customers are able to store, and drive real business impact from, more data than they'd ever thought possible.
The expansive capabilities of Hadoop and an enterprise data hub – the ability to store, process, and analyze huge quantities of data with varying levels of sensitivity from many different structured, semi-structured, and unstructured sources – require a robust security capability to manage the range of vulnerabilities that may arise. As data proliferates, many new users of different types require access, and many different types of tools will access the data, raising concerns about ongoing management and compliance. Organizations will need to anticipate how they will ensure data quality throughout the information pipeline, enforce controls that guarantee appropriate access and rights, and move from ungoverned data systems to ones with full administration, visibility, and security that allow them to discover, explore, and consume data with full confidence.
Enter Cloudera Navigator, the first fully integrated data management application for Apache Hadoop, designed to provide all of the capabilities required for administrators, data managers, and analysts to secure, govern, classify, and explore the large amounts of diverse data in their Hadoop clusters.
Control – Navigator provides the system and data control necessary for compliance and risk management teams to ensure that their organization's policies extend to critical and sensitive data within Hadoop. IT professionals benefit from the simple, centralized management functions offered by Cloudera Manager, so they gain both system and data control from an integrated end-to-end experience.
Visibility – Navigator establishes a centralized system for verifying access permissions across all files and directories within Hadoop. Administrators and operations teams can validate their usage and data access policies by confirming individual and group rights and access.
Productivity – Analysts, data scientists, and business users easily identify data sets of interest and familiarize themselves with the various structures and formats. As a result, they can more quickly generate insights that benefit the business.
Reliability – Navigator's lineage capabilities offer the ability to visually trace the progression of a data set from its original source(s) to its current state. This gives compliance officers, quality managers, executives, and anyone else concerned with data cleanliness a high degree of confidence in the reliability of the data they use for reporting or to make decisions.
Tableau's mission is to help people see and understand their data. We have had this mission for over 10 years, and we remain completely committed to helping business users discover new insights.
Data discovery has evolved. It has always been part of businesses, but it was typically done on the desktop or in "business server" environments. Business analysts spend most of their time preparing data rather than doing the work itself. Governance was, and still is, broken! Business users print, email, duplicate, and extract data assets from all over the organization… in an attempt to get their job done. The requirements process of traditional BI tools has failed organizations: 1) too slow; 2) requirements change; 3) reliance on a limited few; 4) too inflexible for the needs of the business; 5) costly; and 6) reactive.
We made it for everyone. We made it easy so that anyone would want to adopt it.
Govern This! Data Discovery and the application of data governance with new stack technologies
Data Discovery & the Application of Data Governance
Cloudera and Tableau Software Online Webinar
May 1, 2014
Paul Lilford, Tableau Software
Marc Lobree, Tableau Software
Arlene Boyd, Cloudera
Mark Donsky, Cloudera
Do you use Hadoop for data discovery?
1. Yes, currently use Hadoop
2. No, but planning to start
3. Currently have no plans
Hadoop/EDH Data Management:
Lots of data landing in the enterprise data hub
Huge quantities with varying levels of sensitivity
Many different sources – structured & unstructured
Many users working with the data in multiple ways
Users: Compliance Officers, Analysts, Data Scientists, LOB
Tools: BI tools, ETL tools, Hue, and more
Need to effectively control & consume data
Get visibility & control over the environment
Discover, explore and consume data
Data Management Challenges
Auditing and Access
• View, grant, and revoke permissions across the Hadoop stack (see the sketch after this list)
• Identify access to a data asset around the time of a security breach
• Generate an alert when a restricted data asset is accessed
Lineage
• Given a data set, trace back to the original source
• Understand the downstream impact of purging or modifying a data set
Discovery & Exploration
• Search through metadata to find data sets of interest
• Given a data set, view schema, metadata, and policies
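To ground the "view permissions" challenge above, here is a minimal sketch that uses the Hadoop FileSystem API to list owner, group, and mode for the children of a single HDFS directory (the default path is an assumption). Doing this by hand across an entire stack of services is exactly the gap a centralized tool like Navigator is meant to close.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: listing owner, group, and permissions for everything directly under
// an HDFS directory. The directory path is an illustrative assumption.
public class ListHdfsPermissions {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the core-site.xml on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path root = new Path(args.length > 0 ? args[0] : "/data/sales");

    for (FileStatus status : fs.listStatus(root)) {
      System.out.printf("%s %s:%s %s%n",
          status.getPermission(),   // e.g. rwxr-x---
          status.getOwner(),
          status.getGroup(),
          status.getPath());
    }
  }
}
```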
Data Management Suite for Hadoop and Cloudera’s EDH
Audit & Access – Ensuring appropriate permissions and auditing on data access
Discovery & Exploration – Finding out what data is available and what it looks like
Lineage – Tracing data back to its original source
Enterprise Metadata Repository spanning HDFS, HBase, and Hive
• Support the process of discovery and new insights through direct access to data by subject experts
• LOB subject experts (empowered for their subject area)
• Active IT support and engagement
• Security is still fundamental and data is still protected.
• Flexibility in governance: this is discovery, not production.
• Better-vetted requirements feed production and more highly governed data types.
• Help organizations in the move to become data-driven.
Data Discovery the new way!
But don’t take our word for it!
• The new normal:
• Business Driven
• Ease of use
• Self reliance