Broad adoption of unified Data Lake architectures will require information governance, metadata management, and information lifecycle management capabilities.
Data democracy can be a game changer. But what does it mean to truly implement it? We cannot just dump users into the data lake and expect them to use it. We have to face the fact that the data lake in its native state is almost impossibly complex to navigate for anyone without a technical background. Unlike most traditional warehouses, it is built out of many storage and processing platforms, each with its own lexicon for data manipulation. The 3V’s of Big Data further complicate the problem, as an unbounded variety of data is ingested into the system with no clear use case.
The Data Lake will have increasing amounts of data ingested at scale; if users do not know what is available, it will be useless. They need to find it through different means, such as search, with full governance rather than canned queries. Discovery should cover not just the data but also its context and associated services.
When they find something they are interested in, they should immediately be able to get it, within the bounds of governance, and work with it. Discoverability and accessibility have to go hand in hand; one builds on the other, and neither is useful without the other.
The first step in this direction is Data Discovery. Discovery helps answer the question “How do I find data in the data lake and who can help me understand it?”.
Data lakes by design are meant to ingest and hold all of an organization’s historical and current data. Their vast, cheap storage space eliminates the need to curate data in order to optimize storage. No longer is it necessary to understand what benefit can be extracted from the data and weigh it against storage costs. We can now store everything and keep it indefinitely as we look for new ways of extracting insights from it. Today’s junk data could lead to valuable new information in the future.
This, however, brings new challenges. The same design that lets the Data Lake ingest whatever data is dumped into it also makes its contents challenging to inventory and understand. Gone are the days when a database IT team served as the gatekeeper to storage; data is now created and integrated in a much more decentralized model. As “storing all the data” becomes the new mantra, no single person or team seems to know what all of it is. Gartner estimates that “Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.”
As a user looking to find data relevant to a business problem or just an area of interest, navigating the storage and identifying what is relevant is very difficult. And it is exactly here that Discovery can come to the rescue. A Data Discovery solution can:
- Help organizations find and understand the data through contributions from data stewards as well as by applying data mining techniques
- Establish a consolidated store of meta-information about the data that can answer questions about its technical, business, and governance aspects
- Help users browse and search for data and collaborate with each other to define and understand the data better
Using Data Discovery, users can get a clear and precise answer to any question they have about the Data Lake’s content.
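To make the discovery capabilities above concrete, here is a minimal sketch of a metadata catalog with keyword search. All class and field names are illustrative assumptions, not any specific product’s API; a real catalog would persist its entries and index far richer technical, business, and governance metadata.

```python
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    """One record in a hypothetical discovery catalog."""
    name: str
    owner: str        # data steward responsible for this data set
    tags: list        # business-domain keywords contributed by stewards
    description: str = ""

class DiscoveryCatalog:
    """Toy in-memory metadata store supporting keyword search."""
    def __init__(self):
        self._entries = []

    def register(self, entry: DatasetEntry):
        self._entries.append(entry)

    def search(self, keyword: str):
        """Match the keyword against names, descriptions, and tags."""
        kw = keyword.lower()
        return [e for e in self._entries
                if kw in e.name.lower()
                or kw in e.description.lower()
                or any(kw in t.lower() for t in e.tags)]

catalog = DiscoveryCatalog()
catalog.register(DatasetEntry("sales_2017", "alice", ["revenue", "retail"],
                              "Point-of-sale transactions"))
catalog.register(DatasetEntry("clickstream_raw", "bob", ["web", "events"]))
print([e.name for e in catalog.search("revenue")])  # ['sales_2017']
```

Even this toy version shows the key idea: users query a consolidated metadata store rather than the storage platforms themselves.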
Currently, many ease-of-use aspects are also being treated as part of discovery. We agree that without discovery, usability is hard to achieve, so most such solutions do address it in some way. In our experience, however, the core discovery capabilities are the ones highlighted above, and subsets of them are found in many other solutions. These in turn can power many features we see in the market today. We will examine this in some detail when we discuss the discovery tools.
Once users know what data they need, the next problem is getting it. Data access and provisioning is a well-known paradigm in the world of traditional data warehouses, though it is often very simple compared to what the data lake demands. Data access is often managed through SLA-driven approval processes, and provisioning may be as simple as granting a select privilege on a table. This does not work in the Data Lake. The quantity and variety of data, the multiple storage platforms, and the uncurated nature of the data all make answering key questions about data very difficult. As someone who needs data:
- How do I know who owns a particular data set? Who can grant me access to it?
- I need several data sets for my use case; does that mean I have to negotiate access with several people?
- What are the governance concerns around the data I need? Maybe it needs to be encrypted in storage; who can help me set that up?
- How is it to be provisioned? Can I simply read it from where it is? Does it need to come to a sandbox? Does it need to move across storage types?
Central IT teams can have a very difficult time tracking and mediating between owner and requester, dealing with the provisioning intricacies of each of the several storage platforms, and understanding the nature of the data and which governance policies may apply to it. These problems can seem unsolvable at Data Lake scale, and lead IT to believe that the only way to manage the lake is to keep its usage restricted to predefined use cases.
This however is no solution. The data lake must not be avoided because it is complex. We need to focus on making it accessible. What is needed is a Data Accessibility solution for the Data Lake. Such a solution needs to:
- Help users easily request data
- Keep track of data ownership and applicable governance policies
- Support request tracking, routing, and fulfillment workflows
- Provide tools that can aid or even automate data provisioning
Any Data Accessibility solution should ensure that data is easily accessible while remaining compliant with governance policies, so that users can focus less on logistics and more on being productive. Given the time-sensitive nature of information and the impact it can create for the business, it is also important for this process to be as fast and as automated as possible. Automation has the happy consequence of bringing down IT’s support cost, leading to fewer concerns about an open-to-all data lake.
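The request-tracking, routing, and fulfillment workflow described above can be sketched as a small state machine. This is a hypothetical illustration, assuming invented names like AccessRequest and RequestState; a production system would integrate with identity management and actually grant storage-level privileges at the provisioning step.

```python
from enum import Enum

class RequestState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    PROVISIONED = "provisioned"

class AccessRequest:
    """Tracks one data-access request from submission to fulfillment."""
    def __init__(self, dataset, requester, owner):
        self.dataset = dataset
        self.requester = requester
        self.owner = owner          # request is routed to the data owner
        self.state = RequestState.PENDING
        self.history = [RequestState.PENDING]

    def approve(self, approver):
        """Only the recorded data owner may approve the request."""
        if approver != self.owner:
            raise PermissionError("only the data owner can approve")
        self._transition(RequestState.APPROVED)

    def provision(self):
        """Fulfillment step; a real system would grant privileges here."""
        if self.state is not RequestState.APPROVED:
            raise RuntimeError("cannot provision an unapproved request")
        self._transition(RequestState.PROVISIONED)

    def _transition(self, new_state):
        self.state = new_state
        self.history.append(new_state)   # audit trail for governance

req = AccessRequest("sales_2017", requester="carol", owner="alice")
req.approve("alice")
req.provision()
print(req.state.name)  # PROVISIONED
```

The audit history kept on each request is what lets governance and compliance questions be answered after the fact, without slowing the requester down.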
Lastly, we come to the very important step of making the data usable. After users know what data they need and have access to it, the final challenge lies in doing what they want with it. This too is more complicated in the data lake than in traditional RDBMS-driven warehouses:
- The number of storage and processing platforms is greatly increased
- Relevant data may be fragmented and scattered across a variety of stores
- It’s not just SQL anymore; many more technical paradigms are involved in manipulating the data
- The accuracy, or veracity, of data that has not been curated is also suspect
So how, then, is Data Lake usage to be simplified enough that business users can start feeling comfortable with it? This question does not have a single answer, but we have found that some of the best solutions are:
- Wrapping the technical plumbing of the data lake under abstractions like virtualized, consolidated stores, which provide views depicting the data in the shapes of the business entities they apply to
- Bringing down the learning curve by making the data usable through visual and familiar paradigms: charts and graphs, or languages like SQL
- Wrapping data science algorithms for tasks like data cleansing and statistical analysis into easy-to-use tools that can be put into the hands of business users
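The first abstraction above, presenting fragmented data as a single business entity, can be illustrated in miniature with an ordinary SQL view. This toy uses SQLite standing in for two physically separate stores; the table and view names are invented for the example, and a real virtualization layer would federate across heterogeneous platforms rather than one database.

```python
import sqlite3

# Two physically separate "stores" holding fragments of one business entity.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE crm_customers (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE web_signups (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO crm_customers VALUES (?, ?)",
                [(1, "Acme Corp"), (2, "Globex")])
cur.executemany("INSERT INTO web_signups VALUES (?, ?)",
                [(3, "Initech")])

# The 'customer' view presents a single business entity to the user,
# hiding which underlying store each row actually came from.
cur.execute("""CREATE VIEW customer AS
               SELECT id, name FROM crm_customers
               UNION ALL
               SELECT id, name FROM web_signups""")

row = cur.execute("SELECT COUNT(*) FROM customer").fetchone()
print(row[0])  # 3
```

The business user queries `customer` with familiar SQL and never needs to know that the rows live in different systems, which is exactly the learning-curve reduction the bullet list describes.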
1. Importing the initial schema
   1.1 Metadata is available
   1.2 Partial information is available (table names, file headers, etc.)
   1.3 No information is available
2. Standardizing data type conversion
   2.1 Identify standard data types
   2.2 Automatic data type conversion between different formats
   2.3 Human-assisted, intelligent system
3. Building a data dictionary
   3.1 Authoritative tables, synonyms
   3.2 Attribute and metadata matching
Self-Service Business Intelligence puts the power of analytics in the hands of end users, letting them create their own reports and analyses of the data sets they want, on an as-needed basis. The goal is to utilize data wrangling/blending and other capabilities to reduce IT’s involvement and expedite information to business users, delivering what Gartner refers to as “faster, more user-friendly and more relevant BI.”
It is an evolutionary paradigm that does not indulge the IT-versus-business divide. IT really gets to play the enabler here: reinforcing governance, structuring user autonomy, accounting for user differentiation, and transforming its role from serving the business to offering cross-functional support. It is important to realize that self-service BI should not be considered a replacement for traditional BI tools and warehousing. By utilizing a hybrid approach of centralized and decentralized models, and restructuring the organization accordingly, self-service BI functions best as a supplement to conventional methods, one in which data is accessed more expediently and put in the hands of those who need it most.
How the data landscape is evolving.
Big Data, Fast Data, New Data, and True Data: current EDWs are choking on the cost, capacity, and capability needed to handle these newer types of demands.
They do this to get platforms better suited to advanced analytics with the migrated data (and other specialized workloads). In fact, this movement toward multi-platform data warehouse environments is one of the strongest trends in data architecture today. For these users, the multi-platform environment is the warehouse, not just the relational warehouse platform. The relational warehouse platform continues its life cycle, but only with the data that absolutely requires the mature relational functionality of that platform.
Where Impetus fits in.
Swimming Across the Data Lake: Lessons Learned and Keys to Success