The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This session will discuss hidden design assumptions, review design principles to apply when building multi-use data infrastructure, and provide a reference architecture to use as you work to unify your analytics infrastructure.
The focus in our market has been on acquiring technology, and that ignores the more important part: the larger IT landscape within which this technology lives and the data architecture that lies at its core. If one expects longevity from a platform then it should be a designed rather than accidental architecture.
Architecture is more than just software. It starts from use and includes the data, technology, methods of building and maintaining, and organization of people. What are the design principles that lead to good design and a functional data architecture? What are the assumptions that limit older approaches? How can one integrate with, migrate from or modernize an existing data environment? How will this affect an organization's data management practices? This tutorial will help you answer these questions.
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Analytic workload characteristics and constraints
* Data architecture
* Functional architecture
* Tradeoffs between different classes of technology
* Technology planning assumptions and guidance
The Black Box: Interpretability, Reproducibility, and Data Management
Mark Madsen
The growing complexity of data science leads to black box solutions that few people in an organization understand. You often hear about the difficulty of interpretability—explaining how an analytic model works—and that you need it to deploy models. But people use many black boxes without understanding them…if they’re reliable. It’s when the black box becomes unreliable that people lose trust.
Mistrust is more likely to be created by the lack of reliability, and the lack of reliability is often the result of misunderstanding essential elements of analytics infrastructure and practice. The concept of reproducibility—the ability to get the same results given the same information—extends your view to include the environment and the data used to build and execute models.
Mark Madsen examines reproducibility and the areas that underlie production analytics and explores the most frequently ignored and yet most essential capability, data management. The industry needs to consider its practices so that systems are more transparent and reliable, improving trust and increasing the likelihood that your analytic solutions will succeed.
This talk will treat the black boxes of ML the way management perceives them: as black boxes.
There is much work on explainable models, interpretability, and related topics that matters for reproducibility. Much of it is relevant to the practitioner, but practitioners can become too focused on the parts they are most familiar with. Reproducing results requires more.
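To make the reproducibility idea concrete, here is a minimal sketch in Python of what a training run might record beyond the model itself: a pinned random seed, a fingerprint of the exact input data, and the library versions used. The file names and manifest fields are illustrative assumptions, not a prescribed format.

```python
# Minimal sketch of reproducible model training: pin the randomness,
# fingerprint the input data, and record the environment alongside the
# model. File names and the CSV schema here are illustrative only.
import hashlib
import json
import platform
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

def fingerprint(path: str) -> str:
    """Hash the raw bytes of the training data so the exact input
    can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    "seed": SEED,
    "data_sha256": fingerprint("training_data.csv"),  # hypothetical file
    "python": platform.python_version(),
    "numpy": np.__version__,
}

# ... train the model here, then store the manifest next to it ...
with open("model_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```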
Operationalizing Machine Learning in the Enterprise
Mark Madsen
TDWI Munich 2019
What does it take to operationalize machine learning and AI in an enterprise setting?
Machine learning in an enterprise setting is difficult, but it seems easy. All you need is some smart people, some tools, and some data. In reality, it's a long way from the environment needed to build ML applications to the environment needed to run them in an enterprise.
Most of what we know about production ML and AI comes from the world of web and digital startups and consumer services, where ML is a core part of the services they provide. These companies have fewer constraints than most enterprises do.
This session describes the nature of ML and AI applications and the overall environment they operate in, explains some important concepts about production operations, and offers some observations and advice for anyone trying to build and deploy such systems.
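As a rough illustration of the build/run gap the session describes, here is a sketch of the operational shell that production scoring typically adds around a model: input validation, logging, and a pinned model artifact. The model file, feature names, and API are hypothetical.

```python
# Sketch of the operational shell around a trained model: validate inputs,
# log every scoring call, and load a pinned model version. The model file,
# feature list, and loading mechanism are illustrative assumptions.
import logging
import pickle

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scoring")

MODEL_PATH = "models/churn_v3.pkl"  # hypothetical, pinned artifact
EXPECTED_FEATURES = ["tenure_days", "monthly_spend", "support_tickets"]

with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)

def score(record: dict) -> float:
    """Validate, score, and log a single record."""
    missing = [k for k in EXPECTED_FEATURES if k not in record]
    if missing:
        raise ValueError(f"missing features: {missing}")
    features = [[record[k] for k in EXPECTED_FEATURES]]
    prediction = float(model.predict_proba(features)[0][1])
    log.info("scored record with model %s -> %.3f", MODEL_PATH, prediction)
    return prediction
```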
Pay no attention to the man behind the curtain - the unseen work behind data ...
Mark Madsen
Goal: explain the nature of the work of an analytics team to a manager, and enable people on those teams to explain what a data science team needs to a manager.
It seems as if every organization wants to enable analytical decision-making and embed analytics into operational processes. What can you do with analytics? It looks like anything is possible. What can you really do? Probably a lot less than you expect. Why is this? Vendors promise easy-to-use analytics tools and services but they rarely deliver. The products may be easy but the work is still hard.
Using analytics to solve problems depends on many factors beyond the math: people, processes, the skills of the analyst, the technology used, the data. Technology is the easy part. Figuring out what to do and how to do it is a lot harder. Despite this, fancy new tools get all the attention and budget.
People and data are the truly hard parts. People, because many believe that data is absolute rather than relative, and that analytic models produce an answer rather than a range of answers with varying degrees of truth, accuracy and applicability. Data, because managing data for analytics is a nuanced, detail-oriented and seemingly dull task left to back-office IT.
If your goal is to build a repeatable analytics capability rather than a one-off analytics project then you will need to address the parts that are rarely mentioned. This talk will explain some of the unseen and little-discussed aspects involved when building and deploying analytics.
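One of the little-discussed points above, that models produce a range of answers rather than an answer, can be shown in a few lines. This bootstrap sketch (synthetic data, plain numpy) reports an interval rather than a point estimate; it is an illustration of the idea, not anything from the talk itself.

```python
# Sketch: the "answer" from an analysis is a range, not a point.
# Bootstrap resampling of synthetic data shows the spread you get
# from sampling variation alone.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(loc=100.0, scale=15.0, size=500)  # synthetic data

estimates = [
    rng.choice(observed, size=observed.size, replace=True).mean()
    for _ in range(2000)
]
low, high = np.percentile(estimates, [2.5, 97.5])
print(f"point estimate: {observed.mean():.1f}")
print(f"95% interval:   {low:.1f} .. {high:.1f}")
```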
Solve User Problems: Data Architecture for Humans
Mark Madsen
We are bombarded with stories of the latest products to hit the market – products that will change everything we do. This causes us to focus on the latest technology, building IT for the sake of building IT. Meanwhile, the world still seems to run on Excel.
The “big innovators” who have and use unimaginably large amounts of data are not the norm. Aspiring to use the same complex technologies and patterns they do leads to poor investments and tradeoffs. This is an age-old problem rooted in the over-emphasis of technology as the agent of change. Technology isn’t the answer – it’s the platform on which people build answers.
To emphasize technology is to ignore the way tools change people and practices. The design focus in our market was on storing and making data accessible. If we want to make progress then we need to step back from the details and look at data from the perspective of the organization. Our design focus shifts to people learning and applying new insights, asking questions about how an organization can be more resilient, more efficient, or faster to sense and respond to changing conditions.
In this talk you will learn how to put your data architecture into a human frame of reference. Drawing inspiration from the history of technology and urban planning, we will see that the services provided by the things we build are what drive success, not the latest shiny distraction.
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Mark Madsen
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This tutorial covers design assumptions, design principles, and how to approach the architecture and planning for multi-use data infrastructure in IT.
Long:
The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This session will discuss hidden design assumptions, review design principles to apply when building multi-use data infrastructure, and provide a reference architecture to use as you work to unify your analytics infrastructure.
The focus in our market has been on acquiring technology, and that ignores the more important part: the larger IT landscape within which this technology lives and the data architecture that lies at its core. If one expects longevity from a platform then it should be a designed rather than accidental architecture.
Architecture is more than just software. It starts from use and includes the data, technology, methods of building and maintaining, and organization of people. What are the design principles that lead to good design and a functional data architecture? What are the assumptions that limit older approaches? How can one integrate with, migrate from or modernize an existing data environment? How will this affect an organization's data management practices? This tutorial will help you answer these questions.
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Data architecture
* Functional architecture
* Technology planning assumptions and guidance
How to understand trends in the data & software market
Mark Madsen
The big challenge most analytics and IT professionals face today is dealing with complexity. Trends are still not clear. It helps to look at the past and current state to understand what’s really happening in the data technology market – a whole lot of reinvention and some innovation, but not where you expect it.
We have the (well-understood) problems that we have, with their (well-understood) limitations and intractabilities.
We deal with them in the world in which they were first codified and framed. Paradigms (world views) change as a function of political, economic, technological, and cultural forces, as well as patterns of use and growth. When the world changes, we'll have criteria for framing not just the problems/shortcomings/intractabilities of the prior paradigm, but that paradigm itself.
At that point, however, it will have ceased to matter because we’ll be dealing with fundamentally new problems/shortcomings/intractabilities.
Building a Data Platform Strata SF 2019
Mark Madsen
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This tutorial covers design assumptions, design principles, and how to approach the architecture and planning for multi-use data infrastructure in IT.
[This is a new, changed version of the presentations of the same title from last year's Strata]
Data Architecture: OMG It’s Made of People
Mark Madsen
Do you have data? Do you have users? Do they use that data to solve problems? Then you have a data architecture. Maybe your architecture is organic and accidental, or maybe it’s an accumulation of the latest practices and technologies you heard about on Stack Overflow.
Spoiler: data architecture is about people and how they use data, not the latest pipeline framework or AI model. Data architecture is about enabling users to be productive, not adding the next “shiny object” and then blaming the users for using it wrong. What you design needs to focus on a different subject than either technology or data.
Join Kevin Bogusch, Ecosystem Architect, as he talks with Mark Madsen, Fellow at the Technology Innovation Office, on the crucial elements you’re missing in a successful data architecture: people and process. Find out why Mark says, “don’t buy one problem to solve another problem.”
Assumptions about Data and Analysis: Briefing room webcast slides
Mark Madsen
In many ways, moving data is like moving furniture: it's an unpleasant process regarded as an occasional necessary evil. But as the data pipelines of old decay, a new reality is taking shape: the data-native architecture. Unlike traditional data processing for BI and Analytics, this approach works on data right where it lives, thus eliminating the pain of forklifting, narrowing the margin of error, and expediting the time to business benefit. The new architecture embodies new assumptions, some of which we will talk about here.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain why this shift is truly tectonic. He'll be briefed by Steve Wooledge of Arcadia Data who will showcase his company's technology, which leverages a data-native architecture to fuel rapid-fire visualization and analysis of both big data and small.
Strata Data Conference 2019: Scaling Visualization for Big Data in the Cloud
Jaipaul Agonus
This deck deals with scaling visualizations for big data in the cloud.
It approaches the problem on two fronts: first the engineering side, looking at different scaling strategies that can be used with cloud resources; then the strategies used for turning data into visualizations, including proven visualization blueprints for market surveillance.
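As a sketch of one common scaling strategy implied here (an assumption about the deck's content, not code from it), the server can pre-aggregate a large series into a fixed number of buckets so the client never receives the raw points:

```python
# Sketch: server-side pre-aggregation before visualization. Reducing
# millions of points to a fixed number of buckets is one common scaling
# strategy; the data here is synthetic.
import numpy as np

points = np.random.default_rng(1).normal(size=2_000_000)  # raw "big" series

N_BUCKETS = 500
counts, edges = np.histogram(points, bins=N_BUCKETS)

# The client now renders 500 (bucket, count) pairs instead of 2M points.
print(len(counts), "buckets; max bucket =", counts.max())
```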
Everything Has Changed Except Us: Modernizing the Data Warehouse
Mark Madsen
Keynote, Munich, June 2016
The way we make decisions has changed. The data we use has changed. The techniques we can apply to data and decisions have changed. Yet what we build and how we build it has barely changed in 20 years.
The definition of madness is doing more of what you already do and expecting different results. The threat to the data warehouse is not from new technology that will replace the data warehouse. It is from destabilization caused by new technology as it changes the architecture, and from failure to adapt to those changes.
The technology that we use is problematic because it constrains and sometimes prevents necessary activities. We don’t need more technology and bigger machines. We need different technology that does different things. More product features from the same vendors won’t solve the problem.
The data we want to use is challenging. We can’t model and clean and maintain it fast enough. We don’t need more data modeling to solve this problem. We need less modeling and more metadata.
And lastly, a change in scale has occurred. It isn’t a simple problem of “big”. The problem with current workloads has been solved, despite the performance problems that many people still have today. Scale has many dimensions – important among them are the number of discrete sources and structures, the rate of change of individual structures, the rate of change in data use, the variety of uses and the concurrency of those uses.
In short, we need new architecture that is not focused on creating stability in data, but one that is adaptable to continuous and rapidly changing uses of data.
Analytics 3.0: Measurable business impact from analytics & big data
Microsoft
Presentation from the Harvard Business Review event on Analytics and Big Data
(October 15, 2013)
"Featuring analytics expert Tom Davenport, author of Competing on Analytics, Analytics at Work, and the just-released Keeping Up with the Quants" 
These are the slides from the Gramener webinar conducted on 16-Jan-2020.
- What skills & roles will help you deliver your analytics and data visualization projects?
- What skills do most teams fail to hire for?
In a Gartner survey, CIOs reported 'team skills' as their biggest barrier ⚠️ to data science. They have trouble deciding the skill mix ⚗️ needed, or finding the right people for the job.
This webinar will show the skills and roles you must plan for. You will learn how to tailor this based on your organization's data maturity. It will help you decide whether to upskill teams or hire externally. The session will show you how and where to find talent.
Throughout the webinar you will learn:
- Critical skills & roles needed in your data science team
- Tips for data science hiring, and what aspirants should know about the jobs
- Insights presented using real-world examples
This session describes the roles and skill sets required when building a Data Science team, and starting a data science initiative, including how to develop Data Science capabilities, select suitable organizational models for Data Science teams, and understand the role of executive engagement for enhancing analytical maturity at an organization.
After this session you will be able to:
Objective 1: Understand the knowledge and skills needed for a Data Science team and how to acquire them.
Objective 2: Learn about the different organizational models for forming a Data Science team and how to choose the best one for your organization.
Objective 3: Understand the importance of executive support for Data Science initiatives and the role it plays in their successful deployment.
What you will learn:
GOALS - What is the bar for data science teams
PITFALLS - What are common data science struggles
DIAGNOSES - Why so many of our efforts fail to deliver value
RECOMMENDATIONS - How to address these struggles with best practices
Presented by Mac Steele
Director of Product at Domino Data Lab
Disruptive Innovation: how do you use these theories to manage your IT?
Mark Madsen
The term disruptive innovation was popularized by Harvard professor Clayton Christensen in his 1997 book “The Innovator’s Dilemma.” Nearly 20 years later “Disrupt!” is a popular leadership mantra that is more frequently uttered than experienced. You can't productize it. You can't always control it – at least what effects it has in practice. You aren't necessarily going to like every product of innovation. So are you sure you want it? If so, how do you promote a culture in which innovation can flower – and, potentially, thrive? Because that's probably the best that you can do.
Perhaps there's a better framing for innovation than just "disruption." This session is an overview of commoditization and innovation theories, followed by basic things you can do to apply that theory to your daily job of architecting, choosing, and managing a data environment in your company.
Intro to Data Science for Non-Data Scientists
Sri Ambati
Erin LeDell and Chen Huang's presentations from the Intro to Data Science for Non-Data Scientists Meetup at H2O HQ on 08.20.15
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
This talk is an introduction to Data Science. It explains Data Science from two perspectives - as a profession and as a discipline. While covering the benefits of Data Science for business, it explains how to get started with embracing data science in business.
Big data is a big part of the disruption hitting this market, but not in the way most people think. It's not replacing the data warehouse, but it is changing the technology stack. It doesn't eliminate data management, but it does redefine enterprise data architecture. Big data is and isn't many things. It's important to understand which information uses are well supported and which have yet to be addressed. Otherwise you risk replacing one set of problems with another. Come to this session to hear some observations on what big data is, isn't and aspires to be.
A video is available, starts at 1:03 into this Strata online event: http://www.youtube.com/watch?v=gLsHI1ZglKw
BI isn't big data and big data isn't BI (updated)
Mark Madsen
Big data is hyped, but isn't hype. There are definite technical, process and business differences in the big data market when compared to BI and data warehousing, but they are often poorly understood or explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they are different. Armed with this knowledge, you will be better able to make design and architecture decisions. The session is sometimes conceptual, sometimes detailed technical explorations of data, processing and technology, but promises to be entertaining regardless of the level.
Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.
The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.
The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.
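For a feel of the data types the first part covers, here is a sketch of flattening semi-structured JSON events into warehouse-style rows; the event schema is invented for illustration, and real clickstream data is messier still:

```python
# Sketch: flattening semi-structured events into tabular rows.
# The event schema is invented for illustration; real clickstream or
# log data is messier (optional fields, nesting, schema drift).
import json

raw_events = [
    '{"user": "u1", "action": "view", "meta": {"page": "/home", "ms": 132}}',
    '{"user": "u2", "action": "click", "meta": {"page": "/buy"}}',
    '{"user": "u1", "action": "view"}',  # meta missing entirely
]

rows = []
for line in raw_events:
    e = json.loads(line)
    meta = e.get("meta", {})
    rows.append({
        "user": e["user"],
        "action": e["action"],
        "page": meta.get("page"),      # may be NULL in the warehouse
        "latency_ms": meta.get("ms"),  # ditto
    })

for r in rows:
    print(r)
```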
The pioneers in the big data space have battle scars and have learnt many of the lessons in this report the hard way. But if you are a general manager and just embarking on the big data journey, you should now have what they call the 'second mover advantage'. My hope is that this report helps you better leverage your second mover advantage. The goal here is to shed some light on the people and process issues in building a central big data analytics function.
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
Vasu S
Real-world data science practitioners offer perspectives and advice on six common Machine Learning problems
https://www.qubole.com/resources/ebooks/oreilly-ebook-machine-learning-at-enterprise-scale
Briefing room: An alternative for streaming data collection
Mark Madsen
Knowing what’s happening in your enterprise right now can mark the difference between success and failure. The key is to have a rich view of activity, such that analysts and others can explore in a fully multidimensional fashion. Benefiting from such a detailed perspective can help professionals identify the exact nature of problems or opportunities, thus enabling precise actions that make a difference quickly.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain how a nexus of innovations for analyzing network traffic can help companies stay on top of their game. He’ll be briefed by Erik Giesa of ExtraHop, who will showcase his company’s stream analytics technology for wire data, which provides real-time, multidimensional views of network traffic. He’ll share success stories of how ExtraHop has solved otherwise intractable problems and enabled a new level of root-cause analysis.
The Role of Data Wrangling in Driving Hadoop Adoption
Inside Analysis
The Briefing Room with Mark Madsen and Trifacta
Live Webcast September 1, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=eb655874d04ba7d560be87a9d906dd2fd
Like all enterprise software solutions, Hadoop must deliver business value in order to be a success. Much of the innovation around the big data industry these days therefore addresses usability. While there will always be a technical side to the Hadoop equation, the need for user-friendly tools to manage the data will continue to focus on business users. That’s why self-service data preparation or "data wrangling" is a serious and growing trend, one which promises to move Hadoop beyond the early adopter phase and more into the mainstream of business.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain why business users will play an increasingly important role in the evolution of big data. He’ll be briefed by Trifacta's Will Davis and Alon Bartur, who will demonstrate how Trifacta's solution empowers business users to “wrangle" data of all shapes and sizes faster and easier than ever before. They’ll discuss why a new approach to accessing and preparing diverse data is required and how it can accelerate and broaden the use of big data within organizations.
Visit InsideAnalysis.com for more information.
Whether you are interested in healthcare data analytics or looking to get started with big data and marketing, these fundamental principles from data experts will contribute to your success. http://www.qubole.com/new-series-big-data-tips/
From Lab to Factory: Or how to turn data into value
Peadar Coyle
We've all heard of 'big data' and data science, but how do we convert these trends into actual business value? I share case studies and technology tips, and talk about the challenges of the data science process. This is all based on two years of in-the-field research on deploying models and going from prototypes to production.
These are slides from my talk at PyCon Ireland 2015
Best Practices for Scaling Data Science Across the Organization
Chasity Gibson
Effective data science in the enterprise is about aligning the right model, data, and infrastructure with the right outcomes. Most organizations today struggle to unlock the potential of data science to enhance decision-making and drive business value.
Join Forrester and Anaconda for a webinar to learn best practices for scaling data science across your entire organization. Guest speaker Kjell Carlsson, a Forrester Senior Analyst, and Peter Wang, Anaconda CTO, will share their unique perspectives on how to tackle five key challenges facing organizations today:
- Identifying, defining, and prioritizing valuable problems
- Building the right teams
- Leveraging the proper tools and platforms
- Iterating and deploying effectively
- Reaching end-users to generate value
The Right Data Warehouse: Automation Now, Business Value Thereafter
Inside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” for collecting interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for business value – and how do we harness them? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
Enabling data scientists within an enterprise requires a well-thought-out approach from an organization, technology, and business results perspective. In this talk, Tim and Hussain will share common pitfalls to data science enablement in the enterprise and provide their recommendations to avoid them. Taking an actionable example use case from the financial services industry, they will focus on how Anaconda plays a pivotal role in setting up big data infrastructure, integrating data science experimentation and production environments, and deploying insights to production. Along the way, they will highlight opportunities for leveraging open source and unleashing data science teams while meeting regulatory and compliance challenges.
The 3 Key Barriers Keeping Companies from Deploying Data Products
Dataiku
Getting from raw data to deploying data-driven solutions requires technology, data, and people, all of which exist. So why aren't we seeing more truly data-driven companies: what's missing and why? During Strata Hadoop World Singapore 2015, Pauline Brown, Director of Marketing at Dataiku, explains how a lack of collaboration is keeping companies from building and deploying data products effectively. Learn more about Dataiku and Data Science Studio: www.dataiku.com
Python's Role in the Future of Data Analysis
Peter Wang
Why is "big data" a challenge, and what roles do high-level languages like Python have to play in this space?
The video of this talk is at: https://vimeo.com/79826022
Topics: introduction to data mining; why data mining; applications of data mining; steps in data mining; threats from data mining; solutions; the role of data mining; data warehouses; OLTP & OLAP; data mining tools; and the latest research.
Data Scientist has been regarded as the sexiest job of the twenty-first century. As data in every industry keeps growing, the need to organize, explore, analyze, predict, and summarize it is insatiable. Data Science is creating new paradigms in data-driven business decisions. As the field emerges from its infancy, a wide range of skill sets are becoming an integral part of being a Data Scientist. In this talk I will discuss the different data-driven roles and the expertise required to be successful in them. I will highlight some of the unique challenges and rewards of working in a young and dynamic field.
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
Mark Madsen
A hotspot of diversity for rare plants, butterflies and birds, the Klamath-Siskiyou region of southern Oregon is a scientist's (and naturalist's) paradise. It is a transverse range running from the Cascades to the Pacific Ocean, creating an east-west corridor between the coast and the volcanic Cascades. Mark Madsen’s love of biology during 15 years living in the area sparked an interest in the botanical taxonomy of serpentine soils and the plant communities thriving in the region, including remnant species from the last ice age.
A Pragmatic Approach to Analyzing Customers
Mark Madsen
The business market is different today than it was 20 years ago when BI got started. We're just beginning to grasp how to work within the new economic and communication models. Companies can't rely solely on financial and operational metrics any more, and need to analyze customer behaviors in more detail.
The big change in analysis is a move from mass market metrics to individualized data, no longer analyzing or managing by averages. The stream of events and observations available from applications today combined with new platforms for collecting and processing data enables (relatively) easy analysis.
Despite this, many companies struggle to analyze customer data. This talk will describe a handful of customer metrics and models that are (relatively) easy to do, yet are often not done. It's often easier to succeed by stringing together a handful of simple techniques rather than applying advanced techniques.
Expect to come away from this session with:
- a little history of customer data use by marketing and how that has changed in the last 10 years.
- the most common behavioral data sources you have available.
- some of the basic questions that often go unanswered, and data that is not assessed in the proper context.
- some basic analyses you can perform.
Building the Enterprise Data Lake: A look at architecturemark madsen
The topic is building an Enterprise Data Lake, discussing high level data and technology architecture. We will describe the architecture of a data warehouse, how a data lake needs to differ, and show a high level functional and data architecture for a data lake. This webinar will cover:
Why dumping data into Hadoop and letting users get it out doesn't work
The difference between a Hadoop application and a Data Lake
Why new ideas about data architecture are a key element
An Enterprise Data Lake reference architecture to frame what must be built
Slides for Briefing Room webcast ( https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=869f964b1380f728cedde802779a1e12 )
Organizations worldwide are learning hard lessons these days about the constraints of dated information systems. The time-tested process of Extract-Transform-Load (ETL) is fast losing its ability to cope with the volume, velocity and variety of Big Data coming down the pike. Forward-thinking companies are therefore prepping the battlefield by designing on-ramps to the future of streaming analytics. Register for this episode of The Briefing Room to hear Analyst Mark Madsen explain how a new era of data solutions is rising to the challenge of streaming data. He'll be briefed by Steve Wilkes, founder and CTO of the Striim platform. Steve will share how enterprises are turning to streaming data integration, in-memory transformations and continuous processing to achieve the goals of ETL in milliseconds – at a fraction of the cost and complexity of legacy systems. Several case studies will be shared.
On the edge: analytics for the modern enterprise (analyst comments)mark madsen
On the Edge: Analytics for the Modern Enterprise
[these are the analyst comments on enterprise data architecture and streaming]
Webcast description: The speed of business today requires new approaches to generating and leveraging analytics. Latencies of a day, an hour or even minutes no longer suffice in many situations. For these use cases, organizations must embrace analytics at the edge: a process that involves targeted number-crunching at the fringe of the enterprise. When designed properly, these systems give companies a leg up on their competitors. Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain how a new era of information architectures is now unfolding, paving the way to much more responsive and agile business models. He'll be briefed by Kim Macpherson of the Cisco Data and Analytics Business Unit, who will explain how her company's platform is uniquely suited for this new, federated analytic paradigm. She'll demonstrate how edge analytics can help companies address opportunities quickly and effectively.
Crossing the chasm with a high performance dynamically scalable open source p...mark madsen
Many open source projects are seeking commercial backing and growing into businesses – e.g. Spark, Cassandra, Hadoop. How do you market your project or product to investors or to enterprises? What should your pitch deck or customer presentation look like? This talk will show you the archetype (based on watching hundreds of these presentations) and exactly how not to do a presentation like this.
Intro:
Mark Madsen’s exposure to the software industry can be measured in terapoints (that’s 1 million powerpoints). He received his Master’s degree in Science from an unnamed university in 1993 and has been doing what is described in technical jargon as “a lot of stuff” since then. With knowledge as broad as the wide Mississippi, his qualifications need not be divulged, and certainly not investigated. He doesn't want anyone to make a fuss over him—just treat him as you would any great man. Please welcome Mark Madsen
Don't let data get in the way of a good storymark madsen
Storytelling is not about raising someone’s IQ, it’s about raising their blood pressure. Stories engage emotions rather than intellect, making “storytelling with data” a poor metaphor for data visualization when our goal is to communicate clearly.
People are often confused or misled by “story”, thinking they need a classical story structure with protagonists, action and resolution when the job may be simpler, or more complicated. Some of the storytelling tools and suggestions vendors promote would get you kicked out of your boss’s office if you used them without taking into account their goals and context.
Narrative is what we are really talking about, not story. We need to focus our attention on narrative techniques rather than “story” and its forced linear structure. This means understanding why we want to communicate: is it to explain, to build shared understanding, to convince others that our interpretation is the right one?
We use visualization as a tool for many different purposes, communication being one. The idea of narratives with data is a good one, but not all narrative is story. The purpose of this talk is to provide clarity around the goals of communicating with data and to provide a goal-oriented framework that escapes the bad metaphorical frame imposed by “storytelling”.
Data lakes, data exhaust, web scale, data is the new oil. Vendors are throwing new terms and analogies at us to convince us to buy their products as the market around data technologies grows. We change data persistence and transaction layers because "databases don't scale" or because data is "unstructured". If data had no structure then it wouldn't be data, it would be noise. Schema on read, schema on write, schemaless databases; they imply structure underlying the data. All data has schema, but that word may not mean what you think it means.
This presentation will describe concepts of data storage and retrieval from technology prehistory (i.e. before the 1980s) and examine the design principles behind both old and new technology for managing data because sometimes post-relational is actually pre-relational. It is important to separate what is identical to things that were tried in the past from new twists on old topics that deliver new capabilities.
Directly related to these topics are performance, scalability and the realities of what organizations do with data over time. All of these topics should guide architecture decisions to avoid the trap of creating technical debts that must be paid later, after systems are in place and change is difficult.
The problems of scale, speed, persistence and context are the most important design problems we'll have to deal with during the next decade. Scale, because we're creating and recording more data than at any time in human history – much of it of dubious value, but none of it obviously value-less. Speed, because data flows now. Ceaselessly. In high volume. It has to be persisted at multiple latencies, from milliseconds to decades. And context, because the context of creation is different from the context of transmission, which is different from the context of use.
There are a lot of red herrings, false premises and just-plain-dementia that get in the way of us seeing the problem clearly. We must work through what we mean by "structured" and "unstructured", what we mean by “big data” and why we need new technologies to solve some of our data problems. But “new technologies” doesn’t mean reinventing old technologies while ignoring the lessons of the past. There are reasons relational databases survived while hierarchical, document and object databases were market failures, technologies that may be poised to fail again, 20 years later.
What we believe about data’s structure, schema, and semantics is as important as the NoSQL and relational databases we use. The technologies impose constraints on the real problem: how we make sense of data in order to tell a computer what to do, or to inform human decisions. Most discussions of data and code lose sight of the unconscious tradeoffs made when selecting these technology handcuffs.
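The claim that all data has schema can be made concrete with a small sketch (illustrative only, not from the talk): whether structure is declared before writing or re-imposed on every read, it exists either way.

    # Minimal sketch, assuming invented example data: the schema exists in
    # both cases; only the moment of declaration and enforcement differs.
    import json
    import sqlite3

    # Schema on write: structure is declared before any data lands.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    db.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "acme", 99.5))

    # "Schemaless" storage: the same structure is merely implicit in the bytes,
    # and every reader must re-impose it to get anything useful out.
    raw = '{"order_id": 1, "customer": "acme", "amount": 99.5}'
    record = json.loads(raw)   # parsing is applying a schema
    total = record["amount"]   # field names are schema too
    print(total)

Either way the reader depends on the same field names and types; the difference is only who enforces them, and when.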
Cloud computing is creating a new era for IT by providing a set of services that appear to have infinite capacity, immediate deployment and high availability at trivial cost. These are all appealing to someone running a data warehouse when data volume, use and cost are growing at a rapid rate.
Today most organizations look at cloud as a way to lower data center and IT costs. While cost reduction is a real benefit, there is more value in the increased scalability, speed to procure (and give up) resources, and ease of delivery in cloud environments.
Database workloads are particularly challenging in the cloud. Cloud deployments beyond a moderate scale favor shared-nothing database architectures designed to run transparently in a multi-node environment. We are still in an early period of standardization and design of software to run in the cloud. Not all workloads are suitable for deployment on a collection of small virtualized servers today. Business intelligence and analytic database workloads fall into this area, raising the importance of analysis for fit with public and private cloud options.
Description of the basic cloud principles, the cost & deployment model for cloud, shortcomings for BI workloads beyond modest scale, and some stats on market adoption/preference of cloud for DW.
Open Data: Free Data Isn't the Same as Freeing Datamark madsen
Talk given at the South Tyrol Innovation conference on open data, mainly focused on government open data.
Open data doesn’t mean free data and other maunderings about public data, public goods, and networked data as a resource.
The hidden costs of open data (and how to pay for them).
Beyond transparency (which is where a lot of this started).
Big Data Wonderland: Two Views on the Big Data Revolutionmark madsen
To kick off the Big Data for Enterprise IT Day, we present two views of big data. Is it truly something new, or just an evolution of what we have already? Join us for an interesting and entertaining talk that will help frame your thinking on big data. We take on the roles of former bosses: the techno-lustful and the luddite, and debate the key talking points put forth in the market.
An earlier video of this talk can be seen at http://www.youtube.com/watch?v=qnHHOWz5uvM
Using Data Virtualization to Integrate With Big Datamark madsen
Hadoop and big data don't sit as an island in organizations. Analyzing event streams and similar data requires integrating with other data from systems across the organization. This isn't easy with big data systems today because there are disparities in the technologies and environments compared to traditional IT. Data virtualization is one way to smooth over the integration and allow Hadoop to access other data, or allow SQL-oriented tools to access Hadoop.
One Size Doesn't Fit All: The New Database Revolutionmark madsen
Slides from a webcast for the database revolution research report (report will be available at http://www.databaserevolution.com)
Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget? Register for this Webcast to find out!
Robin Bloor, Ph.D. Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.
Webcast video and audio will be available on the report download site as well.
Architecting a Platform for Enterprise Use - Strata London 2018
1. Architecting a Data Platform for Enterprise Use. May 2018. Mark Madsen, Todd Walter. Copyright Third Nature, Inc.
2. The business complaints about data and analytics: "You really should fire IT."
3.–5. [diagram, built up across three slides: the business–IT dispute]
What is said in disputes: data deficient; takes too long; costs too much; function deficient; lack of agility.
IT proximate causes: people & vendor cost basis; client-server infrastructure; lack of "good enough" competency.
IT root causes: 1980s-era methods; inappropriate technology; data hygiene fetishes; vendor lock; 1990s-era procurement; IT skills deficit; dysfunctional OLTP portfolio.
How business responds: SaaS / cloud BI; consultants; self-service BI; shadow BI; workarounds & spreadmarts.
How IT responds: hidden costs; loss of control; loss of visibility; loss of knowledge; security risks; bad data risks.
This is not a technology problem – it is an architecture problem.
6. What do we hear in the field?
"I want to do analytics, beyond what I can do with BI tools"
"I need an analytics strategy"
"I need an analytics roadmap"
"I have a specific analytics project of type..."
“We have a data lake. What should we do with it?”
“Technology Y replaces technology X”
"I want to modernize my DW" aka "I want to speed up my DW"
“Don’t need schema, curation, ETL, governance … anymore”
Good judgement is the result of experience.
Experience is the result of bad judgement.
—Fred Brooks
8. DW: Centralize, that solves all problems!
Creates bottlenecks
Causes scale problems
Availability?
9. The data lake solution: no central authority
"wtf, it was fully operational!"
10. The data lake solution?
There’s a problem: as the lake is envisioned, it is still a centralized data architecture, but this time there is no single global model. Instead it’s files and not modeled. It can be operational while under construction. It’s still a death star.
11. Eventually we run into the same problems
"Seriously, wtf? It was agile and operational."
Rising complexity and scale break centralized models.
12. “Standardize on one tech, it’s simpler for everyone”
It’s called stack think. Pick your vendor. Cede all architecture.
14. “Infrastructure one PoC at a time”
• Use case driven
• Leverage (any) new technology
• Re-use of the technology stack
• Data is captive to the application or to the technology stack
• Developers tend to think “function first”
• Analytics people also think “function first”, but generally require integrated data
15. A key aspect of operational analytics is “end-to-end”
E2E does not allow you to take a random walk through technologies and BYOT, because the complexity of integration leads to a mountain of technical debt.
• Most data applications are organized around a process, not a task or function.
• Real-time analytics systems pass their data over the network, but the dependencies can cross application boundaries, as well as requiring persistence.
• Developers are being forced to think of data outside the local context.
17. [side-by-side images] Big data promise: what you see. Big data reality: what users see.
18. Persisting data is not the end of the line. If you stop here you win the battle and lose the war.
19. We don’t have an analytics problem, just like we didn’t have a BI problem
The origin of analytics as “business intelligence” was stated well in 1958: “…the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” ~ H. P. Luhn, “A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf
Our goal is analytics as a capability, not a technology.
20. Three constituencies: the stakeholder (aka the recipient), the analyst (aka the data scientist), and the builder (aka the engineer).
21. Starting points for analytics strategy
Many organizations choose to start with the analysts: create a data science team and turn them loose to find a problem.
Many more start with builders: technology solutions looking for problems, e.g. 65% of the IT-driven Hadoop and Spark projects over the last five years.
The right place to start? Stakeholders: the goal to achieve, the problem to solve.
22. NATURE OF THE PROBLEM FROM THE STAKEHOLDER’S PERSPECTIVE
Each constituency has their own set of problems to deal with.
23. BI and Analysis: Repeatability and Discoverability
[diagram: discover, explore, analyze, model, then promote and consume]
BI side: focus on repeatability; application cycle time; 90% of users; 80% of data use.
Analysis side: focus on discoverability; analyst cycle time; 9% analysts, 1% data scientists; 80% of analysis.
24. The myth that still drives analytics – analytic gold
"All we need is a fat pipe and pans working in parallel…"
25. Analytic insights that result in no action are expensive trivia.
It’s not the insight, but what you do with it, that matters
26. Applying analytics is not an analytics problem
Applying analytics is not in the analyst’s control. It’s not in the engineer’s control. It’s in the control of the people involved in the process. Failures are often in execution, not in analytics development.
For example, we saw unexpectedly poor performance in a number of geographies. Was it the new analytics we tried? Was it a data problem? No, it was a simple compliance problem.
27. NATURE OF THE PROBLEM FROM THE ANALYST’S PERSPECTIVE
29. The nature of analytics problems is researching the unknown rather than accessing the known. Repeat for each new problem. (Diagram: Kate Matsudaira)
30. The main hurdle: just getting the data
Do you know where to find it? Because it may not be in the data warehouse.
Do you have access to it?
Is access fast enough? Because DWs are for QRD, not for moving huge piles of data. And ERP systems and SaaS apps are right out.
These are why developers jump on “data lake” as the solution. But that’s a small, easy part.
31. Do you have the right data?
Many machine learning techniques require labeled (known good) training data. In supervised learning, a person has to define the correct output for some portion of the data. The data is divided into training sets used for model building and test sets for validating the results.
• What is spam and what isn’t?
• What does a fraudulent transaction look like?
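A minimal sketch of that labeling-and-splitting step, assuming a toy spam task with invented examples (nothing here comes from the deck):

    # A person supplies the labels; the labeled set is then split into a
    # training portion for model building and a test portion for validation.
    import random

    labeled = [
        ("win a free prize now", "spam"),        # labels defined by a human
        ("meeting moved to 3pm", "not spam"),
        ("cheap pills online", "spam"),
        ("quarterly report attached", "not spam"),
    ] * 50  # pretend we have a few hundred labeled examples

    random.seed(42)
    random.shuffle(labeled)
    cut = int(len(labeled) * 0.8)
    train, test = labeled[:cut], labeled[cut:]   # build on train, validate on test
    print(len(train), "training examples,", len(test), "held out for testing")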
32. Do you have enough of the right data? ML needs a lot; you may be disappointed in your efforts.
33. Where do analysts spend their time? Mostly data work
Steps: define the business problem; translate the problem into an analytic context; select appropriate data; learn the data; create a model set; fix problems with data; transform data; build models; assess models; deploy models; assess results.
Time spent: roughly 70% on the data steps, 30% on the modeling steps. (Source: Michael Berry, Data Miners Inc.)
36. Analytics work is a production workload, but…
Schema control, space limitations, resource limitations, production lockdown.
37. NATURE OF THE PROBLEM FROM THE BUILDER’S PERSPECTIVE
38. IT and Ops people want to know “what to build?” Giant data platform? Self-service tools?
39. Analytics has different processes and workloads
None of this analytics work is the same as what IT considered “analysis” to be, which is usually equated with BI or ad-hoc query. Ad-hoc analysis, exploratory data analysis, batch analytics and real-time analytics are not the same thing.
[diagram: a real analytics production workflow. Hatch, CIKM ’11]
40. The world changes, do the models?
In BI you maintain ETL and schemas; in ML you maintain models. “Model decay” happens as the assumptions around which a model is built change, e.g. spam techniques change. When you adjust the model you need to know it is normal again:
▪ Better save the data used to build the model
▪ Better save the model
▪ Baseline and measurements
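One hedged sketch of that bookkeeping, with the file names, formats and toy model all invented for illustration:

    # Persist the model, the exact data used to build it, and a baseline
    # metric, so later drift can be judged against a known-good state.
    import hashlib
    import json
    import pickle

    def snapshot(model, training_rows, baseline_metrics, path="model_v1"):
        blob = pickle.dumps(training_rows)
        manifest = {
            "training_data_sha256": hashlib.sha256(blob).hexdigest(),
            "baseline": baseline_metrics,        # e.g. {"accuracy": 0.93}
        }
        with open(path + ".pkl", "wb") as f:
            pickle.dump(model, f)                # save the model itself
        with open(path + ".data.pkl", "wb") as f:
            f.write(blob)                        # save the data that built it
        with open(path + ".json", "w") as f:
            json.dump(manifest, f)               # save the baseline for comparison

    snapshot(model={"weights": [0.2, 0.8]},
             training_rows=[(1, "spam"), (0, "not spam")],
             baseline_metrics={"accuracy": 0.93})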
41. THREE PERSPECTIVES, ONE SOLUTION?
There are requirements from all constituents. You need to put them together to have a complete picture of what’s needed.
42. The missing stakeholder
There is another stakeholder: analytics management, the CAO, CDO or VP of analytics, aka “your boss” if you’re a data scientist. This is the perspective and the problems of the person responsible for oversight of the team, whose efforts span the organization and multiple projects.
47. There is an extensive list of requirements to support
Primary requirements needed by constituents S D E
Data catalog and ability to search it for datasets X X
Self-service access to curated data X
Self-service access to uncurated (unknown, new) data X X
Temporary storage for working with data X
Data integration, cleaning, transformation, preparation tools and environment X X
Persistent storage for source data used by production models X X
Persistent storage for training, testing, production data used by models X X
Storage and management of models X X
Deployment, monitoring, decommissioning models X
Lineage, traceability of changes made for data used by models X X
Lineage, traceability for model changes X X X
Managing baseline data / metrics for comparing model performance X X X
Managing ongoing data / metrics for tracking ongoing model performance X X X
S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
49. Customer Data Plateaus
[chart: data volume on a log scale, growing in plateaus over time]
Within a plateau: add users, accumulate history, add attributes, add dimensions.
At each jump: an entirely new data set enabled by new technology at a new price point. Value has to exceed cost (ROI); the earliest adopters have vision ahead of ROI.
50. User Adoption of New Data
[chart: users on a log scale over time; series: customer data, users of data set 1, users of data set 2, users of integrated data]
User adoption of new data sets starts over each time: a very small number of experts grows to a wider audience, sophisticated users giving way to business analysts, then business users, B2B customers and even consumers.
Greater value is derived when data sets are linked to see the bigger picture (e.g. who buys what, when). This comes after the initial extraction of easy value from standalone data.
53. Retail Plateaus
• <Store, Item, Week>
• <Store, Item, Day>: simple aggregations
• Market basket: affinity; link to person, demographics, HR
• Inventory by SKU by store: temporal, time series, forecasting; link to product, marketing, market basket
Scale annotations on the chart: 2B records total for 9 quarters; 2B records per day, kept for 9 quarters.
54. Retail Plateaus, continued (30B records per day)
• Web logs and traffic: behavioral patterns, e.g. path linked to person, offers, other channels; operations of the web site
• Supply chain sensors, sampled at major events: activity-based costing; link to customer, product, HR, planning
• Social media: text analysis, filtering, languages; link to customer, sales, other channel interactions
• Supply chain sensors, sampled at minutes or seconds: telematics; real-time event detection, trending, static and dynamic rules; link to HR, thresholds, forecasts, routing, planning
57. [diagram: one question draws data from across the enterprise]
Given the rise in warranty costs, isolate the problem to a plant, then to a battery lot. Communicate with affected customers, who have not made a warranty claim on batteries, through Marketing and Customer Service channels to recall cars with affected batteries.
Subject areas: Finance, Sales, Marketing, Operations, Customer Experience. Data sets touched: inventory, returns, manufacturing, supply chain, customer service, orders, revenue, expenses, case history, customers, products, pipeline, campaign history.
60. [diagram: five new data sources, each answering a different question]
• Clickstream: how many visitors did we have to our hybrid cars microsite yesterday?
• Product sensor: what are the temperature readings for batteries by manufacturer?
• Social media: what is the sentiment towards the line of hybrid vehicles?
• Customer care audio recordings: which customers likely expressed anger with customer care?
• Digital advertising: which ad creative generated the most clicks?
64. [diagram: event streams landing across Customer Experience, Sales, Marketing, Operations and Finance]
Event sources: out-of-deviation sensor readings, fraud events, abandoned carts, online price quotes, social media influencers.
Requirements at this stage: minimum viable curation, minimum viable data quality.
65. [diagram: the same event streams, now with data comprehension and pipelines]
Example pipeline steps for sensor data: sensor data formatting, unit normalization, vehicle/version normalization, make/model/year selection.
66. User Base and Sharing
• Super users (about 10 users): share over the cube wall
• Business analysts (10s to 100s of users): share in the department
• Action list, report and dashboard users (enterprise-wide use): share across the enterprise
68. Evolving Consumption Requirements
Consumption styles: production tools with curated data integrated across business areas; a wide variety of tools and data forms; targeted data access from many applications with response-time SLAs.
Workload profiles: high CPU with moderate-to-low IO; bulk scans, large computation, transformation, data access and specific integration; moderate CPU with high IO and resource management.
70. Access a wide variety of data to answer a question
Given the rise in warranty costs, isolate the problem to a plant and the specific lot. Exclude the two thirds of the batteries from the lot that are fine. Communicate with affected customers, who have not made a warranty claim, through Marketing and Customer Service channels to recall cars with affected batteries.
[diagram: the answer draws on manufacturing, campaign history, costs, products, customers, case history and sensor data across Finance, Sales, Marketing, Operations and Customer Experience]
71. An event contains mainly IDs…that reference other data
2010.01.10.01:41:4468 67.189.110.179 Ser40489a07-7f6e4251801a 13ae51a6d092 301212631165031 590387153892659 DNIS5555685981-UTF8&1T4ACGW_enUS386US387&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s m100109_44IOJ1-Id=105543CD1A7B8322
[annotated fields: timestamp, IP address, log ID, user ID, device ID, game ID, event details]
Log de-referencing and enrichment is difficult since you can’t enforce integrity like you can in a DB. What’s the glue that holds it together? It’s just keys to other data. E.g., remember that device ID 0 problem?
74. MDM again
If you want to link datasets then you must manage the keys. You need canonical forms for common data (in code too).
Event: 2010.01.10.01:41:4468 67.189.110.179 Ser40489a07-7f6e4251801a 13ae51a6d092 301212631165031 590387153892659 DNIS5555685981-UTF8&1T4ACGW_enUS386US387&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s m100109_44IOJ1-Id=105543CD1A7B8322
Cust-user: UID 590387153892659, CID 10098213, Email barry.dylan@odin.com, City Paris, State Île-de-France, Country France
Click: 2010.01.10.14:26:2468 67.189.110.179 10098213 5046876319474403 MOZILLA/4.0 (COMPATIBLE; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322) https://www.phisherking.com/gifts/store/LogonForm?mmc=link-src-email_m100109 http://www.google.com/search?sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&pid=1T4ACGW_13ae51a6d092&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
[annotated fields: user ID, IP address, date, customer ID, game ID]
The user ID links the event to the customer record; the customer ID links the customer record to the click.
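A small illustration of why canonical key forms matter when the “join” is a plain lookup with no integrity enforcement; the IDs and helper names are invented, not from the slides:

    # Linking an event to master data: the event is mostly keys, and nothing
    # checks them, so every pipeline must apply the same canonical key form.
    def canonical_uid(raw):
        # one canonical form for user IDs, used everywhere, in code too
        return raw.strip().lstrip("0") or "0"

    customers = {  # master data keyed by canonical user ID
        "590387153892659": {"cid": "10098213", "city": "Paris"},
    }

    event = {"ts": "2010.01.10.01:41:44", "uid": " 590387153892659 "}

    cust = customers.get(canonical_uid(event["uid"]))  # the "join" is a lookup
    print(cust["city"] if cust else "dangling key: nothing caught it")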
76. What you really have
This is called “self-service.” Analysts will love it…
77. The solution to our problems isn’t technology, it’s architecture.
78. History: this is how BI was done through the 80s
First there were files and reporting programs. Application files feed through a data processing pipeline to generate an output file. The file is used by a report formatter for print/screen. Files are largely single-purpose. Every report is a program written by a developer.
[diagram: data pipeline code]
79. History: this is how BI ended the 80s
The inevitable situation was... [diagram: a proliferation of data pipeline code]
80. History: this is how we started the 90s
Collect data in a database. Queries replaced a LOT of application code, because much of it was just joins. We learned about “dead code”.
[diagram: many consumers issuing SQL against one database]
81. Pragmatism and Data
Lessons learned during the ad-hoc SQL era of the DW market: when the technology is awkward for the users, the users will stop trying to use it. Even “simple” schemas weren’t enough for anyone other than analysts and their Brio… This led to the evolution of metadata-driven, SQL-generating BI tools and ETL tools.
82. BI evolved to hiding query generation from end users
With more regular schema models, in particular dimensional models that didn’t contain cyclic join paths, it was possible to automate SQL generation via semantic mapping layers. We developed data pipeline building tools (ETL). Query via business terms made BI usable by non-technical people. Life got much easier…for a while.
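A toy sketch of such a semantic mapping layer; the business terms, table names and join are invented to show the idea, not any particular BI tool’s design:

    # Business terms map to physical tables and expressions; the tool
    # assembles the SQL so the user never writes a query by hand.
    SEMANTIC_LAYER = {
        "Revenue": ("fact_sales", "SUM(sales.amount)"),
        "Region":  ("dim_region", "region.name"),
    }

    def generate_sql(measure, dimension):
        fact, measure_expr = SEMANTIC_LAYER[measure]
        dim, dim_expr = SEMANTIC_LAYER[dimension]
        return (f"SELECT {dim_expr}, {measure_expr} "
                f"FROM {fact} sales JOIN {dim} region USING (region_id) "
                f"GROUP BY {dim_expr}")

    # "Revenue by Region" becomes a generated query, no SQL knowledge needed
    print(generate_sql("Revenue", "Region"))

The non-cyclic join paths of a dimensional model are what make this kind of generation safe to automate.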
83. Today’s model: lake + self-service, looks familiar…
The lake with data pipelines to files or Hive tables is exactly the same pattern as the COBOL batch era. We already know that people don’t scale…
[diagram: dataflow code]
84. DATA ARCHITECTURE
We’re so focused on the light switch that we’re not talking about the light.
85. Decouple the Architecture
The core of the data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a data architecture that is not limiting:
▪ Deals with change more easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from streaming to one-time bulk
▪ Allows both reading and writing of data from outside
86. The architecture from 1988 we SHOULD HAVE BEEN USING
The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published: “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol. 27, No. 1 (1988).
87. The goal is to decouple: solve the application and infrastructure problems separately
[diagram: a Data Access layer (deliver & use) on top of data storage and platform services]
This separates BI from other uses of data, allowing each type of use to structure the data to its own requirements. Data access is already somewhat separate today; make the separation of different access methods a formal part of the architecture. Don’t force one model.
88. The goal is to decouple: solve the application and infrastructure problems separately
[diagram: a Data Management layer (process & integrate) on top of data storage and platform services]
Data management should not be subject to the constraints of a single use. It was blended with both data acquisition and structuring data for client tools; it should be an independent function.
89. The goal is to decouple: solve the application and infrastructure problems separately
[diagram: a Data Acquisition layer (collect & store: incremental, batch, one-time copy, real time) on top of data storage and platform services]
Data arrives in many latencies, from real time to one-time. Acquisition can’t be limited by the management or consumption layers. Data acquisition should not be directly tied to the needs of consumption; it must operate independently of data use.
90. The full analytic environment subsumes all the functions of the data warehouse, and extends them
[diagram: data acquisition (collect & store: incremental, batch, one-time copy, real time), data management (process & integrate) and data access (deliver & use), all on shared data storage and platform services]
The platform has to do more than serve queries; it has to be read-write.
91. The data architecture must align with system components because each of them addresses different data needs
[diagram: collect (incremental, batch, one-time copy, real time), then manage & integrate, then deliver & use, mapped onto data acquisition, data management and data access]
Separating concerns is part of the mechanism for change isolation.
92. Zone 1: Acquisition
▪ Focus is only on collecting and tracking data
▪ Direct recording of data from sources (transactions, events, objects)
▪ Each dataset has as much schema as the source provides; it could be explicit, implicit or none
▪ Foundation layer for all subsequent use
93. Zone 2: Integration - Standardized
▪ Parsed and cataloged: all structure and form are made explicit so data can be accessed
▪ Namespace (e.g. keys) managed
▪ Common elements are standardized
▪ Datasets are profiled, indexed, available
94. Zone 2: Integration - Enhanced
▪ Common shared KPIs, master datasets
▪ Extracted and derived data available, e.g. NEE output
▪ Data is linkable, labeled, possibly cleaned
95. Zone 3: Access
▪ Data structured to suit the workload
▪ Integrated data for a specific purpose
▪ Business rules applied, e.g. filters, controls, etc.
▪ Designed, explicit structures
▪ Generally repeatable use
96. Zone 4: Transient
▪ Under business / developer control
▪ The data for one-off projects of unknown value or repeatability, integrated from other layers
▪ Place for ephemeral analytics output
▪ The sandbox
97. Decoupled data architecture
[diagram: Zone 1 raw, Zone 2 standardized, Zone 2 enhanced, Zone 3 managed, Zone 4 transient]
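One hedged way to make the zones explicit is as configuration that every pipeline resolves against, rather than tribal knowledge; the paths and policy fields below are assumptions for illustration:

    # Each zone declares its storage location and handling policy once;
    # pipelines look paths up here instead of hardcoding them.
    ZONES = {
        "raw":          {"path": "/lake/raw",      "mutable": False, "schema": "as delivered"},
        "standardized": {"path": "/lake/standard", "mutable": False, "schema": "parsed, keys managed"},
        "enhanced":     {"path": "/lake/enhanced", "mutable": False, "schema": "linked, derived"},
        "managed":      {"path": "/lake/access",   "mutable": True,  "schema": "designed per workload"},
        "transient":    {"path": "/lake/sandbox",  "mutable": True,  "schema": "anything goes"},
    }

    def landing_path(zone, dataset):
        # resolve a dataset's landing location from the zone map
        return f"{ZONES[zone]['path']}/{dataset}"

    print(landing_path("raw", "weblogs/2018-05-01"))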
98. Food supply chain: an analogy for analytic data
Multiple contexts of use, differing quality levels. You need to keep the original because, just like baking, you can’t unmake dough once it’s mixed.
99. The design focus is different in each area
• Ingredients. Goal: available. The user needs a recipe in order to make use of the data.
• Pre-mixed. Goal: discoverable and integrateable. The user needs a menu to choose from the data available.
• Meals. Goal: usable. The user needs utensils but is given a finished meal.
100. The data is in zones of management, not isolating layers
[diagram: raw data in an immutable storage area; standardized or enhanced data; common or usage-specific data; transient data]
Relax control to enable self-service while avoiding a mess. Do not constrain access to one zone or to a single tool. Focus on visibility of data use, not control of data.
101. This data architecture resolves rate-of-change problems
• Raw zone: new data of unknown value, or simple requests for new data, can land here first with little work by IT.
• Standardized or enhanced data: more effort applied to management, so slower to change.
• Common or usage-specific data: optimized for specific uses and workloads; generally the slowest to change.
The real tradeoffs are not fast vs. slow but fast vs. right; not flexibility vs. control but flexibility vs. repeatability; agility for structure change vs. agility for questions and use.
102. Example: data environment, mid-size retailer
[diagram: a web application on SQL Server; replication and batch ETL collect data into an RDBMS; a standard retail/wholesale data warehouse model is built by nightly batch from stage and contains ~1 TB of data]
New requirement: load all the online marketing and web activity for use in analytic models (not simple web analytics). This will add ~10 TB of data, with continuous or hourly loads.
103. Example: data environment, mid-size retailer
[diagram: the web application on SQL Server feeds the RDBMS by replication and batch ETL; Hadoop is added for log file fetch & load of clickstream, with summaries sent to the reporting environment]
The persistent store and the data warehouse sit on the same database in different schemas. Discovery work is usually done in Hadoop, sometimes in the database.
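A minimal sketch of that fetch-and-summarize pattern, with the log format, file name and aggregation all invented for illustration:

    # Raw clickstream stays in the cheap store untouched; only a small
    # daily summary is pushed on to the reporting environment.
    import csv
    from collections import Counter

    def summarize_clickstream(log_lines):
        hits = Counter()
        for line in log_lines:
            ts, page = line.split()[:2]     # assumed "timestamp page ..." layout
            hits[(ts[:10], page)] += 1      # daily page counts
        return hits

    raw = ["2018-05-01T09:00 /home a", "2018-05-01T09:01 /cart b",
           "2018-05-01T09:02 /home c"]

    with open("daily_page_hits.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["day", "page", "hits"])  # the only thing the DW receives
        for (day, page), n in summarize_clickstream(raw).items():
            w.writerow([day, page, n])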
104. This data architecture uses the 3-zone pattern
[diagram: the web application, SQL Server, Hadoop and the RDBMS/DW mapped onto the zones]
• Raw data is stored in two places: relational and small data sets in the database, log collection in Hadoop.
• Cleaned, summarized or derived data is stored in the database.
• All reporting data and analytics output is sent to the DW, where it can be accessed.
105. The concept of a zone is not a physical system; it’s data architecture
The biggest decision is to separate all data collection from data integration and from consumption. Physical system and technology overlays are separate, and depend on the specific use cases and needs of the organization.
[diagram: apps and RDBMSs feed a distributed file system that decompresses and processes device logs, with a DW and a discovery platform alongside]
107. Blended architectures are a requirement, not an option
Data warehouse + data lake. On premise + cloud. RDBMS + S3 + HDFS. Commercial + open source.
108. Bricks are not buildings
[images: a pile of bricks vs. a finished building] We don’t think one is equivalent to the other. Architecture is not technology. It’s not a product you can buy.
109. New tools and tool architectures are required
One tool for all jobs doesn’t work any more* (*it never did, we just had fewer jobs)
110. Beware overengineering: do you really need special new technology infrastructure?
[chart: number of people vs. bigness of data, roughly a normal distribution]
The distribution of data size is about normal, yet the few at the big end set the tone of the market today.
111. Analytics: this is really raw data under storage
[chart: number of jobs vs. bigness of data]
Microsoft study of 174,000 analytic jobs in their petabyte-scale cluster: median size = ???
112. Working data for analytics is most often not big
[chart: the usable size distribution, number of jobs vs. smallness of data]
The median job size in that study: 14 GB.
113. TANSTAAFL
When replacing the old with the new (or ignoring the new in favor of the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
117. About the presenter
Mark Madsen is the global head of architecture at Teradata. Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years he has received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, chairs several conferences, and is on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
118. Todd Walter, Chief Technologist, Teradata
• A pragmatic visionary, Walter helps business leaders, analysts and technologists better understand all of the astonishing possibilities of big data and analytics
• Works with organizations of all sizes and levels of experience at the leading edge of adopting big data, data warehouse and analytics technologies
• With Teradata for more than 30 years; served for more than 10 years as CTO of Teradata Labs, contributing significantly to Teradata’s unique design features and functionality
• Holds more than a dozen Teradata patents and is a Teradata Fellow in recognition of his long record of technical innovation and contribution to the company