
Building a Data Platform Strata SF 2019

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This tutorial covers design assumptions, design principles, and how to approach the architecture and planning for multi-use data infrastructure in IT.
[This is a new, changed version of the presentation of the same title from last year's Strata]

Published in: Data & Analytics

Building a Data Platform Strata SF 2019

  1. 1. Copyright Third Nature, Inc. Architecting a Data Platform For Enterprise Use. March 2019. Mark Madsen, Todd Walter
  2. 2. Why are we talking about architecture?
  3. 3. Gartner finding in 2017: only 17% of Hadoop deployments are in production (Survey Analysis: BI and Analytics Spending Intentions, 2017; © 2017 Gartner, Inc. and/or its affiliates. All rights reserved). Gartner statement in 2018: only 15% are reported to be successful. A McKinsey survey this year asked executives if their company had achieved a positive ROI with their big data projects: 7% answered “yes”.
  4. 4. DW: Centralize, that solves all problems! Creates bottlenecks. Causes scale problems. Availability?
  5. 5. The data lake solution: no central authority. wtf, it was fully operational!
  6. 6. The data lake solution? There’s a problem: as the lake is envisioned, it is still a centralized data architecture, but this time there is no single global model. Instead it’s files and not modeled. It can be operational while under construction. It’s still a death star.
  7. 7. Eventually we run into the same problems. Seriously, wtf? It was agile and operational. You’re building a centralized model without realizing it. Maybe we can avoid recreating the same problems.
  8. 8. The solution to our problems isn’t technology, it’s architecture.
  9. 9. Architecture? What does the architect do? What is the architect’s responsibility and role? What is the core problem of data architecture?
  10. 10. Use
  11. 11. Use. Over time
  12. 12. Use. Over time. By people
  13. 13. Bricks are not buildings. We don’t think this is equivalent to this. Architecture is not technology. It’s not a product you can buy.
  14. 14. Blueprints are not architectures
  15. 15. Architecture is not… technical wiring diagrams, product lists, pretty pictures, hand-waving abstract statements and rules, a thing you can purchase. …so what is it?
  16. 16. What is this?
  17. 17. Architecture is an abstraction – a pattern that supports a purpose. You need purpose, therefore focus on business goals, outcomes, use cases.
  18. 18. No architecture = accidental architecture. The complexity of most organizations exceeds the capability of one system. Just like the DW, keep piling things on… Accidental architecture at enterprise scale: Kowloon full scope (above), and the detail of one building section (left) – this is a true monolith. Functional, served a purpose, impossible to add services to, support, maintain. Result: avoid, or bulldoze.
  19. 19. This focus on monolith is why the data warehouse is in legacy mode. Everything is tightly interconnected, such that nobody wants to change anything because they might break something else. It’s easier to start over somewhere else. Then they blame the data warehouse, or the database, or the data management team for what is essentially an architectural failure.
  20. 20. “The problem is the tools. Standardize on one tech!” It’s called stack think. Pick your vendor. Cede all architecture.
  21. 21. The problem is methods and process: agile your way into the Analytics Shantytown
  22. 22. “Start with the platform. The rest will follow.”
  23. 23. Big data promise: What you see. Big data reality: What users see.
  24. 24. HISTORY: HOW DID WE GET HERE?
  25. 25. The market shifted from not enough data in the 90’s to too much data: the problem has become managing not just size, but scope and variety
  26. 26. More complex needs drive more complex technology
  27. 27. Market hype and IT workforce skill gaps lead to FOMO. The pressure on IT to “just buy something” is high. The usual IT response is based on procurement, not innovation or integration. The data infrastructure solution tends to be: buy tools, collect data.
  28. 28. Development accumulates in a disorderly fashion
  29. 29. The end result in the IT landscape is complexity
  30. 30. Where we started: the data warehouse and BI. [Diagram: Sales, Customers, Inventory, and Products data stored as tables, queried with SQL for BI: dashboards, reports, queries.] The data warehouse solved a key problem: access to data from multiple OLTP systems to provide a unified view of key information. It is built on some assumptions: you must model the data before use, the data must be cleaned first, the data must be in tables. The DW is built for repeatable data use, not for one-time uses of information or unknown-value datasets. © 2018 Teradata
  31. 31. After a while, user self-service was required. [Diagram adds: Financials and a “?” tabular source; Discovery alongside SQL and BI; Analysts and Anyone as users.] Eventually we reached maturity with the DW, driving an increasing rate of new data requests by departments and individuals. A backlog of smaller data requests built up around it. We got self-service tools that actually worked.
  32. 32. Why Isn’t the Data in the DW? Rush Jobs and Unknown Data. The real problem is time: business analyses are often needed in a week or less. If the data is not in the DW then the analyst has to wait – the industry average for making data available in a DW is 10 weeks. They need quick access to new data rather than reusable, cleaned, common data. This means they need another way – and self-service tools offer an answer other than “wait.”
  33. 33. But data is rarely used in isolation, DW data is often needed. New data usually needs to be linked to existing data. Users don’t just need access to data, they need a place to work with and store data too. Ignore this requirement and you have runaway copies, extracts, and files tied to specific tools, with no visibility into what is happening.
  34. 34. There are limits to what you can do with queries. [Diagram adds: R; Array / matrix; Warehouse events.] Some questions are not answerable with queries. Deeper analysis is required to answer the question. The truth is, and always has been, that tables and a database are not the only technology in the analytic ecosystem.
  35. 35. Data Science accelerates, more data, more engines. [Diagram adds: Data science; clicks; Python; Spark; Data scientists; Time series.] As BI matured, information needs grew more complex. New analytics, new data, higher volumes drove creation of new techniques and new processing engines. New techniques, new engines, means new structuring and positioning of data is required.
  36. 36. More unique data, more approaches, technologies arrive monthly. [Diagram adds: Tensor Flow; Graph; “?” e.g. emails, images, more events.]
  37. 37. The end result of years of addition is accidental architecture. [Diagram: the accumulated sources, data structures, tools, and users.]
  38. 38. Today’s environment has (and still needs) different engines. [Diagram: the ecosystem of sources, data structures, tools, and users, with a layer labeled Engines.]
  39. 39. Some engines require specific structures and positioning of data. [Diagram: the ecosystem with layers labeled Engines and Data stored to engine needs.]
  40. 40. Distributed data is the norm, stored in multiple types of repositories. [Diagram: the ecosystem with a Raw data layer held in S3, HDFS, RDBMS, RDBMS.]
  41. 41. How much of the data science problem is tools vs engines? [Diagram: the ecosystem with layers labeled Engines and Data stored to engine needs.] The dirty secret of data science: more than 80% of the enterprise market does it on laptops. Most does not need specialized infrastructure.
  42. 42. Managing this complexity is a growing challenge. The challenge with use of data science shifts to operations: how to deploy and manage a production environment with so many technologies and copies of data?
  43. 43. Accidental architecture does not lead to agility. Accidental architecture is unsustainable in the long run.
  44. 44. What’s missing: loosely integrated data stored for provisioning. The arrows in architecture diagrams hide the hard work and make things seem simpler than they are. The arrows are where the integration costs and labor are buried.
  45. 45. Data architecture is needed to fill the gap. [Diagram adds a layer labeled: Curated data stored for integration and provisioning.] You need a layer to separate the mess of raw data below from the distributed, context-specific uses of data above. This minimizes the cost of change, provides user control of data via separation of data management from data use.
  46. 46. Renovating the accidental architecture requires a transition path to separate data infrastructure from the uses of data. [Diagram layers: Engines; Data stored to engine needs; Curated data stored for integration and provisioning; Raw data in S3, HDFS, RDBMS, RDBMS.]
  47. 47. We continue to act as if there is one single system. “So much complexity in software comes from trying to make one thing do two things.” – Ryan Singer
  48. 48. Break down the monolithic architecture – not one building, multiple buildings in a block
  49. 49. Applications: buildings above. Flexibility, repurposing, quicker change above, funded separately. Infrastructure: utilities below. Stability, reuse, slow predictable change below, funded centrally. We built things as a single entity that combined the building and the infrastructure in one complex monolith. Complexity requires a shift: separate the application from the infrastructure – we are focused on the block, not the buildings.
  50. 50. Speed and agility. You want speed. Speed comes from burying work in the infrastructure, which trades flexibility for repeatability. Therefore you must be careful when you draw the line between what is above ground and below, the uses of data and the infrastructure.
  51. 51. "Always design a thing by considering it in its next larger context - a chair in a room, a room in a house, a house in an environment, an environment in a city plan." – Eliel Saarinen
  52. 52. Think of IT as a city. Where does a building fit?
  53. 53. Streaming analytics; Data science; Data collection; Self-service / Discovery; BI (QRD); Self-service Data
  54. 54. The bigger picture of what we create. The ecosystem is like the city and all of its extended neighborhoods. The different applications or types of projects are like the neighborhoods with specific types of buildings. Each individual building (application) has its own unique blueprint. Underlying all the applications is shared data infrastructure, like city services. The shared infrastructure is the foundation of the analytics architecture for a company. [Diagram labels: City; Neighborhoods; Buildings and shared infra]
  55. 55. “Begin with the end in mind.” The starting point can’t be with technology. That’s like starting with bricks when designing a house. You may get lucky but… The goals and specific uses are the place to start ▪ Use dictates need ▪ Need dictates capabilities ▪ Capabilities are solved with technology. This is how you avoid spending $2M on a Hadoop and Spark cluster in order to serve data to analysts whose primary requirements are met with laptops.
  56. 56. Persisting data is not the end of the line. If you stop here you win the battle and lose the war
  57. 57. We don’t have a data science problem, just like we didn’t have a BI problem. The origin of analytics as “business intelligence” was stated well in 1958: “…the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” ~ H. P. Luhn, “A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf Our goal is analytics as a capability, not a technology
  58. 58. B I
  59. 59. The old problem was access, the new problem is analysis
  60. 60. Three constituencies: Stakeholder (aka the recipient), Analyst (aka the data scientist), Builder (aka the engineer)
  61. 61. Starting points for analytics strategy. Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem. Many more start with builders: technology solutions looking for problems, e.g. 65% of the IT driven Hadoop and Spark projects over the last five years. The right place to start? Stakeholders. The goal to achieve, the problem to solve.
  62. 62. WHAT IS THE CONTEXT OF USE?
  63. 63. There is an extensive list of requirements to support. Primary requirements needed by constituents, with columns S = stakeholder, user; D = data scientist, analyst; E = engineer, developer. Each requirement is followed by its marks: Data catalog and ability to search it for datasets (X X); Self-service access to curated data (X); Self-service access to uncurated (unknown, new) data (X X); Temporary storage for working with data (X); Data integration, cleaning, transformation, preparation tools and environment (X X); Persistent storage for source data used by production models (X X); Persistent storage for training, testing, production data used by models (X X); Storage and management of models (X X); Deployment, monitoring, decommissioning models (X); Lineage, traceability of changes made for data used by models (X X); Lineage, traceability for model changes (X X X); Managing baseline data / metrics for comparing model performance (X X X); Managing ongoing data / metrics for tracking ongoing model performance (X X X)
  64. 64. How did we get to this state with BI & analytics? There’s a difference between having no past and actively rejecting it.
  65. 65. Customer Data Plateaus. [Chart: customer data size on a log scale from 100 to 10,000, over periods 1 to 10.] Add users, accumulate history, add attributes, add dimensions. Entirely new data set enabled by new technology at new price point – value has to exceed cost (ROI). Earliest adopters have vision ahead of ROI.
  66. 66. User Adoption of New Data. [Chart: log scale from 1 to 10,000,000, over periods 1 to 13; labels: Customer Data, Users - Data Set 1, Users - Data Set 2, Users of Integrated Data.] User adoption of new data sets starts over. Very small number of experts growing to wider audience, sophisticated users moving to business analysts, then business users, B2B customers and even to consumers. Greater value is derived when data sets are linked – see bigger picture (e.g. who buys what, when). Comes after initial extraction of easy value from standalone data.
  67. 67. Retail Plateaus • <Store, Item, Week> • <Store, Item, Day> – Simple aggregations • Market Basket – Affinity – Link to person, demographics, HR • Inventory by SKU by store – Temporal, time series, forecasting – Link to product, marketing, market basket. Callouts: 2B records total for 9 quarters; 2B records per day, keep 9 quarters.
  68. 68. Retail Plateaus • Web Logs and traffic – Behavioral patterns – e.g. path linked to person, offers, other channels – Operations of the web site • Supply chain sensors – sampled at major event – Activity Based Costing – Link to customer, product, HR, planning • Social Media – Text analysis, Filtering, languages – Link to customer, sales, other channel interactions • Supply chain sensors – sampled at minutes or seconds – Telematics – Real time, Event detection, trending, static and dynamic rules – Link to HR, thresholds, forecasts, routing, planning. Callout: 30B records per day.
  69. 69. Data Size Plateaus over Time. [Chart: data size on a log scale from 1 to 10,000,000,000, over periods 1 to 41.] Lather, Rinse, Repeat. Trend line is roughly Moore’s Law. Delay or skip a generation if new data set is two orders of magnitude instead of one.
  70. 70. Customer Data Size vs. 2x per 12mo. [Chart: log scale from 1 to 10,000,000,000,000,000, over periods 1 to 43; labels: Customer Data, Moore's Law - 2x/18mo, 2x/12 mo, Data Exhaust.] But all studies show demand/creation of data increasing significantly faster than Moore’s law. If we started only 1 OOM behind, we are now 5 OOM away from user demand for data.
  71. 71. FINANCE: Revenue, Expenses, Customers. CUSTOMER CARE: Customer, Products, Orders, Case History. SALES: Orders, Customers, Products. MARKETING: Customers, Orders, Campaign History. OPERATIONS: Inventory, Returns, Manufacturing, Supply Chain. How many batteries are in inventory by plant? What is the trend of warranty costs? How many people made a warranty claim last week? How many sales have been made quarter to date? Which customers should get a communication on extended warranties? 54 32 29 49 66
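
A rough check of the gap described on slide 70, as a minimal sketch. It assumes the chart's horizontal axis counts years, that capacity doubles every 18 months while demand doubles every 12, and that both curves are smooth exponentials. The function names are illustrative and do not come from the slides.

```python
import math


def doubling_growth(years: float, doubling_months: float) -> float:
    """Growth factor after `years` when size doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def years_until_gap(extra_orders_of_magnitude: float,
                    fast_months: float = 12.0,
                    slow_months: float = 18.0) -> float:
    """Years for the fast curve to pull ahead of the slow curve by a factor
    of 10 ** extra_orders_of_magnitude."""
    # The ratio of the two curves is 2 ** (t * gap_doublings_per_year),
    # so solve 2 ** (t * rate) == 10 ** k for t.
    gap_doublings_per_year = 12.0 / fast_months - 12.0 / slow_months
    return extra_orders_of_magnitude * math.log2(10.0) / gap_doublings_per_year


if __name__ == "__main__":
    # Going from 1 OOM behind to 5 OOM behind is 4 additional orders of magnitude.
    years = years_until_gap(4)
    ratio = doubling_growth(years, 12.0) / doubling_growth(years, 18.0)
    print(f"years for the gap to widen by 4 OOM: {years:.1f}")
    print(f"demand / capacity ratio at that point: {ratio:.3g}")
```

Under those assumptions the gap widens from 1 to 5 orders of magnitude in roughly 40 years, which is in line with the 43 periods shown on the chart.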
  72. 72. 74 2954 32 49 41
  73. 73. Given the rise in warranty costs, isolate the problem to be a plant, then to a battery lot. Communicate with affected customers, who have not made a warranty claim on batteries, through Marketing and Customer Service channels to recall cars with affected batteries. [Diagram: FINANCE, SALES, MARKETING, OPERATIONS, and CUSTOMER EXPERIENCE sharing Inventory, Returns, Manufacturing, Supply Chain, Customer Service, Orders, Revenue, Expenses, Case History, Customers, Products, Pipeline, Campaign History.]
  74. 74. Manufacturing: Data Overlap Analysis. New Business Improvement Opportunities through Data Leverage. Rows are labeled “If” and columns “Then”; the columns follow the same application order as the rows (each application overlaps 100% with itself):
        Sales Force Profitability Analysis: 100%, 80%, 66%, 24%, 41%, 66%, 0%, 24%, 0%, 24%
        Transportation Planning: 11%, 100%, 87%, 28%, 64%, 56%, 22%, 34%, 13%, 45%
        Production Planning: 6%, 57%, 100%, 19%, 83%, 35%, 17%, 28%, 9%, 40%
        Vendor Managed Inventory: 12%, 100%, 100%, 100%, 78%, 100%, 61%, 100%, 39%, 100%
        Global Pricing Rationalization: 4%, 50%, 100%, 17%, 100%, 27%, 15%, 29%, 11%, 41%
        Fulfillment (Perfect Order): 16%, 94%, 90%, 48%, 58%, 100%, 35%, 53%, 19%, 56%
        Manufacturing Quality Optimization: 0%, 76%, 88%, 60%, 66%, 71%, 100%, 79%, 43%, 73%
        Preventative Maintenance Analysis: 7%, 75%, 93%, 61%, 80%, 68%, 50%, 100%, 27%, 82%
        Warranty Claims Analysis: 0%, 94%, 100%, 83%, 100%, 83%, 94%, 93%, 100%, 98%
        Quality Life Cycle Improvement: 4%, 57%, 75%, 35%, 64%, 41%, 26%, 47%, 16%, 100%
  75. 75. Sources: PRODUCT SENSOR, SOCIAL MEDIA, CUSTOMER CARE AUDIO RECORDINGS, DIGITAL ADVERTISING, CLICKSTREAM. How many visitors did we have to our hybrid cars microsite yesterday? What are the temperature readings for batteries by manufacturer? What is the sentiment towards line of hybrid vehicles? Which customers likely expressed anger with customer care? Which ad creative generated the most clicks? 65 41 32 19 28
  76. 76. Enterprise Data. [Diagram labels: SENSOR DIGITAL ADVERTISING CLICKSTREAM INTERACTIONS RATINGS & REVIEWS CUSTOMER PORTAL INTERACTIONS EXTERNAL INTERACTIONS SOCIAL MEDIA IVR Routing RFID ELECTRONIC COMMERCE FINANCE SALES MARKETING Inventory Returns Manufacturing Supply Chain Customer Service Orders Revenue Expenses Case History Customers Products Pipeline Customers Campaign History OPERATIONS CUSTOMER CARE – AUDIO RECORDINGS Maps Telemetry SERVER LOGS CUSTOMER EXPERIENCE]
  77. 77. Schema? Enterprise Data: Schema on Write. Evolving Data: Evolving Schema. New Data Sources: Schema on Read.
  78. 78. High, Well Known Quality; Directional Quality; Unknown/Low quality. Well Defined Data Model Curated JSON, XML, DB Extracted Attributes Curation Required
  79. 79. Minimum Viable Curation, Minimum Viable Data Quality. [Diagram labels: Out of Deviation Sensor Readings; Fraud Events; Abandoned Carts; Online Price Quotes; Social Media Influencers; CUSTOMER EXPERIENCE, SALES, MARKETING, OPERATIONS, FINANCE]
  80. 80. Data Comprehension, Pipelines: sensor data formatting; unit normalization; vehicle/version normalization; make/model/year selection. [Diagram labels: Out of Deviation Sensor Readings; Fraud Events; Abandoned Carts; Online Price Quotes; Social Media Influencers; CUSTOMER EXPERIENCE, SALES, MARKETING, OPERATIONS, FINANCE]
  81. 81. User Base and Sharing. Enterprise wide use: action list, report, dashboard users; share across enterprise. 10s to 100s of users: business analysts; share in department. 10s of users: analysts / engineers; share over cube wall.
  82. 82. Evolving Consumption. Enterprise production analytics: repeatable, auditable results. Departmental analytics: ad-hoc query, self service. Exploratory analytics: data labs.
  83. 83. Evolving Consumption Requirements. Production tools, curated data, integrated across business areas. Wide variety of tools and data forms. Targeted data access, many applications, response time SLAs. High CPU, moderate to low IO. Bulk scans, large computation, transformation, data access, specific integration. Moderate CPU, high IO, resource management.
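
The pipeline steps named on slide 80 (sensor data formatting, unit normalization, vehicle normalization) could look something like the sketch below. The field names `temp`, `unit`, and `model`, and the choice of Celsius as the target unit, are hypothetical and are not taken from the slides.

```python
from typing import Dict, Optional


def normalize_reading(raw: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Apply minimum viable curation to one raw sensor record.

    Returns a normalized record, or None when the record cannot be
    interpreted and should be left for later curation.
    """
    # Formatting: the reading must parse as a number.
    try:
        value = float(raw["temp"])
    except (KeyError, ValueError):
        return None

    # Unit normalization: convert everything to Celsius (assumed target unit).
    unit = raw.get("unit", "").strip().upper()
    if unit == "F":
        value = (value - 32.0) * 5.0 / 9.0
    elif unit != "C":
        return None  # unknown or missing unit: do not guess

    # Vehicle normalization: one canonical spelling per model name.
    model = raw.get("model", "").strip().lower()

    return {"vehicle_model": model, "battery_temp_c": round(value, 2)}


if __name__ == "__main__":
    print(normalize_reading({"temp": "212", "unit": "F", "model": " Hybrid-X "}))
    print(normalize_reading({"temp": "n/a", "unit": "C", "model": "Hybrid-X"}))
```

Records that fail a step are returned as `None` and not repaired, which keeps the curation minimal: only what is needed to make the readings comparable.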
  84. 84. Access Wide Variety of Data to Answer a Question. Given the rise in warranty costs, isolate the problem to be a plant and the specific lot. Exclude 2/3rd of the batteries from the lot that are fine. Communicate with affected customers, who have not made a warranty claim, through Marketing and Customer Service channels to recall cars with affected batteries. [Diagram: MARKETING, FINANCE, SALES, OPERATIONS, CUSTOMER EXPERIENCE drawing on MANUFACTURING, CAMPAIGN HISTORY, COSTS, PRODUCTS, CUSTOMERS, CASE HISTORY, SENSOR.]
  85. 85. DATA ARCHITECTURE. We’re so focused on the light switch that we’re not talking about the light
  86. 86. History: This is how BI was done through the 80s. First there were files and reporting programs. Application files feed through a data processing pipeline to generate an output file. The file is used by a report formatter for print/screen. Files are largely single-purpose use. Every report is a program written by a developer. [Diagram label: Data pipeline code]
  87. 87. History: This is how BI ended the 80s. The inevitable situation was... [Diagram label: Data pipeline code]
  88. 88. History: This is how we started the 90s. Collect data in a database. Queries replaced a LOT of application code because much was just joins. We learned about “dead code”. [Diagram labels: SQL]
  89. 89. Pragmatism and Data. Lessons learned during the ad-hoc SQL era of the DW market: When the technology is awkward for the users, the users will stop trying to use it. Even “simple” schemas weren’t enough for anyone other than analysts and their Brio… Led to the evolution of metadata-driven SQL-generating BI tools, ETL tools.
  90. 90. BI evolved to hiding query generation for end users. With more regular schema models, in particular dimensional models that didn’t contain cyclic join paths, it was possible to automate SQL generation via semantic mapping layers. We developed data pipeline building tools (ETL). Query via business terms made BI usable by non-technical people. Life got much easier…for a while. [Diagram labels: ETL, SQL]
  91. 91. Today’s model: Lake + data engineers, looks familiar… The Lake with data pipelines to files or Hive tables is exactly the same pattern as the COBOL batch. We already know that people don’t scale… [Diagram label: Dataflow code]
  92. 92. The architecture from 1988 we SHOULD HAVE BEEN USING. The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published. “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol. 27, No. 1 (1988)
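
To illustrate the idea on slide 90, here is a minimal sketch of a semantic mapping layer that turns business terms into SQL over a star schema. The table names, columns, and business terms are invented for the example. Because every dimension joins only to the fact table, there is exactly one join path per term, which is what makes the generation mechanical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Term:
    table: str        # table that holds the column
    expression: str   # SQL expression behind the business term
    is_measure: bool  # measures are aggregated, attributes are grouped


# Hypothetical star schema: one fact table, each dimension joined by one key.
FACT_TABLE = "sales_fact"
DIMENSION_KEYS: Dict[str, str] = {
    "product_dim": "product_id",
    "store_dim": "store_id",
}

# The semantic layer: business vocabulary mapped to physical columns.
SEMANTIC_LAYER: Dict[str, Term] = {
    "Revenue": Term("sales_fact", "SUM(sales_fact.amount)", True),
    "Product Category": Term("product_dim", "product_dim.category", False),
    "Store Region": Term("store_dim", "store_dim.region", False),
}


def generate_sql(business_terms: List[str]) -> str:
    """Build a SELECT statement from a list of business terms."""
    select_parts: List[str] = []
    group_by: List[str] = []
    joins: List[str] = []
    for name in business_terms:
        term = SEMANTIC_LAYER[name]
        alias = name.lower().replace(" ", "_")
        select_parts.append(f"{term.expression} AS {alias}")
        if not term.is_measure:
            group_by.append(term.expression)
        if term.table != FACT_TABLE:
            key = DIMENSION_KEYS[term.table]
            join = f"JOIN {term.table} ON {FACT_TABLE}.{key} = {term.table}.{key}"
            if join not in joins:  # join each dimension only once
                joins.append(join)
    sql = f"SELECT {', '.join(select_parts)} FROM {FACT_TABLE}"
    if joins:
        sql += " " + " ".join(joins)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


if __name__ == "__main__":
    print(generate_sql(["Product Category", "Revenue"]))
```

The user picks "Product Category" and "Revenue" and never sees the joins. A schema with cyclic join paths would need a rule for choosing between alternative paths, which this sketch deliberately avoids.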
  93. 93. But 30 years ago we did not expect so many different models of deployment, execution and use. Needs change. Analytics for eyeballs and analytics for machines are different. [Diagram labels: Deploy; ETL; Data; Data Storage; Alerts / Reports / Decisioning; Deploy; f; Data Streams; Intelligent Filter / Transform; Model Execution]
  94. 94. Copyright Third Nature, Inc. Decouple the Architecture The core of a data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a new data architecture that is not limiting: ▪ Deals with change more easily and at scale ▪ Does not enforce requirements and models up front ▪ Does not limit the format or structure of data ▪ Assumes the range of data latencies in and out, from streaming to one-time bulk ▪ Allows both reading and writing of data ▪ Makes data linkable, and provide governance where required ▪ Does not give up the gains of the last 25 years
  95. 95. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Data Acquisition Collect & Store Incremental Batch One-time copy Real time Platform Services Data storage Data arrives in many latencies, from real-time to one-time. Acquisition can’t be limited by the management or consumption layers. Data acquisition should not be directly tied to the needs of consumption. It must operate independently of data use.
  96. 96. Data hoarding is not a data management strategy
  97. 97. The goal is to decouple: solve the application and infrastructure problems separately, independently Platform Services Data Management Process & Integrate Data storage Data management should not be subject to the constraints of a single use Data management has historically been blended with both data acquisition and structuring data for client tools. It should be an independent function.
  98. 98. The goal is to decouple: solve the application and infrastructure problems separately, independently Platform Services Data Access Deliver & Use Data storage This separates uses of data from each other, allowing each type of use to structure the data specific to its own requirements. Data access is already somewhat separate today. Make the separation of different access methods a formal part of the architecture. Don’t force one model.
  99. 99. The full analytic environment subsumes all the functions of a data lake and a data warehouse, and extends them Data Acquisition Collect & Store Incremental Batch One-time copy Real time Platform Services Data Management Process & Integrate Data Access Deliver & Use Data storage The platform has to do more than serve queries; it has to be read-write.
  100. 100. The data architecture must align with system components because each of them addresses different data needs Incremental Collect Batch One-time copy Real time Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use Separating concerns is part of the mechanism for change isolation
  101. 101. Divide the data architecture to address three different goals Collection Creation, collection, storage of new data Distribution Organization and provisioning of data to multiple points of use Consumption Direct support of data use Separation of concerns, coordination of process Collect Curate Consume
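The Collect / Curate / Consume split above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenters' implementation: the `Storage` class, the three function names, and the `cust` / `amt` field names are all hypothetical. The point it tries to show is that each stage talks only to shared storage, so collection needs no knowledge of how the data will be used.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Storage:
    """Hypothetical shared store; each stage reads and writes only here."""
    raw: list[dict[str, Any]] = field(default_factory=list)
    curated: list[dict[str, Any]] = field(default_factory=list)


def collect(store: Storage, record: dict[str, Any]) -> None:
    """Collection: land the record as received, with no knowledge of later use."""
    store.raw.append(dict(record))


def curate(store: Storage) -> None:
    """Distribution: standardize raw records without modifying the originals."""
    store.curated = [
        {"customer": r["cust"].strip().lower(), "amount": float(r["amt"])}
        for r in store.raw
    ]


def consume_total_by_customer(store: Storage) -> dict[str, float]:
    """Consumption: one use-specific view; other uses can build their own."""
    totals: dict[str, float] = {}
    for r in store.curated:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


store = Storage()
collect(store, {"cust": " Acme ", "amt": "10.5"})  # messy input is kept as-is
collect(store, {"cust": "acme", "amt": "4.5"})
curate(store)
totals = consume_total_by_customer(store)
```

Because the raw records are never rewritten, a second consumer with different needs could add its own curation step later without asking the collection side to change anything.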
  102. 102. The design focus is different in each area Ingredients Goal: available User needs a recipe in order to make use of the data. Pre-mixed Goal: discoverable and integrable User needs a menu to choose from the data available Meals Goal: usable User needs utensils but is given a finished meal
  103. 103. Food supply chain: an analogy for analytic data Multiple contexts of use, differing quality levels You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.
  104. 104. Data has to be moved, standardized, tracked There is a lot of data policy and governance to think about Collect Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use CRM User reg Orders Leads Cust Cust Cust Cust Master DW Mktg Campaign analysis Rec engine
  105. 105. Data curation The problem with so many sources, types, formats and latencies of data is that it is now impossible to create one model for all of it in advance. Data modeling is about the inside of a dataset. Curation is about the set. Data curation, rather than data modeling, is becoming the more important data management practice.
  106. 106. The missing ingredient from most data projects Specifically, metadata kept separate from the data.
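One way to read "metadata kept separate from the data" is a small catalog that describes datasets without holding them. The sketch below is illustrative only: `DatasetEntry`, `Catalog`, the field names, and the file locations are hypothetical and do not come from the slides. It also hints at how separate metadata helps make data linkable, by finding datasets that share a column.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical description of a dataset; it holds no data itself."""
    name: str
    location: str            # where the data lives (illustrative path)
    source_system: str       # where it came from
    schema: dict[str, str]   # column name -> type
    received: date


class Catalog:
    """Metadata store kept apart from the files and tables it describes."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_column(self, column: str) -> list[str]:
        """Names of datasets that could be linked on a shared column."""
        return sorted(e.name for e in self._entries.values() if column in e.schema)


catalog = Catalog()
catalog.register(DatasetEntry(
    name="orders", location="raw/orders/2019-03-01.csv", source_system="Orders",
    schema={"order_id": "int", "customer_id": "int"}, received=date(2019, 3, 1),
))
catalog.register(DatasetEntry(
    name="leads", location="raw/crm/leads/2019-03-01.csv", source_system="CRM",
    schema={"lead_id": "int", "customer_id": "int"}, received=date(2019, 3, 1),
))
linkable = catalog.find_by_column("customer_id")
```

The catalog can be queried, versioned, and governed on its own schedule, which is harder when the only description of a dataset is buried in the pipeline code that reads it.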
  107. 107. The data is in zones of management, not isolating layers Raw data in an immutable storage area Standardized or enhanced data Common or usage-specific data Transient data Relax control to enable self-service while avoiding a mess. Do not constrain access to one zone or to a single tool. Focus on visibility of data use, not control of data.
  108. 108. This data architecture resolves rate of change problems Raw data in an immutable storage area Standardized or enhanced data Common or usage-specific data Transient data New data of unknown value, simple requests for new data can land here first, with little work by IT. More effort applied to management, slower. Optimized for specific uses / workloads. Generally the slowest change. Not fast vs slow: fast vs right Not flexibility vs control: flexibility vs repeatability Agile for structure change vs agile for questions / use
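The zone idea can be illustrated with a toy store in Python. This is a sketch under assumptions, not a description of any product: `Zone`, `ZonedStore`, the keys, and the user name are hypothetical. It models two points from the slides: the raw zone is write-once, and reads from any zone are allowed but recorded, so the emphasis is on visibility of use instead of access control.

```python
from enum import Enum


class Zone(Enum):
    RAW = "raw"                        # immutable landing area
    STANDARDIZED = "standardized"      # standardized or enhanced data
    USAGE_SPECIFIC = "usage_specific"  # common or usage-specific data
    TRANSIENT = "transient"            # scratch data


class ZonedStore:
    """Toy store: raw is write-once, every read is logged, no zone is off-limits."""

    def __init__(self) -> None:
        self._data: dict[tuple[Zone, str], object] = {}
        self.access_log: list[tuple[str, Zone, str]] = []

    def write(self, zone: Zone, key: str, value: object) -> None:
        # Only the raw zone refuses overwrites; other zones may be revised.
        if zone is Zone.RAW and (zone, key) in self._data:
            raise ValueError(f"raw data is immutable: {key!r} already exists")
        self._data[(zone, key)] = value

    def read(self, user: str, zone: Zone, key: str) -> object:
        # Record who read what, without blocking the read.
        self.access_log.append((user, zone, key))
        return self._data[(zone, key)]


store = ZonedStore()
store.write(Zone.RAW, "device_log_001", "TEMP=21.5C")
store.write(Zone.STANDARDIZED, "device_log_001", {"temp_c": 21.5})
store.write(Zone.STANDARDIZED, "device_log_001", {"temp_c": 21.5, "unit": "C"})

try:
    store.write(Zone.RAW, "device_log_001", "TEMP=99C")
    raw_overwrite_blocked = False
except ValueError:
    raw_overwrite_blocked = True

reading = store.read("analyst_a", Zone.STANDARDIZED, "device_log_001")
```

The zones here are a property of the data, not of a physical system; the same rules could sit on top of a file system, an object store, or several databases.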
  109. 109. The concept of a zone is not a physical system. It’s data architecture Apps DW The biggest decision is to separate all data collection from the data integration from consumption. Physical system/technology overlays are separate, depend on the specific use cases and needs of the organization. Dist FS RDBMSs Decompress & process device logs RDBMS RDBMS multiple Discovery platform
  110. 110. Data has to be moved, standardized, tracked There is a lot of data policy and governance to think about Collect Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use CRM User reg Orders Leads Cust Cust Cust Cust Master DW Mktg Campaign analysis Rec engine
  111. 111. HOW DO YOU EVOLVE WHAT YOU HAVE NOW?
  112. 112. Agile methods without agile architectures fail
  113. 113. Process and governance are part of architecture Data policy and organizational model go together. Policy choices are like land use policies. Want free access to water? Ribbon farms in a capitalist world, domains in a feudal manor.
  114. 114. Reinforcing relationships resist change, despite radical technology and practice shifts Note how only one third is tech Architectural Regime Methodology Technology Organization Organization defines where the work is done and the roles. Technology defines what work can be done in a given area. Methodology defines how work is done and what that work is.
  115. 115. What about the technology? Do I need an <X>?
  116. 116. “Technology first” thinking creates implementation problems If you start with technology, it will constrain the problems you can solve
  117. 117. Blended Architectures Are a Requirement, Not an Option Data Warehouse + Data Lake On Premise + Cloud RDBMS + S3 + HDFS Commercial + Open Source You can’t just buy one platform from one vendor. We aren’t building a death star. Each of the zones is likely to have products specific to that zone’s usage. The uses differ, the people using them differ, shouldn’t the tools differ too?
  118. 118. In other words… Software is like puppies. Getting a puppy is easy, raising one is hard. “The short term benefits of using a new [type of] database exceed the long term cost of operating it.” Dan McKinley
  119. 119. Choose Boring Technology You only get so many chances to make big changes. Don’t waste them. You can spend time focusing on the goal and worry less about the known tech, or you can spend time learning the new tech but less time focusing on the goal. The important thing is not the choice of tech, it’s knowing when the time is right to make a new tech choice.
  120. 120. TANSTAAFL When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  121. 121. Manage your data (or it will manage you) Data management is where developers are weakest. Modern engineering practices are where data management is weakest. You need to bridge these groups and practices in the organization if you want to do meaningful work with data. Remember Conway’s Law when you build.
  122. 122. “Begin with the end in mind” The starting point can’t be with technology. That’s like starting with bricks when designing a house. You may get lucky but… The goals and specific uses are the place to start ▪ Use dictates need ▪ Need dictates capabilities ▪ Capabilities are solved with technology This is how you avoid spending $20M on a new cluster in order to serve data to analysts whose primary requirements are met with laptops.
  123. 123. A good design is not the one that correctly predicts the future, it’s one that makes adapting to the future affordable. - Venkat Subramaniam
  124. 124. About the Presenter Mark Madsen is the global head of architecture at Think Big Analytics. Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, chairs several conferences, and is on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter.
  125. 125. Todd Walter Chief Technologist - Teradata • Chief Technologist for Teradata • A pragmatic visionary, Walter helps business leaders, analysts and technologists better understand all of the astonishing possibilities of big data and analytics • Works with organizations of all sizes and levels of experience at the leading edge of adopting big data, data warehouse and analytics technologies • With Teradata for more than 30 years and served for more than 10 years as CTO of Teradata Labs, contributing significantly to Teradata’s unique design features and functionality • Holds more than a dozen Teradata patents and is a Teradata Fellow in recognition of his long record of technical innovation and contribution to the company
