Evolution of Big Data and Information Management. Reference Architecture.
Oracle's Big Data reference architecture and the evolutionary development of Information Management

  • 2010 Tom Davenport in HBR
  • The closer you are to monetising data, the more organised the data should be. Hadoop minimises the penalty for not being organised, i.e. for not understanding your data. Relevant disciplines: data management, data profiling, descriptive statistics, graphical analysis.
  • Many of our customers have already developed Hadoop-based solutions in a pre-production setting by downloading Hadoop from the internet and running it on a virtualised Linux server, often on a laptop.
  • If the audience is very pro Big Data, lay on the first explanation thick – talk about TRADITIONAL systems and how ETL can be very slow to put in place because of the need to agree the process with the business, build a common understanding of the data, decide how it must be integrated, etc. Schema on read is the opposite – it is very fast to value, BUT the cost of ETL is carried by each system that accesses the data, and data quality is a function of the program that accesses the data. Time also has a bearing here – use the example of the recent changes to Hadoop and the deprecation of large numbers of Java classes.
  • Ken was also at Zynga and is ex-Siebel. His point about the way they have included analysts in their product teams is a key one as regards Analytics 3.0. Also, at Zynga more than 50% of the data was held in flex fields – it's a shame nobody told them how to model this kind of system! The closer you are to monetising data, the more organised the data should be; Hadoop minimises the penalty for not being organised, i.e. for not understanding your data.

Evolution of Big Data and Information Management. Reference Architecture. Presentation Transcript

  • 1. Information Management Reference Architecture 3rd Evolution EMEA Enterprise Architecture
  • 2. Contents: Introduction; Conceptual view; Design Patterns; IM Logical view and component outline; Discovery Lab; R/T Event Engine logical view; Mapping to previous Reference Architecture release.
  • 3. Introduction
  • 4. Introduction – 3rd Evolution of Oracle's Information Management Reference Architecture. This deck documents the main architectural components of Oracle's Information Management Reference Architecture. The architecture is intended to be practical and pragmatic; many of the ideas and experiences that inform the approach date back almost 20 years within Oracle and are based on real-world customer experiences. We define Information Management as follows (note that this definition embraces all types and forms of data, as well as aspects such as Information Discovery and Governance): "Information Management is the means by which an organisation maximises the efficiency with which it plans, collects, organises, uses, controls, stores, disseminates, and disposes of its Information, and through which it ensures that the value of that information is identified and exploited to the maximum extent possible."
  • 5. Oracle's Information Management Reference Architecture (3rd Edition) – what's changed?  More relevant to a Big Data oriented audience  Better representation of pragmatic customer projects  Includes the Raw data store as part of the architecture  Shows the effort / cost to store and interpret data that separates schema-on-read and schema-on-write approaches  Aligned to Analytics 3.0  Consistent with Oracle's engineering efforts
  • 6. Aligning analytical requirements and IM architecture – enabling Analytics 3.0 with a pragmatic architecture. Analytics 1.0: reporting with limited use of descriptive analytics; limited range of tabular data; batch-oriented analysis; analysis bolted onto a limited set of business processes. Analytics 2.0: firms "Competing on Analytics"; extended analytics to larger and less structured datasets; emergence of Big Data into the commercial world; recognition of the Data Science role in commercial organisations. Analytics 3.0: platform for monetisation; deeper analysis & more data; faster test-do-learn iterations; different types of data & wider business process coverage; analysts focus on discovery and driving business value; "Agile", with operational elements incorporated into design patterns. Adapted from Tom Davenport material.
  • 7. Oracle's Information Management Reference Architecture (3rd Edition) – what's changed? Business trends, technology trends, data trends. "All those layers and definitions in your Reference Architecture, I just don't get it… and it looks complicated!" – Hadoop developer knee-deep in complex MapReduce code.
  • 8. Conceptual View
  • 9. Conceptual View (diagram). Components: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting, Discovery Lab. Flows: input events, actionable events, actionable information, actionable insights, discovery output (events & data). Sources: structured enterprise data and other data. The view is split between Execution and Innovation.
  • 10. Component Outline. Data Engine – respond to R/T events in an appropriate and/or optimised fashion. Data Reservoir – raw data reservoir, typically event data at the lowest grain. Data Factory – managed ETL onto, within and between platforms. Enterprise Data – data stores for Information Management. Reporting – BI tools and infrastructure components. Discovery Lab – platform, data and tools to support the discovery process. Execution – things you do every day. Innovation – innovation to drive tomorrow's business. Line of Governance! Discovery Output – possible outputs include new knowledge, mining models / parameters, scored data…
  • 11. Design Patterns
  • 12. Design Pattern: Discovery Lab  Specific focus on identifying commercial value for exploitation  Small group of highly skilled individuals (aka Data Scientists)  Iterative development approach – data oriented, NOT development oriented  Wide range of tools and techniques applied  Data provisioned through the Data Factory or own ETL  Typically separate infrastructure, but could also be a unified Reservoir if resources are managed effectively
  • 13. Design Pattern: Information Platform  Build the next-generation Information Management platform  Either a Business Strategy driven or an IT cost / capability driven initiative  The initial project may be specifically linked to lower data grain or retention, BUT it is the platform as a whole that forms the solution required  Platform for consolidating other IM assets onto  Key issues relate to differences in procurement, development process, governance and skills  A Discovery Lab may be implemented as a pragmatic initial POV.
  • 14. Design Pattern: Data Application  Big Data technologies applied to a specific business problem, e.g. genome sequence analysis using BLAST, or log data from pharmaceutical production plant and machinery required for traceability  Limited or no integration to the broader Information Management estate  Specific solution, so non-functional requirements have less impact on solution quality or long-term costs  Platform costs and scalability are important considerations
  • 15. Design Pattern: Information Solution  Specific solution based on Big Data technologies requiring broader integration to the wider Information Management estate, e.g. an ETL pre-processor for the DW, or affordably storing a lower level of grain  Non-functional requirements are more critical in this solution  Scalable integration to the IM estate is an important factor for success  Analysis may take place in the Reservoir, or the Reservoir may only be used as an aggregator
  • 16. Design Pattern: Real-Time Events – real-time optimisation of events  May take place at multiple locations between the point of data origination and the Data Centre – requiring careful design and implementation  May include Next-Best-Activity, declarative rules and Data Mining technologies to optimise decisions, i.e. optimise across declarative, data mining, customer preference & business-defined rules  May include considerations for personal preferences and privacy (e.g. opt-out) for customer-related events  Common component seen across many industries & markets, e.g. connected vehicle
  • 17. Design pattern against component usage map. Discovery Lab – data science lab; assess the value of the data. Examples: Gov. Healthcare, Mobile operator. Information Platform – next-generation information platform to align IM capability with business strategy. Examples: Spanish Bank (business led), UK Gov. Dept. (tech. led). Data Application – addressing a specific data problem in Hadoop with no broader integration required. Examples: Pharma genome project, Pharma production archive. Information Solution – addressing a specific data problem that requires broader enterprise-wide integration, e.g. ETL pre-processing or an event store at lower grain than the existing DW. Examples: Investment Bank – trade risk, Mobile Operator – ETL processing. R/T Events – execution platform to respond to R/T events. Example: Mobile operator – location-based offers. Component usage varies by pattern: Data Engine – Possible / Yes; Data Reservoir – Yes / Yes / Yes; Data Factory – Yes / Yes / Yes; Enterprise Data – Yes; Reporting – Yes; Discovery Lab – Yes / Implied – alternative approach to Reservoir + Factory above.
  • 18. IM Logical View and Components
  • 19. Information Management – Logical View: Data Sources and Data Ingestion. Data Ingestion covers the methods and processes used to load data into the managed data store and to manage data quality. • Contemporary Information Management solutions must be able to ingest any type of data, from any source, in any format, via any mechanism and at any frequency, e.g. flat-file loads, streaming… • The data may be highly unstructured, mono-structured or highly poly-structured. • Data will vary in volume and in data quality. • Operational isolation should be considered to ensure operational applications continue in the event of the loss of the Information Management system. Sources shown: data engines & poly-structured sources (content, docs, web & social media, SMS) and structured data sources (operational data, COTS data, master & reference data, streaming & BAM).
  • 20. Information Management – Logical View: Information Ingestion. Data Ingestion – methods and processes to load data and manage data quality; Information Interpretation – methods and processes needed to access information; Managed Data – all data under management (load and query). • Data structures and processing required to load data into the managed data stores. • The shape in the diagram represents the work done on the data to load it and/or process it between layers. • The layer may include a file mechanism where required to facilitate loading (e.g. FUSE fs or ZFS for operational isolation and file concatenation). • The normal rules of micro-batch loading, taking all the data, and KISS principles are recommended. • DQ and loading statistics are presented through BI dashboards as a non-judgemental mechanism to improve DQ. • Data may be landed in the Ingestion layer to facilitate loading but is not typically stored there for any length of time, e.g. raw data loaded from web logs, with the sessionised data then loaded to Raw; another example is data used to manage CDC, which may be stored in this layer.
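As an illustration of the web-log example in the last bullet, the following minimal Python sketch sessionises raw click events before they are written to the Raw layer as part of a micro-batch. The field names (user_id, ts, url) and the 30-minute inactivity gap are illustrative assumptions, not part of the reference architecture itself.

```python
# Minimal sketch of the "web logs -> sessionised data -> Raw" step described above.
# Field names and the 30-minute inactivity gap are illustrative assumptions only.
import json
from itertools import groupby

GAP_SECONDS = 30 * 60  # session boundary: 30 minutes of inactivity

def sessionise(events, gap_seconds=GAP_SECONDS):
    """Group raw click events into sessions per user, based on an inactivity gap."""
    sessions = []
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    for user, user_events in groupby(events, key=lambda e: e["user_id"]):
        current, last_ts = [], None
        for ev in user_events:
            if last_ts is not None and ev["ts"] - last_ts > gap_seconds:
                sessions.append({"user_id": user, "events": current})
                current = []
            current.append(ev)
            last_ts = ev["ts"]
        if current:
            sessions.append({"user_id": user, "events": current})
    return sessions

if __name__ == "__main__":
    raw = [
        {"user_id": "u1", "ts": 1000, "url": "/home"},
        {"user_id": "u1", "ts": 1200, "url": "/search"},
        {"user_id": "u1", "ts": 9000, "url": "/home"},   # gap > 30 min -> new session
        {"user_id": "u2", "ts": 1100, "url": "/offers"},
    ]
    for s in sessionise(raw):
        print(json.dumps(s))  # in practice these records would be written to the Raw layer
```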
  • 21. Information Management – Logical View: Data Interpretation. Data Ingestion – methods and processes to load data and manage data quality; Information Interpretation – methods and processes needed to access information; Managed Data – all data under management (load and query). • Methods and processes required to access the information in each of the stores. • The shape in the diagram represents the cost of interpreting the data under management. • For schema-on-read the cost may include the Avro schema, SerDe or reader class, as well as the associated processing code to select, filter and process the data. • For schema-on-write the cost is represented only by the complexity of the SQL required to access the data – typically more complex for 3NF than for a dimensional query.
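To make the interpretation cost concrete, here is an illustrative Python sketch of the two approaches: with schema-on-read the consumer carries a reader function (the SerDe / reader-class equivalent) plus its own filter logic, while with schema-on-write the effort was spent up front and access is a short query. The record layout, field names and fact table are hypothetical, and sqlite3 merely stands in for the relational store.

```python
# Illustrative contrast between schema-on-read and schema-on-write interpretation costs.
# File layout, field names and the fact table are hypothetical stand-ins.
import sqlite3

RAW_LINES = [
    "2023-01-05|u1|SMS|1",      # schema lives only in the reader's head:
    "2023-01-05|u2|CALL|120",   # date|customer|event_type|units
    "bad record",
]

def read_raw_events(lines):
    """Schema-on-read: parsing, typing and filtering live in the consumer's code."""
    for line in lines:
        parts = line.split("|")
        if len(parts) != 4:          # data quality is a function of this code
            continue
        day, customer, event_type, units = parts
        yield {"day": day, "customer": customer, "type": event_type, "units": int(units)}

sms_events = [e for e in read_raw_events(RAW_LINES) if e["type"] == "SMS"]
print(sms_events)

# Schema-on-write: interpretation effort was paid during ETL; access is a short query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_events (day TEXT, customer TEXT, type TEXT, units INT)")
con.executemany("INSERT INTO fact_events VALUES (?,?,?,?)",
                [(e["day"], e["customer"], e["type"], e["units"]) for e in read_raw_events(RAW_LINES)])
print(con.execute("SELECT customer, SUM(units) FROM fact_events WHERE type='SMS' GROUP BY customer").fetchall())
```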
  • 22. Information Management – Logical View: Data Layers – cost, quality and concurrency trade-off. Managed Data layers: Access & Performance Layer – past, current and future interpretations of enterprise data, structured to support agile access & navigation; Foundation Data Layer – immutable modelled data in business-process-neutral form, abstracted from business process changes; Raw Data Reservoir – immutable raw data reservoir, where raw data at rest is not interpreted. • Data under management includes 3 key layers – Raw, Foundation, and Access and Performance. • Data is normally loaded into the Raw and Foundation layers, BUT BI Apps loads data directly into the APL, and federated warehouses may also load data at aggregate level from federated operating companies. • The Data Factory is responsible for loading data and then managing it between layers. • Work is done to elevate the data between layers – typically further enriching it and improving data quality. • Work done in processing the data between layers significantly reduces query costs, i.e. higher levels of concurrency can be sustained for the same processing power. • Moving up the layers: increasing enrichment, increasing data quality, reducing concurrency costs, increasing formalisation of definition.
  • 23. Information Management – Logical View: Data Layers – analytical processing. • The analytical processing capabilities of Hadoop and the RDBMS are used to elevate data between layers as previously described. • These analytical capabilities can also be leveraged by tools that access the data directly – typically by a Data Scientist for Discovery Lab operations, or by BI tools and services processing data using a model previously defined by the Data Scientist. • Analytical processing shown across the layers includes OLAP, data mining, statistics, text mining, image processing and other analytical processing. • Moving up the layers: increasing enrichment, increasing data quality, reducing concurrency costs, increasing formalisation of definition.
  • 24. Information Management – Logical View: Data Layers – Raw Data Reservoir. • Immutable data store with data at the lowest level of grain. • Typically implemented in Hadoop or NoSQL for cost reasons, but not always. • May be: queried directly; used to derive base-level data for the Foundation Layer (data may be represented logically in Foundation or physically – as the store is immutable, this affects ILM policy); or used to derive values or aggregates for the Access and Performance Layer (e.g. a propensity score or total monthly SMSs).
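The monthly-SMS example in the last bullet might look like the following illustrative sketch: event-grain reservoir records are rolled up into an APL aggregate. The field names are hypothetical; in a real deployment this would typically run as a Hive, MapReduce or Spark job managed by the Data Factory rather than local Python.

```python
# Illustrative derivation of an Access & Performance Layer aggregate (total monthly SMSs
# per customer) from raw, event-grain reservoir data. Field names are hypothetical.
from collections import defaultdict

raw_events = [
    {"customer": "u1", "ts": "2023-01-05T10:00:00", "type": "SMS"},
    {"customer": "u1", "ts": "2023-01-17T21:30:00", "type": "SMS"},
    {"customer": "u1", "ts": "2023-02-01T08:00:00", "type": "SMS"},
    {"customer": "u2", "ts": "2023-01-09T12:00:00", "type": "CALL"},
]

monthly_sms = defaultdict(int)
for ev in raw_events:
    if ev["type"] == "SMS":
        month = ev["ts"][:7]                      # e.g. "2023-01"
        monthly_sms[(ev["customer"], month)] += 1

# The resulting (customer, month) -> count rows would be written to the APL, where they can
# be queried cheaply and rebuilt at any time from the immutable raw data.
for (customer, month), count in sorted(monthly_sms.items()):
    print(customer, month, count)
```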
  • 25. Information Management – Logical View: Data Layers – Foundation Data Layer. • Immutable, integrated and standardised store of enterprise-class data – the data the business has agreed and organises around. • Data at the lowest level of grain of value for enterprise data. • Stored in a business-process-neutral fashion to avoid the data maintenance tasks needed to keep in step with current business interpretations. • Typically close to 3NF, with special attention to modelling hierarchies, flexible entity attributions, customer / supplier, etc. • ONLY implemented in relational technology, BUT this could be logical, as previously noted for the Raw Data Reservoir. • May be queried directly by a select few individuals; wider access to detail data is provided through views in the APL, often with VPD implemented to prevent queries against antecedent data. • Data in the Foundation Layer should be retained for as long as possible. • Consideration should be given to retaining data in the Raw Data Reservoir rather than archiving.
  • 26. Information Management – Logical View: Data Layers – Access and Performance Layer. • The layer facilitates access, navigation and query performance. • Allows for multiple interpretations of data from the Foundation Layer or Raw Data Reservoir. • Most structures can be thrown away and re-built from scratch based on Foundation and the Raw Reservoir. • The exception is derived and aggregate data, which may have to be retained if the underlying data/mechanism is archived. • Most users consuming information presented in a standardised fashion on dashboards and reports will access this layer only.
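The "throw away and rebuild" point can be illustrated with a small sketch: an APL summary is dropped and recreated entirely from Foundation Layer tables. sqlite3 stands in for the warehouse here, and the table and column names are hypothetical rather than the reference architecture's own model.

```python
# Illustrative drop-and-rebuild of an Access & Performance Layer structure from
# Foundation Layer tables. sqlite3 and the table/column names are stand-ins.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Foundation Data Layer: business-process-neutral, near-3NF tables
    CREATE TABLE customer (customer_id TEXT PRIMARY KEY, segment TEXT);
    CREATE TABLE usage_event (customer_id TEXT, event_day TEXT, units INTEGER);
    INSERT INTO customer VALUES ('u1','consumer'), ('u2','business');
    INSERT INTO usage_event VALUES ('u1','2023-01-05',3), ('u1','2023-01-17',1), ('u2','2023-01-09',7);
""")

def rebuild_apl(connection):
    """Drop and rebuild a denormalised APL summary entirely from Foundation data."""
    connection.executescript("""
        DROP TABLE IF EXISTS apl_monthly_usage;
        CREATE TABLE apl_monthly_usage AS
        SELECT c.segment,
               substr(u.event_day, 1, 7) AS month,
               SUM(u.units)              AS total_units
        FROM usage_event u
        JOIN customer c ON c.customer_id = u.customer_id
        GROUP BY c.segment, month;
    """)

rebuild_apl(con)
print(con.execute("SELECT * FROM apl_monthly_usage ORDER BY segment, month").fetchall())
```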
  • 27. Information Management – Logical View: Data Factory – ingestion flow. Data Ingestion covers batch and real-time ETL / ELT, CDC, streams and file operations. • Data destined for the Raw Data Reservoir may be loaded directly (e.g. through Flume) or stored temporarily in a file system prior to loading (e.g. FUSE fs). • Relational data is ingested via the most appropriate mechanism before persisting in the Foundation Data Layer (the usual rules apply…). • Ideally micro-batch, using the simplest mechanism possible. • Only data of agreed quality is loaded into the FDL. • For efficient relational loading, data may be pre-staged in the file system so that a large number of small files can be concatenated. Sources shown: data engines & poly-structured sources (content, docs, web & social media, SMS) and structured data sources (operational data, COTS data, master & reference data, BAM data).
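The small-file pre-staging step in the last bullet could look like the following sketch, which batches many small staged files into fewer, larger files before the bulk load. The directory paths and the 128 MB target size are arbitrary assumptions for illustration.

```python
# Illustrative pre-staging step: concatenate many small staged files into fewer, larger
# files before the bulk load. Paths and the 128 MB target size are arbitrary assumptions.
import glob
import os

STAGING_DIR = "/staging/incoming"     # hypothetical landing area
OUTPUT_DIR = "/staging/concatenated"  # hypothetical pre-load area
TARGET_BYTES = 128 * 1024 * 1024      # aim for roughly 128 MB per output file

def concatenate_small_files(staging_dir=STAGING_DIR, output_dir=OUTPUT_DIR):
    os.makedirs(output_dir, exist_ok=True)
    batch, batch_size, batch_no = [], 0, 0
    for path in sorted(glob.glob(os.path.join(staging_dir, "*.log"))):
        batch.append(path)
        batch_size += os.path.getsize(path)
        if batch_size >= TARGET_BYTES:
            _write_batch(batch, output_dir, batch_no)
            batch, batch_size, batch_no = [], 0, batch_no + 1
    if batch:
        _write_batch(batch, output_dir, batch_no)

def _write_batch(paths, output_dir, batch_no):
    out_path = os.path.join(output_dir, f"batch_{batch_no:05d}.log")
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                out.write(src.read())

if __name__ == "__main__":
    concatenate_small_files()
```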
  • 28. Information Management – Logical View: Data Factory – intra-data processing flow. Flow shown: 1. Data to be formalised is extracted from the HDFS store and loaded into the Foundation Data Layer, e.g. where Flume/HDFS is being used as an ETL pre-processor for Enterprise Data, or where HDFS data is being logically modelled in the Foundation Layer. 2. Data is re-structured and/or aggregated to facilitate access by users and business processes. 3. Data may also be re-structured and/or aggregated directly from the HDFS store where there is no specific requirement to manage Enterprise Data in a more formal data store over time.
  • 29. Information Management – Logical View: Information Provisioning – BI & Data Science components. Components shown: Virtualisation & Query Federation, Enterprise Performance Management, pre-built & ad-hoc BI assets, Information Services, Data Science. • Data Virtualisation and the various components used to access the data are as per our previous view on BI tools. • Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future-state roadmap. • Big Data has focused considerable attention on Data Science. • Analytical capabilities are delivered through analytical processing in the data layers, with Advanced Analytical Tools used to drive those capabilities. • Data Mining in particular often involves complex data processing to flatten data into a longitudinal form; this derived data and the model results are typically written to a project-based sandbox. • Agile discovery is often best served through a separate Discovery Lab infrastructure (see later details).
  • 30. Information Management – Logical View: Information Provisioning – BI flows. 1. The typical access mechanism for enterprise data is via Access and Performance Layer structures. 2. Access to Foundation Layer data is restricted to specific functions, processes and users only. 3. Data interpretation & DQ are assured through encoded logic, Avro, SerDe, FileReader, HCat, etc. 4. Diagonal flows show how data can be joined between layers as well as accessed directly, e.g. raw data can be queried directly through a Hive connector or joined to the RDBMS data and queried.
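The "diagonal flow" in point 4 is sketched below: raw reservoir events are interpreted at read time and joined to reference data from the relational layer. In practice the raw side would be read via a Hive connector or the query-federation layer; the record layout and dimension values here are hypothetical stand-ins.

```python
# Illustrative "diagonal flow": raw reservoir events interpreted at read time and joined
# to reference data from the relational layer. Names and layouts are hypothetical.
raw_lines = [
    "u1|2023-01-05|SMS",
    "u2|2023-01-06|CALL",
    "u3|2023-01-07|SMS",       # no matching dimension row -> dropped by the inner join
]

# Stand-in for a customer dimension held in the RDBMS (Foundation / APL)
customer_dim = {
    "u1": {"segment": "consumer", "region": "EMEA"},
    "u2": {"segment": "business", "region": "APAC"},
}

def read_raw(lines):
    """Schema-on-read interpretation of the raw event records."""
    for line in lines:
        customer, day, event_type = line.split("|")
        yield {"customer": customer, "day": day, "type": event_type}

joined = [
    {**event, **customer_dim[event["customer"]]}
    for event in read_raw(raw_lines)
    if event["customer"] in customer_dim
]
print(joined)
```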
  • 31. Information Management – Logical View: Data / Information Quality.  The quality of data at rest is assured by a number of factors in addition to the underlying quality of data at source: – File and event handling to ensure data is not missed (e.g. missing log files detected through log file sequence numbering). – The processing of data between the Raw and FDL / APL layers; this can be seen as a DQ firewall ensuring only data of known and acceptable quality is loaded. Typically this involves an element of synchronisation, as some data will need to be held back until the required reference data is available, due to the micro-batch incremental loading approach.  The quality of information presented to downstream tools and services is determined by: – Model quality, understanding and performance of provisioning from the modelled layers. – Consistency of definition, code quality and query performance when accessing Hadoop data (e.g. HR code, Avro definition…).
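A minimal sketch of the log-file sequence-number check mentioned above is shown below. The weblog_<seq>.log naming convention is an illustrative assumption; the point is simply that gaps in the sequence flag missing files before the micro-batch is loaded.

```python
# Minimal sketch of the "missing log files assured by sequence numbering" check.
# The weblog_<seq>.log naming convention is an illustrative assumption.
import re

def find_sequence_gaps(filenames):
    """Return sequence numbers missing between the lowest and highest file seen."""
    seqs = sorted(
        int(m.group(1))
        for name in filenames
        if (m := re.match(r"weblog_(\d+)\.log$", name))
    )
    if not seqs:
        return []
    return sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs))

arrived = ["weblog_0001.log", "weblog_0002.log", "weblog_0005.log"]
print(find_sequence_gaps(arrived))   # -> [3, 4]: hold the micro-batch or raise an alert
```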
  • 32. Information Management – Logical View: Data Reservoir & Enterprise Information Store (diagram recap). Data engines & poly-structured sources (content, docs, web & social media, SMS) and structured data sources (operational data, COTS data, master & reference data, streaming & BAM) are ingested into the Raw Data Reservoir (immutable raw data reservoir; raw data at rest is not interpreted), the Foundation Data Layer (immutable modelled data in business-process-neutral form, abstracted from business process changes) and the Access & Performance Layer (past, current and future interpretations of enterprise data, structured to support agile access & navigation), and provisioned through Virtualisation & Query Federation to Enterprise Performance Management, pre-built & ad-hoc BI assets, Information Services and Data Science.
  • 33. Discovery Lab
  • 34. Information Management – Logical View: Discovery Lab data flow. Data Science (primary toolset): statistics tools, data & text mining tools, faceted query tools, programming & scripting, data modelling tools, data quality & profiling. Analysis, processing & delivery: query & search tools, pre-built intelligence assets, intelligence analysis tools, ad-hoc query & analysis tools, OLAP tools, forecasting & simulation tools, reporting tools, graphical rendering tools, dashboards & reports, scorecards, charts & graphs. Flow: 1. The Data Factory is responsible for provisioning access to data, or for replication (all or a sample) to a sandbox in the Discovery Lab. 2. Data Science tools connect directly to the analysis sandbox, reading and writing data from/to the project sandboxes. 3. The Data Scientist can also access standard dashboards, reports and KPIs through the Data Virtualisation layer.
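Flow step 1 (replicating a sample into a project sandbox) might look like the sketch below. Local files stand in for the reservoir and sandbox, and the 1% sample fraction and fixed seed are arbitrary choices made for illustration.

```python
# Illustrative version of flow step 1: the Data Factory replicating a reproducible sample
# of reservoir data into a project sandbox. Local files stand in for both stores.
import random

SAMPLE_FRACTION = 0.01   # arbitrary 1% sample

def sample_to_sandbox(reservoir_path, sandbox_path, fraction=SAMPLE_FRACTION, seed=42):
    """Copy a reproducible random sample of reservoir records into a project sandbox file."""
    rng = random.Random(seed)
    kept = 0
    with open(reservoir_path) as src, open(sandbox_path, "w") as dst:
        for line in src:
            if rng.random() < fraction:
                dst.write(line)
                kept += 1
    return kept

if __name__ == "__main__":
    n = sample_to_sandbox("reservoir_events.csv", "sandbox_project1_events.csv")
    print(f"sampled {n} records into the project sandbox")
```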
  • 35. R/T event Engine – Logical View and Components
  • 36. Real-Time Data Engine – Logical View (diagram). Components: Mediation, Privacy Filter, Data Transform, Rules & Models, Next Best Action, Real-Time Data Store, Business Activity Monitoring (real-time event monitoring). Inputs: input events, Reference Data, Models & Rules, Privacy Data, Analytics. Outputs: to event subscribers (events / data).
  • 37. Real-Time Data Engine – components.  Message mediation service.  Privacy filter for event data, i.e. applying customer-specified privacy and preference filters to the data stream.  Transformation of the message data to its outbound form.  Application of declarative rules and models to the data stream to detect events for further downstream processing.  Next Best Activity (NBA) event detection and processing; NBA typically also includes control group management and global optimisation of rules.  Business Activity Monitoring.  Local data store – local persistence of rules and metadata.
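A toy sketch of how these components chain together is shown below: privacy filter, then transform, then declarative rule evaluation, then a next-best-action decision. The event fields, opt-out list, rule and offer are all invented for illustration; a real NBA engine would optimise across declarative rules, mining models and customer preferences rather than a single hard-coded rule.

```python
# Toy sketch of the component chain described above: privacy filter -> transform ->
# declarative rule evaluation -> next-best-action. All names and rules are invented.
OPTED_OUT = {"u9"}                       # customer-specified privacy preferences

def privacy_filter(event):
    """Drop events for customers who have opted out of event-driven marketing."""
    return None if event["customer"] in OPTED_OUT else event

def transform(event):
    """Shape the inbound message into the outbound form expected downstream."""
    return {"customer": event["customer"], "location": event.get("cell_id"), "kind": event["type"]}

def rules_and_models(event):
    """Declarative rule: only location updates in a target cell trigger processing."""
    return event["kind"] == "location_update" and event["location"] == "CELL_42"

def next_best_action(event):
    """Pick an action for the detected event (a real engine would optimise across rules/models)."""
    return {"customer": event["customer"], "action": "send_local_offer"}

def handle(raw_event):
    event = privacy_filter(raw_event)
    if event is None:
        return None
    event = transform(event)
    return next_best_action(event) if rules_and_models(event) else None

print(handle({"customer": "u1", "type": "location_update", "cell_id": "CELL_42"}))
print(handle({"customer": "u9", "type": "location_update", "cell_id": "CELL_42"}))  # filtered out
```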
  • 38. Real-Time data engine flows – describe each of the data flows (To Do). Diagram elements: Reference Data, Models & Rules, Privacy Data, Event Analytics, from input events, to event subscribers (events / data), R/T event monitoring.
  • 39. Mapping from the previous release of the architecture
  • 40. Information Management Reference Architecture Version 2.0 of the Architecture
  • 41. Information Management Reference Architecture – key differences from 2.0 to 3.0 of the architecture. The Interpretation layer shows the relative cost of reading data depending on its location. The previous Staging layer is now split into Data Ingestion and the Raw store; the Ingestion layer includes the methods and processes used to load data and manage data quality, and its shape represents the relative cost of these processes, i.e. from none for HDFS to considerable for the APL. The Raw Reservoir is typically at the lowest level of grain – often lower than the enterprise cares about, and so it may not have been included in previous representations. Renamed from Knowledge Discovery to Discovery Lab but otherwise unchanged; the role of Discovery Labs is becoming more central, so additional operational guidance will be added. Still an immutable store, but it may now be physically implemented in relational or non-relational technologies.
  • 42. Discovery Lab and Governance considerations
  • 43. Data discovery for the Enterprise – discovery and monetising steps have different requirements.  Discovery phase: unbounded discovery; self-service sandbox; wide toolset; agile methods.  Promotion to exploitation: commercial exploitation; narrower toolset; integration to operations; non-functional requirements; code standardisation & governance. (Chart: business value against time / effort, moving from the discovery phase and understanding of the data towards governed commercial exploitation.)
  • 44. To monetise fully you need to standardise – it's smart to standardise as part of Governance.  The discovery process requires a broad toolset.  Standardisation is essential for commercial exploitation.  Sustainability depends on standardisation / rationalisation: reduced training burden, reduced support costs, reduced license costs, ongoing agility & alignment. Diagram: from a Data Discovery toolset (standardised Hadoop zoo) to a Data Exploitation toolset (standardised deployment), built from rationalised components (• Cloudera CDH, Oracle, NoSQL • Mammoth, Yarn, EM-plugin • MR, Hive, Pig, Impala, Accum. • Flume NG, Oozie • … • … • …) plus optional additions (• Oracle Connectors • additional corporate standard components), combining the Oracle standard deployment with the corporate standard.
  • 45. The kind of things we are looking to discover.  The data science skills required vary by the type of analysis.  Data management skills vary by the amount of data and its structure.  So making data movement and manipulation easy will deliver a better result, and deliver it faster. (Chart: descriptive → diagnostic → predictive → prescriptive analysis, with business impact and the analytical skills required increasing from insight to foresight.)
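To illustrate the step from descriptive to predictive analysis on the same data, here is a small, self-contained sketch; the monthly usage figures and the simple linear-trend "model" are invented examples, not anything from the deck's charts.

```python
# Small illustration of the descriptive -> predictive step on the same series.
# The monthly usage figures and the linear-trend model are invented examples.
monthly_usage = [120, 132, 151, 149, 168, 181]   # units per month, hypothetical

# Descriptive: summarise what happened
mean = sum(monthly_usage) / len(monthly_usage)
print(f"mean monthly usage: {mean:.1f}, min: {min(monthly_usage)}, max: {max(monthly_usage)}")

# Predictive: fit a least-squares linear trend and project the next month
n = len(monthly_usage)
xs = range(n)
x_mean, y_mean = (n - 1) / 2, mean
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_usage)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean
print(f"forecast for month {n}: {intercept + slope * n:.1f}")
```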
  • 46. Discovery is a data process, not a development process. Three versions of the BI development process: • What IT thinks it should be: requirement analysis → high-level design → low-level design → coding → testing → acceptance testing. • What normally happens: Excel spreadsheet → shared linked spreadsheets → local Access database → shared Access server → SQL Server database → Oracle data warehouse. • What Big Data is trying to achieve: discovery & profile → model → exploit.
  • 47. Sandbox delivery options – providing the technology platform for agile discovery; sandboxes facilitate "Agile". • Delivery: a separate Data Lab environment, or delivered as part of the Information Management architecture. • Self-service sandboxes: self-service provisioning of new sandboxes for the discovery phase; automation of data access rights, resource and tools provisioning. • Data provision: quickly take on new data to rapidly make it available to analysts; tools such as "Data Factory" can fully automate data flows.
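One way the "automation of data access rights, resources and tools provisioning" could be expressed is sketched below as a self-service provisioning request. The request fields, quota values and grant strings are invented; a real implementation would call the cluster's resource manager and security tooling rather than returning plain objects.

```python
# Sketch of what automated self-service sandbox provisioning could look like.
# Request fields, quota values and the grant model are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class SandboxRequest:
    project: str
    owner: str
    datasets: list            # reservoir datasets the project needs read access to
    storage_gb: int = 500     # arbitrary default quota
    cpu_cores: int = 16

@dataclass
class Sandbox:
    path: str
    grants: list = field(default_factory=list)

def provision(request: SandboxRequest) -> Sandbox:
    sandbox = Sandbox(path=f"/discovery_lab/{request.project}")
    # 1. resource allocation (storage quota, compute queue) would be applied here
    # 2. data access rights: read-only grants on the requested reservoir datasets
    sandbox.grants = [f"GRANT READ ON {ds} TO {request.owner}" for ds in request.datasets]
    # 3. tool provisioning (notebooks, R/Python environments) would be triggered here
    return sandbox

sb = provision(SandboxRequest(project="churn_discovery", owner="data_scientist_1",
                              datasets=["raw.weblogs", "foundation.customer"]))
print(sb.path, sb.grants)
```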
  • 48. Monetise and optimise steps are different – what happens when we want to exploit insights?  New insights are deployed into the business process in some form: technical (e.g. business rules, new customer segments) or non-technical (e.g. observations about behaviours).  Business Intelligence systems are adapted to provide monitoring, feedback and control optimisation.  The faster you iterate this cycle the greater the benefit, BUT  Big Data does not change the fundamental need for accurate, consistent and integrated information.
  • 49. Rules of thumb for data – organised information leads to better analyses. Information needs to be organised in order to analyse it. RDBMSs are great when information is organised. Hadoop minimises the penalty for disorganisation. The closer you are to insight, the more complete and organised the information needs to be. Data needs to be organised to monetise it effectively.
  • 50. What that really means is…  We need to apply structure to data in order to analyse it.  Schema on read works well for us in Discovery, as we can be agile about interpretation.  As we move from Discovery towards exploitation, schema on read can cause Governance & quality issues.  Key lesson: the cost to store & manage data is a distinct consideration from the structural differences between Big Data and RDBMS technologies.
  • 51. De-mystifying schema on read.  Traditional "schema on write": data quality is managed by a formalised ETL process (DQ, business rules, mapping); data is persisted in a tabular, agreed and consistent form; data integration happens in ETL; structure must be decided before writing.  Big Data "schema on read": the interpretation of the data is captured in code, for each program accessing the data; data quality is dependent on code quality; data integration happens in code. (Diagram: DQ / business rules / mapping in ETL versus data reservoirs read directly.)
  • 52. Underlying storage capabilities are different. (Chart: Hadoop vs. relational vs. "My Application", rated 0–5 against tooling maturity, stringent non-functionals, ACID transactional requirements, security, variety of data formats, data sparsity, ETL simplicity, cost-effective storage of low-value data, ingestion rate and straight-through processing (STP).)
  • 53. It's smart to unify your data into a single Reservoir – fully expose your data for discovery and monetisation. An Analytics 3.0 platform includes both relational and non-relational technologies. Ken Rudin* refers to this as the genius of AND versus the tyranny of OR (see his TDWI '13 presentation). A unified Reservoir simplifies access to all data regardless of its characteristics and analysis requirements. (*Ken Rudin is Director of Analytics at Facebook.)
  • 54. Information Management – Logical View (overview). Data Ingestion (methods and processes to load data and manage data quality) feeds the Raw Data Reservoir (immutable raw data reservoir; raw data at rest is not interpreted), the Foundation Data Layer (immutable modelled data in business-process-neutral form, abstracted from business process changes) and the Access & Performance Layer (past, current and future interpretations of enterprise data, structured to support agile access & navigation). Information Access (methods and processes needed to access information) provisions data through Virtualisation & Query Federation to Enterprise Performance Management, pre-built & ad-hoc BI assets, Information Services and Advanced Analytical Tools (Information Provisioning and Analysis, Processing & Delivery).
  • 55. Information Management – Logical View: analytical processing and delivery. Data Ingestion – structures and processing required to load data (batch and real-time) and manage data quality; Information Access – structures required to interpret the data under management, i.e. its logical interpretation. • Data Virtualisation and the various components used to access the data are as per our previous view on BI tools. • Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future-state roadmap. • What has changed is the focus on Analytics. • Analytical capabilities are delivered through analytical processing in the data layers (OLAP, data mining, statistics, text mining and other analytical processing), with Advanced Analytical Tools used to drive those capabilities. • Data Mining in particular often involves complex data processing to flatten data into a longitudinal form; this derived data and the model results are typically written to a project-based sandbox. • Agile discovery is often best served through a separate Discovery Lab infrastructure (described later).