The document discusses various aspects of ETL subsystems based on Ralph Kimball's model. It begins with an introduction and background on the presenter and Rittman Mead. It then covers the major subsystems involved in extracting, cleaning and conforming data when using Oracle Data Integration and Oracle GoldenGate. It includes descriptions of change data capture, data profiling with Oracle Enterprise Data Quality, data cleansing with ODI constraints and EDQ, building an error event schema in ODI, and deduplicating data with EDQ. Screenshots and diagrams are provided to illustrate many of these concepts.
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration - Collaborate16 (Michael Rainey)
Big Data integration is an excellent feature in the Oracle Data Integration product suite (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality). But not all analytics, such as labor cost, revenue, or expense reporting, require big data technologies. Ralph Kimball, an original architect of the dimensional model in data warehousing, spent much of his career building an enterprise data warehouse methodology that can meet these reporting needs. His book, "The Data Warehouse ETL Toolkit", is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories: Extracting, Cleaning & Conforming, Delivering, and Managing, describing how the Oracle Data Integration products are perfectly suited for the Kimball approach.
Presented at Collaborate16 in Las Vegas.
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration (Michael Rainey)
Big Data integration is an excellent feature in the Oracle Data Integration product suite (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality). But not all analytics, such as labor cost, revenue, or expense reporting, require big data technologies. Ralph Kimball, an original architect of the dimensional model in data warehousing, spent much of his career building an enterprise data warehouse methodology that can meet these reporting needs. His book, "The Data Warehouse ETL Toolkit", is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories: Extracting, Cleaning & Conforming, Delivering, and Managing, describing how the Oracle Data Integration products are perfectly suited for the Kimball approach.
Presented at Oracle OpenWorld 2015 & BIWA Summit 2016.
Practical Tips for Oracle Business Intelligence Applications 11g Implementations (Michael Rainey)
The document provides practical tips for Oracle Business Intelligence Applications 11g implementations. It discusses scripting installations and configurations, LDAP integration challenges, implementing high availability, different methods for data extracts, and simplifying disaster recovery. Specific tips include scripting all processes, configuring the ODI agent JVM and connection pools for performance, understanding external LDAP authentication in ODI, implementing active-active high availability for ODI agents, choosing the right data extract method based on latency and volume, and using DataGuard and CNAMEs to simplify failover for disaster recovery.
Are you a young professional who just got out of college and is unsure which career path to follow? Are you thinking about changing your career to something completely new and looking for options? Either way, this webinar is the right one for you. It’s the first in a series that the new ODTUG Career Track Community will bring you to show what Oracle careers look like and where/how to start with them.
During this webinar, we will talk about what an ETL developer career looks like, what the expectations are, challenges, rewards, and which steps are needed to be successful. We will discuss a wide range of topics, such as tools used on the job, certification paths, the importance of social media, user groups, and more. This webinar will be presented by Rodrigo Radtke de Souza, who has been working in the Oracle ETL world for quite some time now and has achieved great accomplishments as an ETL developer, such as Oracle ACE nomination, frequent Kscope speaker, ODTUG Leadership Program participant, and a successful career at Dell.
How To Leverage OBIEE Within A Big Data Architecture (Kevin McGinley)
If you've invested in OBIEE and want to start exploring the use of Big Data technology, this presentation talks about how and why you might want to use OBIEE as the common visualization layer across both.
Offload, Transform, and Present - the New World of Data Integration (Michael Rainey)
How much time and effort (and budget) do organizations spend moving data around the enterprise? Unfortunately, quite a lot. These days, ETL developers are tasked with performing the Extract (E) and Load (L), and spending less time on their craft, building Transformations (T). This changes in the new world of data integration. By offloading data from the RDBMS to Hadoop, with the ability to present it back to the relational database, data can be seamlessly integrated between different source and target systems. Transformations occur on data offloaded to Hadoop, using the latest ETL technologies, or in the target database, with a standard ETL-on-RDBMS tool. In this session, we’ll discuss how the new world of data integration will provide focus on transforming data into insightful information by simplifying the data movement process.
Presented at Enkitec E4 2017.
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Teradata
Live Webcast on May 20, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=f09e84f88e4ca6e0a9179c9a9e930b82
Traditional data warehouses have been the backbone of corporate decision making for over three decades. With the emergence of Big Data and popular technologies like open-source Apache™ Hadoop®, some analysts question the lifespan of the data warehouse and the future role it will play in enterprise information management. But it’s not practical to believe that emerging technologies provide a wholesale replacement of existing technologies and corporate investments in data management. Rather, a better approach is for new innovations and technologies to complement and build upon existing solutions.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains where tomorrow’s data warehouse fits in the information landscape. He’ll be briefed by Imad Birouty of Teradata, who will highlight the ways in which his company is evolving to meet the challenges presented by different types of data and applications. He will also tout Teradata’s recently-announced Teradata® Database 15 and Teradata® QueryGrid™, an analytics platform that enables data processing across the enterprise.
Visit InsideAnalysis.com for more information.
These are high-level considerations of when to use the Integrated Data Warehouse or Hadoop for a specific workload. There are times when one is the clear choice and times when there are overlapping requirements to consider. We present pros and cons for both. But you must get into the requirements details to make a sensible decision.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... (StampedeCon)
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
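The Hive-based format conversion mentioned above (rewriting an existing text-format dataset into Parquet or ORC with a CREATE TABLE AS SELECT) can be sketched roughly as follows. This is a minimal illustration using the PyHive client; the host, user, and table names are invented for the example, not taken from the talk.

```python
from pyhive import hive

# Connect to a HiveServer2 instance (host/port/user are illustrative assumptions).
conn = hive.connect(host="hive-server.example.com", port=10000, username="etl_user")
cur = conn.cursor()

# Rewrite an existing text-format table as a Parquet-backed copy.
cur.execute("""
    CREATE TABLE events_parquet
    STORED AS PARQUET
    AS SELECT * FROM events_text
""")

# An ORC copy can be produced the same way for comparison.
cur.execute("""
    CREATE TABLE events_orc
    STORED AS ORC
    AS SELECT * FROM events_text
""")

cur.close()
conn.close()
```

Analytical queries against the Parquet or ORC copies can then be compared with the original text table to measure the columnar-format gains the abstract describes.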
The document discusses linked data, ontologies, and inference. It provides examples of using RDFS and OWL to infer new facts from schemas and ontologies. Key points include:
- Linked Data uses URIs and HTTP to identify things and provide useful information about them via standards like RDF and SPARQL.
- Projects like LOD aim to develop best practices for publishing interlinked open datasets. FactForge and LinkedLifeData are examples that contain billions of statements across life science and general knowledge datasets.
- RDFS and OWL allow defining schemas and ontologies that enable inferring new facts through reasoning. Rules like rdfs:domain and rdfs:range allow inferring type information.
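As a rough illustration of the rdfs:domain/rdfs:range inference described in the bullet above, here is a minimal Python sketch using rdflib together with the owlrl reasoner; the example vocabulary (Book, Person, hasAuthor) is invented for illustration.

```python
from rdflib import Graph, Namespace, RDF, RDFS
from owlrl import DeductiveClosure, RDFS_Semantics

EX = Namespace("http://example.org/")

g = Graph()
# Schema: ex:hasAuthor has domain ex:Book and range ex:Person.
g.add((EX.hasAuthor, RDFS.domain, EX.Book))
g.add((EX.hasAuthor, RDFS.range, EX.Person))
# Instance data: only the property assertion, no explicit type statements.
g.add((EX.hamlet, EX.hasAuthor, EX.shakespeare))

# Materialize the RDFS entailments in the graph.
DeductiveClosure(RDFS_Semantics).expand(g)

print((EX.hamlet, RDF.type, EX.Book) in g)         # True - inferred via rdfs:domain
print((EX.shakespeare, RDF.type, EX.Person) in g)  # True - inferred via rdfs:range
```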
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use... (Mark Rittman)
OBIEE12c comes with an updated version of Essbase that, in this release, focuses entirely on the query acceleration use case. This presentation looks at the new release and explains how the new BI Accelerator Wizard manages the creation of Essbase cubes to accelerate OBIEE query performance.
Semantic Technologies and Triplestores for Business Intelligence (Marin Dimitrov)
This document provides an introduction to semantic technologies and triplestores. It discusses the Semantic Web vision of making data on the web more accessible and linked. Key concepts covered include RDF, ontologies, OWL, SPARQL and Linked Data. It also introduces triplestores as RDF databases for storing and querying semantic data and compares their features to traditional databases.
Presentation discussing a major shift in enterprise data management. It describes the movement away from the older hub-and-spoke data architecture and towards the newer, more modern Kappa data architecture.
This document discusses ETL (extract, transform, load) processes using the Talend open source tool. It provides an overview of ETL and describes the extract, transform, and load steps. It also outlines a tutorial demonstrating how to set up Talend and run an ETL job that extracts data from a database, loads it into HDFS, runs a Hive query to analyze the data, and outputs the results to HBase.
A two day training session for colleagues at Aimia, to introduce them to R. Topics covered included basics of R, I/O with R, data analysis and manipulation, and visualisation.
This document provides an overview of big data concepts and technologies for managers. It discusses problems with relational databases for large, unstructured data and introduces NoSQL databases and Hadoop as solutions. It also summarizes common big data applications, frameworks like MapReduce, Spark, and Flink, and different NoSQL database categories including key-value, column-family, document, and graph stores.
This document discusses navigating user data management and data discovery. It provides an overview of evaluating and selecting data management tools for a Hadoop data lake. Key criteria for evaluation include metadata curation, lineage and versioning, integration capabilities, and performance. Several vendors were evaluated, with Global ID, Attivio, and Waterline Data scoring highest based on the criteria. The presentation emphasizes selecting a limited number of tools based on business and user requirements.
The document discusses Oracle's strategy to enable spatial and graph use cases on big data platforms. It provides an overview of Oracle's Big Data Spatial and Graph product, which allows for property graph analysis and spatial analysis on Hadoop. The spatial features allow for location data enrichment, proximity analysis, and preparation of map and imagery data. The graph features are useful for analysis of social media relationships, internet of things interactions, and cybersecurity.
The document discusses Extract, Transform, Load (ETL) processes. It defines extract as reading data from a database, transform as converting extracted data into a form suitable for another database, and load as writing transformed data into the target database. It then lists several common ETL tools and databases they can connect to.
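To make the extract/transform/load definitions above concrete, here is a minimal, self-contained Python sketch of the three steps; the databases, tables, and exchange rates are invented purely for illustration.

```python
import sqlite3

# Extract: read rows from the source database.
src = sqlite3.connect("source.db")
rows = src.execute("SELECT id, amount, currency FROM orders").fetchall()
src.close()

# Transform: convert the extracted data into the shape the target expects
# (here, normalising every amount to a single currency).
RATES = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}
transformed = [(order_id, round(amount * RATES[currency], 2))
               for order_id, amount, currency in rows]

# Load: write the transformed rows into the target database.
tgt = sqlite3.connect("target.db")
tgt.execute("CREATE TABLE IF NOT EXISTS orders_usd (id INTEGER PRIMARY KEY, amount_usd REAL)")
tgt.executemany("INSERT OR REPLACE INTO orders_usd VALUES (?, ?)", transformed)
tgt.commit()
tgt.close()
```

A dedicated ETL tool adds scheduling, metadata, error handling, and connectivity on top of this basic pattern, but the three steps remain the same.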
Reaching scale limits on a Hadoop platform: issues and errors created by spee... (DataWorks Summit)
Santander UK’s Big Data journey began in 2014, using Hadoop to make the most of our data and generate value for customers. Within 9 months, we created a highly available, real-time, customer-facing application for customer analytics. We currently have 500 different people doing their own analysis and projects with this data, spanning a total of 50 different use cases. This data (consisting of over 40 million customer records with billions of transactions) provides our business with new insights that were inaccessible before.
Our business moves quickly, with several products and 20 use cases currently in production. We currently have a customer data lake and a technical data lake. Having a platform with very different workloads has proven to be challenging.
Our success in generating value created such growth in terms of data, use cases, analysts, and usage patterns that 3 years later we find issues with scalability in HDFS, the Hive metastore, and Hadoop operations, and challenges with highly available architectures built on HBase, Flume, and Kafka. Going forward we are exploring alternative architectures, including a hybrid cloud model, and moving towards streaming.
Our goal with this session is to assist people in the early part of their journey by building a solid foundation. We hope that others can benefit from us sharing our experiences and lessons learned during our journey.
Speaker
Nicolette Bullivant, Head of Data Engineering, Santander UK Technology
This document discusses metadata and the importance of metadata management. It introduces Apache Atlas as an open source platform for metadata management and governance. Key points include:
- Metadata is important for data reuse, analytics, and governance. It provides context and meaning about data.
- Current reality is that metadata is often not well supported or integrated across tools. Apache Atlas aims to provide an open, unified approach.
- Apache Atlas has graduated to a top-level Apache project. It provides a type-agnostic metadata store and interfaces that can be accessed by various tools.
- The vision is for an open ecosystem where metadata is shared and federated across repositories from different vendors and tools.
On-Demand RDF Graph Databases in the Cloud (Marin Dimitrov)
slides from the S4 webinar "On-Demand RDF Graph Databases in the Cloud"
RDF database-as-a-service running on the Self-Service Semantic Suite (S4) platform: http://s4.ontotext.com
video recording of the talk is available at http://info.ontotext.com/on-demand-rdf-graph-database
Talend provides data integration and management solutions. It focuses on combining data from different sources into a unified view for users. Talend offers an open source tool called Talend Open Studio that allows users to visually design procedures to extract, transform, and load data between various databases and file types. It also offers features for data quality, storage optimization, master data management, and reporting.
PayPal Datalake Journey | Teradata - Edge of Next | San Diego | 2017 October ... (Deepak Chandramouli)
PayPal Data Lake Journey | 2017-Oct | San Diego | Teradata Edge of Next
Gimel [http://www.gimel.io] is a Big Data Processing Library, open sourced by PayPal.
https://www.youtube.com/watch?v=52PdNno_9cU&t=3s
Gimel empowers analysts, scientists, and data engineers alike to access a variety of Big Data / Traditional Data Stores - with just SQL or a single line of code (Unified Data API).
This is possible via the Catalog of Technical properties abstracted from users, along with a rich collection of Data Store Connectors available in Gimel Library.
A Catalog provider can be Hive or User Supplied (runtime) or UDC.
In addition, PayPal recently open sourced UDC [Unified Data Catalog], which can host and serve the Technical Metadata of the Data Stores & Objects. Visit http://www.unifieddatacatalog.io to experience it firsthand.
The document discusses Oracle's data integration products and big data solutions. It outlines five core capabilities of Oracle's data integration platform, including data availability, data movement, data transformation, data governance, and streaming data. It then describes eight core products that address real-time and streaming integration, ELT integration, data preparation, streaming analytics, dataflow ML, metadata management, data quality, and more. The document also outlines five cloud solutions for data integration including data migrations, data warehouse integration, development and test environments, high availability, and heterogeneous cloud. Finally, it discusses pragmatic big data solutions for data ingestion, transformations, governance, connectors, and streaming big data.
Lightning Talk: Get Even More Value from MongoDB Applications (MongoDB)
MongoDB is the hottest trend in NoSQL databases, helping customers all over the world. Once the application is built, you can generate even more revenue or cost savings by applying deep analytics to all that rich JSON data. Teradata and MongoDB are building a high-speed ability to swap data to and from MongoDB amazingly fast. The Teradata Data Warehouse can then combine the MongoDB real-time data with all kinds of analytic data such as customer profitability, historic inventory levels, consumer profiles, logistics, suppliers, and so on. If you have a MongoDB application running, it’s a good time to get even more value out of your JSON data. This talk will cover use cases and a quick review of the technical architecture.
Apache Atlas provides centralized metadata services and cross-component dataset lineage tracking for Hadoop components. It aims to enable transparent, reproducible, auditable and consistent data governance across structured, unstructured, and traditional database systems. The near term roadmap includes dynamic access policy driven by metadata and enhanced Hive integration. Apache Atlas also pursues metadata exchange with non-Hadoop systems and third party vendors through REST APIs and custom reporters.
Data Scientists and Machine Learning practitioners, nowadays, seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated collection of assets that require different types of runtimes, resources, and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available for real-time applications without significant latencies? Different techniques are needed for batch (offline) inference and instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require built-in auditing, authorization, and approval processes, while still supporting a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are the consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the Private Cloud and optimized for the Hortonworks Hadoop Data Platform. DSX essentially brings typical software engineering development practices to Data Science, organizing the dev->test->production lifecycle for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies, and even roll back models and custom scorers, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
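The batch-versus-online scoring distinction in the abstract above can be illustrated with a minimal online-scoring endpoint. This is a generic Flask/scikit-learn sketch, not how DSX itself exposes models; the model file and feature layout are assumptions made for the example.

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # a previously trained scikit-learn model (assumed)

@app.route("/score", methods=["POST"])
def score():
    """Online (real-time) scoring: one request in, one prediction out."""
    payload = request.get_json()
    features = [payload["features"]]          # a single row, shape (1, n_features)
    prediction = model.predict(features)[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Batch (offline) inference, by contrast, would load the same model once and score a whole table of rows in a scheduled job rather than answering individual HTTP requests.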
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
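As a rough sketch of the consuming side of such a pipeline, the JSON change records that GoldenGate's Kafka handler publishes can be read with a standard client. The topic, broker, and payload field names below are assumptions about a typical configuration, not details from the session.

```python
import json
from kafka import KafkaConsumer

# Consume change records published to Kafka as JSON.
consumer = KafkaConsumer(
    "ORDERS",                                   # topic name is an assumption
    bootstrap_servers=["kafka-broker:9092"],    # broker address is an assumption
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="dw-loader",
)

for message in consumer:
    record = message.value
    # A typical handler payload carries the operation type and row images;
    # the exact field names depend on how the handler is configured.
    print(record.get("op_type"), record.get("table"), record.get("after"))
```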
Oracle Data Integrator 12c - Getting Started (Michael Rainey)
I think it’s time for a fresh look at Oracle Data Integrator 12c. What is ODI? How has it evolved over the years and where is it going? And, of course, how do you get started with Oracle Data Integrator? I plan to share what I love about ODI, how to get started building your first ODI project, and what makes Oracle Data Integrator 12c the premier ETL and data warehousing tool on the market. It’s time to get back to the basics!
Presented at UTOUG Training Days 2017.
Oracle Data Integrator 12c - Getting Started (Michael Rainey)
Oracle Data Integrator (ODI) is a data integration tool that can extract, load, and transform heterogeneous data sources. It is flexible and uses a flow-based mapping approach. The presentation provided an overview of ODI and guidance on installation, configuration, and getting started with the developer quickstart to create models, schemas, and mappings between data stores. Key components like knowledge modules generate integration code, while packages and load plans orchestrate the data integration processes.
GoldenGate and ODI - A Perfect Match for Real-Time Data Warehousing (Michael Rainey)
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Following Oracle’s Next Generation Reference Data Warehouse Architecture, this discussion will provide best practices on how to configure, implement, and process data in real-time using ODI and GoldenGate. Attendees will see common real-time challenges solved, including parent-child relationships within micro-batch ETL.
Presented at RMOUG Training Days 2013 & KScope13.
Go Faster - Remove Inhibitors to Rapid Innovation (Fred George)
"Going faster" is the underlying theme to many current process and technology movements. I explore, in turn, inhibitors in technology, process, and organization, as well as how I have dealt with these in real situations.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
This document provides an overview and summary of a presentation on integrating Oracle GoldenGate and Apache Kafka for real-time data streaming. It introduces the speaker, describes Rittman Mead as a specialist in Oracle data integration and analytics, and outlines the challenges of integrating new data sources. The bulk of the document then dives into a step-by-step example of using GoldenGate to replicate transactional data from an Oracle database to Kafka in real-time via Kafka's publish-subscribe capabilities.
GoldenGate and Oracle Data Integrator - A Perfect Match - Upgrade to 12c (Michael Rainey)
The document discusses upgrading Oracle GoldenGate 11g and Oracle Data Integrator 11g to their 12c versions. It provides an overview of the steps to upgrade each product, including preparing the source and target systems, installing 12c, updating supplemental logging, and finalizing the upgrade by altering processes to write a new sequence number. It also discusses using the convprm tool to convert GoldenGate parameter files during the upgrade process.
The document discusses the scripting capabilities of Groovy and how it simplifies scripting compared to Java. Groovy allows separating code into simple scripts without unnecessary class and method definitions. It supports various approaches for running scripts, including via a GroovyShell, GroovyScriptEngine, or by implementing the JSR-223 scripting API. Groovy also allows predefining variables and methods in a script's binding to make domain-specific languages more natural to use.
GoldenGate and Oracle Data Integrator - A Perfect Match... (Michael Rainey)
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Following Oracle’s Next Generation Reference Data Warehouse Architecture, this discussion will provide best practices on how to configure, implement, and process data in real-time using ODI and GoldenGate. Attendees will see common real-time challenges solved, including parent-child relationships within micro-batch ETL.
Presented at Rittman Mead BI Forum 2013 Masterclass.
Real-time Data Warehouse Upgrade – Success Stories (Michael Rainey)
Providing a real-time BI solution for its global customers and operations department is a necessity for IFPI, the International Federation of the Phonographic Industry, whose primary objective is to safeguard the rights of record producers through various anti-piracy strategies.
For the data warehousing team at IFPI, using Oracle Streams and Oracle Warehouse Builder (OWB) for real-time data replication and integration was becoming a challenge. The solution was difficult to maintain and overall throughput was degrading as data volumes increased. The need for greater stability and performance led IFPI to implement Oracle GoldenGate and Oracle Data Integrator.
Co-presented with Nick Hurt at Rittman Mead BI Forum 2014 and KScope14.
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors (Michael Rainey)
Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.
Presented at the Oracle Technology Network Virtual Technology Summit February/March 2015.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data. Some of this data comes in the form of business transactions and is stored in a relational database. This relational data is often combined with other non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for us as data integration professionals is to then combine this data and transform it into something useful. Not just that, but we must also do it in near real-time and using a big data target system such as Hadoop. The topic of this session, real-time data streaming, provides us a great solution for that challenging task. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution. This session will walk through the implementation of GoldenGate and Kafka.
Presented at Collaborate16 in Las Vegas.
Data Warehouse Migration to Oracle Data Integrator 11g (Michael Rainey)
Pacific Northwest National Laboratory migrated their data warehouse from SQL Server and Visual Basic to Oracle Data Integrator. They developed a SQL parsing tool and used the ODI SDK to programmatically build ODI objects from the SQL metadata, automating the migration of over 4,900 packages. This minimized implementation risks and allowed them to complete the migration in a fraction of the originally estimated 2-3 years. The automated approach reduced human errors and is now used for ongoing operations.
As a data integration professional, it’s almost a guarantee that you’ve heard of real-time stream processing of Big Data. The usual players in the open source world are Apache Kafka, used to move data in real-time, and Spark Streaming, built for in-flight transformations. But what about relational data? Quite often we forget that products incubated in the Apache Foundation can also serve a purpose for “standard” relational databases. But how? Well, let’s introduce Oracle GoldenGate and Oracle Data Integrator for Big Data. GoldenGate can extract relational data in real time and produce Kafka messages, ensuring relational data is a part of the enterprise data bus. These messages can then be ingested via ODI through a Spark Streaming process, integrating with additional data sources, such as other relational tables, flat files, etc., as needed. Finally, the output can be sent to multiple locations: on through to a data warehouse for analytical reporting, back to Kafka for additional targets to consume, or to any number of other targets. Attendees will walk away with a framework on which they can build their data streaming projects, combining relational data with big data and using a common, structured approach via the Oracle Data Integration product stack.
Presented at BIWA Summit 2017.
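The Kafka-to-Spark leg of the pipeline described above can be sketched with Spark Structured Streaming. This is a generic PySpark illustration, not the code ODI would generate; the topic, broker, and record schema are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("gg-kafka-stream").getOrCreate()

# Assumed schema of the JSON change records published to Kafka.
schema = StructType([
    StructField("op_type", StringType()),
    StructField("table", StringType()),
    StructField("after", StructType([
        StructField("ORDER_ID", StringType()),
        StructField("AMOUNT", DoubleType()),
    ])),
])

# Read the Kafka topic as a stream and parse the JSON payload.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:9092")
       .option("subscribe", "ORDERS")
       .load())

orders = (raw.select(from_json(col("value").cast("string"), schema).alias("rec"))
          .select("rec.after.ORDER_ID", "rec.after.AMOUNT"))

# Send the parsed stream onward - here simply to the console for inspection;
# a real job would write to the warehouse, back to Kafka, or to other targets.
query = orders.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```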
Tame Big Data with Oracle Data Integration (Michael Rainey)
In this session, Oracle Product Management covers how Oracle Data Integrator and Oracle GoldenGate are vital to big data initiatives across the enterprise, providing the movement, translation, and transformation of information and data not only heterogeneously but also in big data environments. Through a metadata-focused approach for cataloging, defining, and reusing big data technologies such as Hive, Hadoop Distributed File System (HDFS), HBase, Sqoop, Pig, Oracle Loader for Hadoop, Oracle SQL Connector for Hadoop Distributed File System, and additional big data projects, Oracle Data Integrator bridges the gap in the ability to unify data across these systems and helps deliver timely and trusted data to analytic and decision support platforms.
Co-presented with Alex Kotopoulis at Oracle OpenWorld 2014.
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
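To accompany the "writing to Kafka" topic in the deck, here is a minimal producer sketch. The training material itself targets Kafka 0.8 with the Java client, so this Python (kafka-python) version, with its invented topic and broker address, is only an illustration of the same concept.

```python
import json
from kafka import KafkaProducer

# Producer configured to serialize dict payloads as JSON.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],   # broker address is an assumption
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each send() is asynchronous; .get() blocks until the broker acknowledges it.
future = producer.send("page-views", {"user": "alice", "url": "/pricing"})
metadata = future.get(timeout=10)
print(f"written to partition {metadata.partition} at offset {metadata.offset}")

producer.flush()
producer.close()
```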
Unlock the Value in your Big Data Reservoir using Oracle Big Data Discovery a... (Mark Rittman)
The document discusses Oracle Big Data Discovery and how it can be used to analyze and gain insights from data stored in a Hadoop data reservoir. It provides an example scenario where Big Data Discovery is used to analyze website logs, tweets, and website posts and comments to understand popular content and influencers for a company. The data is ingested into the Big Data Discovery tool, which automatically enriches the data. Users can then explore the data, apply additional transformations, and visualize relationships to gain insights.
KScope14 - Real-Time Data Warehouse Upgrade - Success StoriesMichael Rainey
Providing real-time data to its global customers is a necessity for IFPI (International Federation of the Phonographic Industry), a not-for-profit organization with a mission to safeguard the rights of record producers and promote the value of recorded music. Using Oracle Streams and Oracle Warehouse Builder (OWB) for real-time data replication and integration, meeting this goal was becoming a challenge. The solution was difficult to maintain and overall throughput was degrading as data volume increased. The need for greater stability and performance led IFPI to implement Oracle GoldenGate and Oracle Data Integrator. This session will describe the innovative approach taken to complete the migration from a Streams and OWB implementation to a more robust, maintainable, and performant GoldenGate and ODI integrated solution.
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
Mark Rittman from Rittman Mead presented on Oracle Big Data Discovery. He discussed how many organizations are running big data initiatives involving loading large amounts of raw data into data lakes for analysis. Oracle Big Data Discovery provides a visual interface for exploring, analyzing, and transforming this raw data. It allows users to understand relationships in the data, perform enrichments, and prepare the data for use in tools like Oracle Business Intelligence.
Integrating Oracle Data Integrator with Oracle GoldenGate 12cEdelweiss Kammermann
The document discusses integrating Oracle Data Integrator (ODI) with Oracle GoldenGate (OGG) for real-time data integration. It describes how OGG captures change data from source systems and delivers it to ODI. Key steps include configuring OGG installations and JAgents, defining OGG data servers in ODI, applying journalizing to ODI models, and creating and starting ODI processes that integrate with the OGG capture and delivery processes. The integration provides benefits like low impact on sources, great performance for real-time integration, and support for heterogeneous databases.
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Mark Rittman
Mark Rittman presented on deploying full OBIEE systems to Oracle Cloud. This involves migrating the data warehouse to Oracle Database Cloud Service, updating the RPD to connect to the cloud database, and uploading the RPD to Oracle BI Cloud Service. Using the wider Oracle PaaS ecosystem allows hosting a full BI platform in the cloud.
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Seeling Cheung
Citizens Bank was implementing a BigInsights Hadoop Data Lake with PureData System for Analytics to support all internal data initiatives and improve the customer experience. Testing BigInsights on the ViON Hadoop Appliance yielded the productivity, maintenance, and performance Citizens was looking for. Citizens Bank moved some analytics processing from Teradata to Netezza for better cost and performance, implemented BigInsights Hadoop for a data lake, and avoided large capital expenditures for additional Teradata capacity.
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
This talk focuses on what a data reservoir is, how it relates to the RDBMS data warehouse, and how Big Data Discovery gives business and BI users access to it.
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
Many data scientists are well grounded in creating accomplishments in the enterprise, but many others come from outside, from academia, PhD programs, and research. They have the necessary technical skills, but that doesn’t count until their product gets to production and into use. The speaker recently helped a struggling data scientist understand his organization and how to create success in it. That experience turned into this presentation, because many new data scientists struggle with the complexities of an enterprise.
Learning to Rank (LTR) presentation at RELX Search Summit 2018. Contains information about the history of LTR, a taxonomy of LTR algorithms, popular algorithms, and case studies of applying LTR to the TMDB dataset using Solr, Elasticsearch, and without index support.
Assessing New Databases– Translytical Use CasesDATAVERSITY
Organizations run their day-in-and-day-out businesses with transactional applications and databases. On the other hand, organizations glean insights and make critical decisions using analytical databases and business intelligence tools.
The transactional workloads are relegated to database engines designed and tuned for high transactional throughput. Meanwhile, the big data generated by all those transactions requires analytics platforms that can load, store, and analyze volumes of data at high speed, providing timely insights to businesses.
Thus, in conventional information architectures, this requires two different database architectures and platforms: online transactional processing (OLTP) platforms to handle transactional workloads and online analytical processing (OLAP) engines to perform analytics and reporting.
Today, a particular focus and interest of operational analytics is streaming data ingest and analysis in real time. Some refer to operational analytics as hybrid transaction/analytical processing (HTAP), translytical, or hybrid operational analytic processing (HOAP). We’ll address whether this model is a way to create efficiencies in our environments.
Did you know that the tech elite do not work at all like you do? Most people don’t know, and don’t want to know. The State of DevOps report found a 1000x gap in delivery time and reliability between elite and low performers. There is a similar gap in the time it takes to deliver data or ML pipelines to production, and the gap in the ability to compute datasets is even larger, somewhere around a million times. We call this the data divide, or the AI divide, and it is widening over time, since most companies are not aware of its width.
We will share the principles we applied in the most successful Scandinavian crossing of the data divide. We never explicitly shared, described, or even fully understood these principles at the time, but it is long overdue to enumerate them.
The presentation will likely be uncomfortable and surprising, because it does not match what you do or what your vendors say. You may have little practical use for the information, since the principles contradict many contemporary trends and popular technologies on the market, and the forces of trends, popularity, and vendor messaging are hard to overcome. They worked beautifully for us at the time.
5 Amazing Reasons DBAs Need to Love Extended EventsJason Strate
Extended events provide DBAs with a powerful tool that can be used to troubleshoot and investigate SQL Server. Throughout this session, you’ll walk through five great reasons, with demos. By the end of the webcast, you’ll be itching to grab the scripts from the demos to start building your own extended event sessions today.
ITCamp 2018 - Damian Widera U-SQL in great depthITCamp
I would like to invite you to this session about Microsoft Azure Data Lake and U-SQL. I will show how quickly you can do data analysis using traditional C# and a new language that is somewhat similar to T-SQL. I will also show more complicated things, such as how to run Python and R scripts to perform even more robust analysis.
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsMark Rittman
Mark Rittman, founder of Rittman Mead, discusses Oracle's approach to hybrid BI deployments and how it aligns with Gartner's vision of a modern BI platform. He explains how Oracle BI 12c supports both traditional top-down modeling and bottom-up data discovery. It also enables deploying components on-premises or in the cloud for flexibility. Rittman believes the future is bi-modal, with IT enabling self-service analytics alongside centralized governance.
DataOps @ Scale: A Modern Framework for Data Management in the Public SectorTamrMarketing
Within the last six months, U.S. agencies have begun defining a “Data Science Occupational Series”.
This means adding the term “(Data Scientist)” at the end of a job title to increase the odds of finding a candidate that understands data.
Watch the full presentation: https://resources.tamr.com/govdataops
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
This document provides an overview of the Databricks platform. It discusses how Databricks combines features of data warehouses and data lakes to create a "data lakehouse" that supports both business intelligence/reporting and data science/machine learning use cases. Key components of the Databricks platform include Apache Spark, Delta Lake, MLFlow, Jupyter notebooks, and Delta Live Tables. The platform aims to unify data engineering, data warehousing, streaming, and data science tasks on a single open-source platform.
Data-Centric Analytics and Understanding the Full Data Supply ChainDATAVERSITY
While model development is an important part of analytics, this activity can be compromised by a lack of understanding of the data used in these models and poor Data Quality. For insights to be relied upon and truly actionable, data-related issues must be addressed.
The data supply chain (the set of architectural components that moves data around the enterprise from points where it is created or acquired to points where it is used) must be managed to supply the needs of analytics and other constituencies.
This webinar describes how the data supply chain should be designed and operated to provide analytics with the data it needs, and how Data Scientists should interact with the data supply chain to obtain the data they need. It also covers:
- Data-centric considerations that must be taken into account in the development of analytic models
- Features of a modern data supply chain
- Major components in the data supply chain, with a focus on Data Lakes
- Major roles and responsibilities in the data supply chain
- How analytics must interact with the data supply chain
Continuous Intelligence - Intersecting Event-Based Business Logic and MLParis Carbone
Continuous intelligence involves integrating real-time analytics within business operations to prescribe actions in response to events based on current and historical data. It represents a paradigm shift from retrospective querying of data to providing real-time answers using stream processing as a 24/7 execution model. Technologies like Apache Flink enable this through scalable, fault-tolerant stream processing with stream SQL, complex event processing, and other abstractions.
4. info@rittmanmead.com www.rittmanmead.com @rittmanmead
About Rittman Mead
4
• World’s leading specialist partner for technical excellence, solutions delivery and innovation in Oracle Data Integration, Business Intelligence, Analytics and Big Data
• Providing our customers targeted expertise; we are a company that doesn’t try to do everything… only what we excel at
• 70+ consultants worldwide including 1 Oracle ACE Director and 3 Oracle ACEs, offering training courses, global services, and consulting
• Founded on the values of collaboration, learning, integrity and getting things done
Unlock the potential of your organization’s data
• Comprehensive service portfolio designed to support the full lifecycle of any analytics solution
5. info@rittmanmead.com www.rittmanmead.com @rittmanmead 5
Visual Redesign, Business User Training, Ongoing Support, Engagement Toolkit
Average user adoption for BI platforms is below 25%
Rittman Mead’s User Engagement Service can help
More info: http://ritt.md/ue
8. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Wait! What are Kimball ETL Subsystems?
Do you all know of Ralph Kimball?
8
www.kimballgroup.com
Ralph Kimball founded the Kimball Group. Since the mid-1980s, he has been the DW/BI industry’s thought leader on the dimensional approach and trained more than 20,000 students. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the first commercially-available workstation with a graphical user interface at Xerox’s Palo Alto Research Center (PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University.
11. info@rittmanmead.com www.rittmanmead.com @rittmanmead
The Kimball 34 Subsystems of ETL
11
• Cleaning and Conforming Data
- Data Cleansing System
- Error Event Schema
- Audit Dimension Assembler
- Deduplication System
- Conforming System
12. info@rittmanmead.com www.rittmanmead.com @rittmanmead
The Kimball 34 Subsystems of ETL
12
• Delivering Data for Presentation
- Slowly Changing Dimension Manager
- Surrogate Key Generator
- Hierarchy Manager
- Special Dimensions Manager
- Fact Table Builders
- Surrogate Key Pipeline
- Late Arriving Data Handler
- Multi-Valued Dimension Bridge Table Builder
- Dimension Manager System
- Fact Provider System
- Aggregate Builder
- OLAP Cube Builder
- Data Propagation Manager
13. info@rittmanmead.com www.rittmanmead.com @rittmanmead
The Kimball 34 Subsystems of ETL
13
• Managing the ETL Environment
- Job Scheduler
- Backup System
- Recovery and Restart System
- Version Control System
- Version Migration System
- Workflow Monitor
- Sorting System
- Lineage & Dependency Analyzer
- Problem Escalation System
- Parallelizing / Pipelining System
- Security System
- Compliance Manager
- Metadata Repository Manager
32. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Extracting Data - Oracle Data Integrator
21
• Extract from many different systems? Yes!
- Multiple technologies OOTB
- Custom technologies can be added
• Data Server - connection to the data source
- Physical Schema
- Logical Schema
41. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Extracting Data - Changed Data Only
24
• Change Data Capture
- Extract only the changed data since the last ETL extract
• Methods
- Audit columns
- Timed extract
- Full “diff compare”
- Database log scraping
74. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Delivering Data
37
• Slowly Changing Dimension Manager
• Surrogate Key Generator
• Hierarchy Manager
• Special Dimensions Manager
• Fact Table Builders
• Surrogate Key Pipeline
• Late Arriving Data Handler
• Multi-Valued Dimension Bridge Table Builder
• Dimension Manager System
• Fact Provider System
• Aggregate Builder
• OLAP Cube Builder
• Data Propagation Manager
75. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Delivering Data
38
• Slowly Changing Dimension Manager
- ODI Integration Knowledge Module
- Set SCD behavior type for each target column
• Surrogate Key Generator
- Database Sequence objects and ODI Sequences
• Fact Table Builder
- Lookups in ODI
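To illustrate the sequence-based surrogate key and lookup patterns named on this slide, here is a minimal sketch of the kind of SQL an integration process might issue, again via python-oracledb. This is not the SQL an ODI Knowledge Module actually generates; the DIM_CUSTOMER, STG_CUSTOMER, STG_SALES, and FACT_SALES tables and the CUSTOMER_SK_SEQ sequence are illustrative names only.

import oracledb

conn = oracledb.connect(user="dw", password="dw", dsn="localhost/XEPDB1")
cur = conn.cursor()

# New customer rows get their surrogate key from a database sequence;
# the natural key CUSTOMER_CODE is used to detect existing members
cur.execute("""
    INSERT INTO dim_customer (customer_sk, customer_code, customer_name, effective_date)
    SELECT customer_sk_seq.NEXTVAL, s.customer_code, s.customer_name, SYSDATE
    FROM   stg_customer s
    WHERE  NOT EXISTS (SELECT 1 FROM dim_customer d
                       WHERE  d.customer_code = s.customer_code)""")

# The fact load then resolves surrogate keys with a lookup (join) on the
# natural key, the "Fact Table Builder" lookup pattern from the slide
cur.execute("""
    INSERT INTO fact_sales (customer_sk, order_date, amount)
    SELECT d.customer_sk, s.order_date, s.amount
    FROM   stg_sales s
    JOIN   dim_customer d ON d.customer_code = s.customer_code""")
conn.commit()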
86. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Managing the ETL Environment
43
• Job Scheduler
• Backup System
• Recovery and Restart System
• Version Control System
• Version Migration System
• Workflow Monitor
• Sorting System
• Lineage & Dependency Analyzer
• Problem Escalation System
• Parallelizing / Pipelining System
• Security System
• Compliance Manager
• Metadata Repository Manager
88. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Managing the ETL Environment - Job Scheduler
45
• Alternative to ODI scheduler - external scheduling tool
- ODI Scenarios and Load Plans can be executed via command line script or web service
./startloadplan.sh LOAD_EDW GLOBAL 6 -AGENT_URL=http://localhost:20910/oraclediagent
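In that command, following the standard startloadplan.sh usage, LOAD_EDW is the load plan name, GLOBAL is the ODI context, 6 is the session log level, and -AGENT_URL identifies the agent that will execute the plan.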
95. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Where did we end up?
49
• The Kimball ETL Subsystems will guide your data warehouse program
• Oracle Data Integration can help you fully implement the ETL Subsystems
- Extract, Load, Transform with ODI and GoldenGate
- Profile and cleanse data with Enterprise Data Quality
99. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Rittman Mead at KScope16
53
Oracle GoldenGate and Apache Kafka: A Deep Dive into Real-Time Data Streaming
Michael Rainey | Monday Jun 27, 4:30pm | Level 2 - Missouri
Free-Form Data Visualizations: First Session
Charles Elliott | Tuesday Jun 28, 8:30am | Level 2 - Superior A
Lunch & Learn: BI and Data Warehousing
Michael Rainey | Tuesday Jun 28, 12:45pm | Ballroom Level - Sheraton II
Lunch & Learn: Big Data and Advanced Analytics
Mark Rittman | Tuesday Jun 28, 12:45pm | Ballroom Level - Sheraton III
OBIEE 12c and Essbase: What’s New for Integration and Reporting Against EPM Sources
Mark Rittman | Wednesday Jun 29, 10:15am | Ballroom Level - Sheraton III
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
Michael Rainey | Wednesday Jun 29, 11:30am | Level 2 - Mayfair
How to Brand and Own Your OBIEE Interface: Past, Present, and Future
Andy Rocha & Pete Tamisin | Wednesday Jun 29, 2:00pm | Ballroom Level - Sheraton III
Free-Form Data Visualizations: Second Session
Charles Elliott | Wednesday Jun 29, 2:00pm | Level 2 - Superior A
Oracle Big Data Discovery: Extending into Machine Learning and Advanced Visualizations
Mark Rittman | Wednesday Jun 29, 3:15pm | Level 2 - Missouri