EPM environments are generally supported by a Data Warehouse; however, we often see that those DWs are not optimized for the EPM tools. Over the years, we have seen that modeling a DW with the EPM tools in mind can greatly improve overall architecture performance.
The most common situation found in many projects is that the people who develop the data warehouse do not have deep knowledge of EPM tools, and vice versa. This can create a large gap between the two worlds, which may severely impact performance.
This session will show a range of techniques for modeling the right Data Warehouse for EPM tools. We will discuss how to improve performance using partitioned tables, how to create hierarchical queries with CONNECT BY PRIOR, the correct way to use multi-period tables for block data loads using Pivot/Unpivot, and more. And if you want to go even further, we will show you how to leverage all of these techniques with ODI, creating the perfect mix to run any process between your DW and EPM environments.
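As a rough illustration of two of the techniques mentioned above, the sketch below builds an Oracle CONNECT BY PRIOR hierarchy query and an UNPIVOT over a multi-period table. The table and column names (dim_entity, fact_periods, jan/feb/mar) are hypothetical placeholders, not the schemas used in the session.

```python
# Illustrative only: dim_entity and fact_periods are made-up tables; adjust
# names and columns to your own data warehouse.

# Hierarchical query: walk a parent/child dimension table top-down with
# Oracle's CONNECT BY PRIOR, producing the member path EPM tools expect.
hierarchy_sql = """
SELECT LEVEL AS depth,
       member_name,
       parent_name,
       SYS_CONNECT_BY_PATH(member_name, '/') AS member_path
FROM   dim_entity
START WITH parent_name IS NULL
CONNECT BY PRIOR member_name = parent_name
"""

# Multi-period fact table: turn period columns into rows with UNPIVOT, which
# matches the one-row-per-period layout used for Essbase block data loads.
unpivot_sql = """
SELECT account, entity, period, amount
FROM   fact_periods
UNPIVOT (amount FOR period IN (jan AS 'Jan', feb AS 'Feb', mar AS 'Mar'))
"""

if __name__ == "__main__":
    # Printed here for clarity; in practice the statements would run through
    # your Oracle client or be embedded in an ODI mapping.
    print(hierarchy_sql)
    print(unpivot_sql)
```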
Essbase Statistics DW: How to Automatically Administrate Essbase Using ODI - Rodrigo Radtke de Souza
To keep an Essbase cube performing well, we must stay vigilant and track its growth and data movements so we can distribute caches and adjust database parameters accordingly. But this is a very difficult task, since Essbase statistics are not temporal and only show the state of the cube at that specific point in time.
This session will present how ODI can be used to create a historical statistical DW containing Essbase cube information, and how to identify trends and patterns, giving us the ability to programmatically tune our Essbase databases automatically.
Incredible ODI tips to work with Hyperion tools that you ever wanted to know - Rodrigo Radtke de Souza
ODI is an incredible and flexible development tool that goes beyond simple data integration. But most of its development power comes from outside-the-box ideas.
* Did you ever want to dynamically run any number of “OS” commands using a single ODI component?
* Did you ever want to have only one data store and loop different sources without the need of different ODI contexts?
* Did you ever want to have only one interface and loop any number of ODI objects with a lot of control?
* Did you ever need to have a “third command tab” in your procedures or KMs to improve ODI powers?
* Do you still use an old version of ODI and miss a way to know the values of the variables in a scenario execution?
* Did you know ODI has four “substitution tags”? And do you know how useful they are?
* Do you use “dynamic variables” and know how powerful they can be?
* Do you know how to have control over your ODI priority jobs automatically (stop, start, and restart scenarios)?
In a fast-moving business environment, finance leaders are successfully leveraging technology advancements to transform their finance organizations and generate value for the business.
Oracle’s Enterprise Performance Management (EPM) applications are an integrated, modular suite that supports a broad range of strategic and financial performance management tools that help businesses unlock their potential.
Dell’s global financial environment contains over 10,000 users around the world and relies on a range of EPM tools such as Hyperion Planning, Essbase, Smart View, DRM, and ODI to meet its needs.
This session shows the complexity of this environment, describing the relationships between those tools, the techniques used to keep such a large environment in sync, and how the varied needs of different businesses and regulations around the world are met to create a complete and powerful business decision engine that takes Dell to the next level.
Apache Apex and Apache Geode are two of the most promising incubating open source projects. Combined, they promise to fill gaps in existing big data analytics platforms. Apache Apex is an enterprise-grade, native YARN, big-data-in-motion platform that unifies stream and batch processing. Apex is highly scalable, performant, fault tolerant, and strong in operability. Apache Geode provides a database-like consistency model, reliable transaction processing, and a shared-nothing architecture to maintain very low latency with high concurrency. We will also look at some use cases showing how these two projects can be used together to form a distributed, fault-tolerant, reliable in-memory data processing layer.
Powering Interactive Data Analysis at Pinterest by Amazon Redshift - Jie Li
In the last six months, we have set up Amazon Redshift to power our interactive data analysis at Pinterest. It has tremendously improved the speed of analyzing our data.
Cloudera Impala: A Modern SQL Engine for Hadoop - Cloudera, Inc.
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
The document discusses building a data platform for analytics in Azure. It outlines common issues with traditional data warehouse architectures and recommends building a data lake approach using Azure Synapse Analytics. The key elements include ingesting raw data from various sources into landing zones, creating a raw layer using file formats like Parquet, building star schemas in dedicated SQL pools or Spark tables, implementing alerting using Log Analytics, and loading data into Power BI. Building the platform with Python pipelines, notebooks, and GitHub integration is emphasized for flexibility, testability and collaboration.
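A minimal PySpark sketch of the raw-layer step described above, assuming hypothetical landing and raw container paths (the deck's actual storage accounts and pipeline code are not shown here):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-layer").getOrCreate()

# Placeholder paths for a landing zone and the Parquet raw layer.
landing_path = "abfss://landing@examplelake.dfs.core.windows.net/sales/2024-06-01/"
raw_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/"

# Read the raw CSV drop from the landing zone.
df = spark.read.option("header", "true").csv(landing_path)

# Stamp the load date and write Parquet, partitioned so later star-schema
# builds in dedicated SQL pools or Spark tables can prune by date.
(df.withColumn("load_date", F.lit("2024-06-01"))
   .write.mode("append")
   .partitionBy("load_date")
   .parquet(raw_path))
```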
Presentations from the Cloudera Impala meetup on Aug 20, 2013 - Cloudera, Inc.
Presentations from the Cloudera Impala meetup on Aug 20 2013:
- Nong Li on Parquet+Impala and UDF support
- Henry Robinson on performance tuning for Impala
PandasUDFs: One Weird Trick to Scaled Ensembles - Databricks
When I was tasked with improving our predictions of when customers were likely to purchase in a category, I ran into a problem – we had one model that was trying to predict everything from milk and eggs to batteries and tea. I was able to improve our predictions by creating category-specific models, but how could I possibly handle every category we had?
Turns out, PandasUDFs were my One Weird Trick to solving this problem and many others. By using them, I was able to take already-written development code, add a function decorator, and scale my analysis to every category with minimal effort. 10-hour runtimes finished in 30 minutes. You too can use this One Weird Trick to scale from one model to whole ensembles of models. A minimal sketch of the grouped-map pattern follows the topic list below.
Topics covered will include:
General outline of use and fitting in your workflows
Types of PandasUDFs
The Ser/De limit and how to work around it
Equivalents in R and Koalas
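For illustration, here is a minimal sketch of the grouped-map pattern the talk refers to, written against Spark 3's applyInPandas API (older Spark versions use the @pandas_udf decorator with PandasUDFType.GROUPED_MAP). The tiny dataset and linear model are hypothetical stand-ins for the speaker's category-purchase models:

```python
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.appName("per-category-models").getOrCreate()

# Hypothetical input: one row per (category, feature, label).
df = spark.createDataFrame(
    [("milk", 1.0, 2.0), ("milk", 2.0, 4.1),
     ("batteries", 1.0, 0.9), ("batteries", 2.0, 2.2)],
    ["category", "days_since_purchase", "purchase_propensity"],
)

def fit_one_category(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives all rows for a single category as a pandas DataFrame."""
    model = LinearRegression().fit(
        pdf[["days_since_purchase"]], pdf["purchase_propensity"]
    )
    pdf["prediction"] = model.predict(pdf[["days_since_purchase"]])
    return pdf

# Grouped-map pandas UDF: Spark runs fit_one_category once per category, in
# parallel, which is the "one model per category" scaling trick.
result = df.groupBy("category").applyInPandas(
    fit_one_category,
    schema="category string, days_since_purchase double, "
           "purchase_propensity double, prediction double",
)
result.show()
```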
Near Real-Time Data Analysis With FlyData - FlyData Inc.
This document describes our products. FlyData makes it easy to load data automatically and continuously into Amazon Redshift. You can also refer to our homepage (http://flydata.com/) for more information.
Friction-free ETL: Automating data transformation with Impala | Strata + Hado... - Cloudera, Inc.
Speaker: Marcel Kornacker
As data is ingested into Apache Hadoop at an increasing rate from a diverse range of data sources, it is becoming more and more important for users that new data be accessible for analysis as quickly as possible—because “data freshness” can have a direct impact on business results.
In the traditional ETL process, raw data is transformed from the source into a target schema, possibly requiring flattening and condensing, and then loaded into an MPP DBMS. However, this approach has multiple drawbacks that make it unsuitable for real-time, “at-source” analytics—for example, the “ETL lag” reduces data freshness, and the inherent complexity of the process makes it costly to deploy and maintain, and reduces the speed at which new analytic applications can be introduced.
In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics—including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines. This performance level is attained in spite of the most challenging and diverse input formats, which are addressed through an automated background conversion process into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem.
In this talk, attendees will learn about Impala’s upcoming features that will enable at-source analytics: support for nested structures such as JSON and XML documents, which allows direct querying of the source schema; automated background file format conversion into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem; and automated creation of declaratively-specified derived data for simplified data cleansing and transformation.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
Magnet Shuffle Service: Push-based Shuffle at LinkedIn - Databricks
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, is processing PBs of data and billions of blocks daily in our clusters. With such a rapid increase of Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workloads efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
AWS July Webinar Series: Amazon Redshift migration and load data 20150722 - Amazon Web Services
Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data for a fraction of the cost of traditional data warehouses.
In this webinar, you will learn how to easily migrate your data from other data warehouses into Amazon Redshift, efficiently load your data with Amazon Redshift's massively parallel processing (MPP) capabilities, and automate data loading with AWS Lambda and AWS Data Pipeline. You will also learn about ETL tools from our partners to extract, transform, and prepare data from disparate data sources before loading it into Amazon Redshift.
Learning Objectives:
Understand common patterns for migrating your data to Amazon Redshift
See live examples of the COPY command that fully parallelizes data ingestion (a minimal COPY sketch appears after this list)
Learn how to automate the load process using AWS Lambda & AWS Data Pipeline
Techniques for real-time data loading
Options for ETL tools from our partners
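As a minimal illustration of the COPY objective above, the sketch below composes a Redshift COPY statement that loads gzipped CSV from S3 in parallel across slices; the table, bucket, and IAM role are placeholders, not values from the webinar:

```python
# Hypothetical table, bucket, and IAM role; replace with your own.
copy_sql = """
COPY analytics.page_views
FROM 's3://example-bucket/page_views/2015/07/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP
TIMEFORMAT 'auto';
"""

if __name__ == "__main__":
    # In a Lambda-triggered load, this string would be executed through a
    # Redshift connection (for example psycopg2); printed here for clarity.
    print(copy_sql)
```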
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift - Amazon Web Services
Learn how Boingo Wireless and online media provider Edmunds gained substantial business insights and saved money and time by migrating to Amazon Redshift. Get an inside look into how they accomplished their migration from on-premises solutions. Learn how they tuned their schema and queries to take full advantage of the columnar MPP architecture in Amazon Redshift, how they leveraged third party solutions, and how they met their business intelligence needs in record time.
The document summarizes strengths and weaknesses of Cloudera Impala. Key strengths include excellent performance for analytical queries over large datasets, SQL compliance, and integration with Hadoop ecosystem. Weaknesses are slow random access, lack of fault tolerance, tedious data updating process, and memory intensive queries. The conclusion is that Impala is well-suited for analytics on immutable data but not for workloads with frequent updates.
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics - Amazon Web Services
Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data with existing BI tools for a fraction of the cost of traditional data warehouses.
This webinar will familiarize you with reporting, visualization, and business intelligence options for your Amazon Redshift data warehouse. You will learn how to effectively use existing BI tools and SQL clients with your Amazon Redshift data warehouse, as well as techniques for performing advanced analytics.
Learning Objectives:
Options for processing, analyzing, and visualizing data in Amazon Redshift
Extending the Amazon Redshift SQL query capabilities
Optimizing query performance with Redshift ODBC / JDBC driver
Overview of BI solutions from our partners
This document discusses connecting Hadoop and Oracle databases. It introduces the author Tanel Poder and his expertise in databases and big data. It then covers tools like Sqoop that can be used to load data between Hadoop and Oracle databases. It also discusses using query offloading to query Hadoop data directly from Oracle as if it were in an Oracle database.
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ... - Amazon Web Services
Since Amazon Redshift launched last year, it has been adopted by a wide variety of companies for data warehousing. In this session, learn how customers NASDAQ, HauteLook, and Roundarch Isobar are taking advantage of Amazon Redshift for three unique use cases: enterprise, big data, and SaaS. Learn about their implementations and how they made data analysis faster, cheaper, and easier with Amazon Redshift.
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan... - Databricks
At Spark Summit 2017, we described our framework to migrate production Hive workload to Spark with minimal user intervention. After a year of migration, Spark now powers an important part of our batch processing workload. The migration framework supports syntax compatibility analysis, offline/online shadowing, and data validation.
In this session, we first introduce new features and improvements in the migration framework to support bucketed tables and increase automation. Next, we will deep dive into the top technical challenges we encountered and how we addressed them. We improved the syntax compatibility between Hive and Spark from around 51% to 85% by identifying and developing top missing features, fixing incompatible UDFs, and implementing a UDF testing framework. In addition, we developed reliable join operators to improve Spark stability in production when leveraging optimizations such as ShuffledHashJoin.
Finally, we will share an update on our overall migration effort and examples of migrations wins. For example, we were able to migrate one of the most complicated workloads in Facebook from Hive to Spark with more than 2.5X performance gain.
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd... - Amazon Web Services
Optimising Your Amazon Redshift Cluster for Peak Performance
In this session we take an in-depth look at the latest features in Amazon Redshift, including analysing data stored in and outside of your cluster with Amazon Redshift Spectrum, query and platform enhancements, and more. We will dive deep into best practices on how to design optimal schemas, load data efficiently, and optimise your queries to deliver high throughput and performance.
Eric Ferreira, Principal Database Engineer, Amazon Web Services
Best Practices for Migrating your Data Warehouse to Amazon Redshift - Amazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process.
Impala is an open-source SQL query engine for Apache Hadoop that allows for fast, interactive queries directly against data stored in HDFS and other data storage systems. It provides low-latency queries in seconds by using a custom query engine instead of MapReduce. Impala allows users to interact with data using standard SQL and business intelligence tools while leveraging existing metadata in Hadoop. It is designed to be integrated with the Hadoop ecosystem for distributed, fault-tolerant and scalable data processing and analytics.
How We Optimize Spark SQL Jobs With parallel and sync IO - Databricks
Although NVMe has become more and more popular in recent years, large numbers of HDDs are still widely used in super-large-scale big data clusters. In an EB-level data platform, IO cost (including decompression and decoding) contributes a large proportion of Spark jobs' cost. In other words, IO operations are worth optimizing.
At ByteDance, we did a series of IO optimizations to improve performance, including parallel read and asynchronous shuffle. First, we implemented file-level parallel read to improve performance when there are many small files. Second, we designed row-group-level parallel read to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. In addition, we designed the Parquet column family, which splits a table into a few column families stored in different Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%.
In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.
Oracle Cloud offerings, including Planning and Budgeting Cloud Service (PBCS), enable companies to focus on their own business instead of spending money and resources on maintaining large IT infrastructures. They also give them the possibility to be connected 24x7 from any place in the world.
But what happens if a company already has an ODI on-premise infrastructure and wants to integrate the new PBCS with it? Can we use our existing on-premise ODI? How hard is it to accomplish this?
This session will show how to use your ODI on-premise to integrate and orchestrate your PBCS seamlessly.
Relational data modeling trends for transactional applications - Ike Ellis
This document provides a summary of Ike Ellis's presentation on data modeling priorities and design patterns for transactional applications. The presentation discusses how data modeling priorities have changed from focusing on writes and normalization to emphasizing reads, flexibility, and performance. It outlines several current design priorities including optimizing the schema for reads, making it easy to change and discoverable, and designing for the network instead of the disk. The presentation concludes with practicing modeling data for example transactional applications like a blog, online store, and refrigeration trucks.
No more unknown members! Smart data load validation for Hyperion Planning usi... - Rodrigo Radtke de Souza
Usually, ODI data load interfaces for Essbase are simple and fast to develop. But, depending on the quality of the data source, those interfaces may become a performance challenge. Essbase demands that all POV members for which we are trying to insert data exist in the Essbase outline, and when this is not true, Essbase switches its load method from Block mode to Cell mode. When this happens, a data load that would take only five minutes to complete may take several hours, degrading Hyperion environment performance. Join us in this session to discover how we solved this problem at Dell in a dynamic way, for any number of Hyperion Planning applications, using only ODI data constraints and the Hyperion Planning metadata repository to validate all POV members used in the data load, guaranteeing the best performance and data quality in the Hyperion Planning environment.
This document discusses using Oracle Data Integrator (ODI) to validate data against Hyperion Planning metadata before loading the data into Essbase cubes. It proposes using a single generic inbound table in ODI to hold data for multiple Planning applications. ODI constraints would validate the data against Planning repositories to ensure only valid members are loaded to Essbase. This prevents slow cell-by-cell loads and allows adding new Planning applications easily with minimal ODI changes.
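A rough sketch of the kind of validation query such an ODI constraint could run is shown below. It assumes the Planning repository exposes its members through an HSP_OBJECT table; treat the schema, table, and column names as assumptions to confirm against your own environment.

```python
# Hypothetical inbound table and Planning repository schema; the real ODI
# constraint definitions from the talk are not reproduced here.
validation_sql = """
SELECT src.*
FROM   inbound_planning_data src
WHERE  NOT EXISTS (
         SELECT 1
         FROM   planapp.hsp_object o
         WHERE  UPPER(o.object_name) = UPPER(src.entity)
       )
"""

if __name__ == "__main__":
    # Rows returned here carry unknown members: ODI would divert them to an
    # error table instead of sending them to Essbase in slow Cell mode.
    print(validation_sql)
```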
Webinar: Successful Data Migration to Microsoft Dynamics 365 CRM | InSync - APPSeCONNECT
This #Webinar will cover everything you should know to prepare for a successful CRM data migration. Understand the intricacies of data and its importance in your organization, and explore the possibilities of successful data migration to your Microsoft Dynamics CRM platform.
A Customer Relationship Management (CRM) solution is an essential component of a business, as it captures all the details of customers and their journey. But a CRM is never functional without data! That is why moving data from one system to another is essential to set up a new system that uses the data already in the current system(s). This is a must for organizations that want to nurture and help their customers grow.
Data Migration can be a complex and cumbersome process, more complex than people realize, but with a solid strategy in place, it can help organizations seamlessly transfer data from one system to another.
Most data migration solutions only transfer master data, but transactional data is just as valuable, and the right solution and tools can manage that as well. While you need to consider data sources, data fields, and other aspects while migrating data to Microsoft Dynamics CRM, this webinar will help you learn about the correct approach, best practices, and actions involved during the process.
#MSDyn365 #MSDynCRM
The key points to be covered in the webinar are:
- Introduction to Data Migration
- A Guide to Prepare Templates
- Ways to do Data Cleaning
- Options for Data Import
- How to do Data Verification
- Successfully Migrating Data to Dynamics 365 CRM
If you are planning to employ Microsoft Dynamics 365 CRM in your organization, this webinar will help you strategize about CRM data migration and plan for a seamless experience.
Start your #DataMigration today: https://insync.co.in/data-migration/
Are we there Yet?? (The long journey of Migrating from close source to opens... - Marco Tusa
Migrating from Oracle to MySQL or another open source RDBMS like Postgres is not as straightforward as many think if not well guided. See what it means to do it with someone who has already done it.
Integrating Oracle Data Integrator with Oracle GoldenGate 12c - Edelweiss Kammermann
The document discusses integrating Oracle Data Integrator (ODI) with Oracle GoldenGate (OGG) for real-time data integration. It describes how OGG captures change data from source systems and delivers it to ODI. Key steps include configuring OGG installations and JAgents, defining OGG data servers in ODI, applying journalizing to ODI models, and creating and starting ODI processes that integrate with the OGG capture and delivery processes. The integration provides benefits like low impact on sources, great performance for real-time integration, and support for heterogeneous databases.
The document discusses developing dynamic integrations for loading metadata from multiple Hyperion Planning applications. It describes the default ODI development process which involves separate interfaces for each dimension. The motivation is to create a more flexible solution that can load any number of applications and dimensions without rewriting interfaces. Key aspects of the dynamic solution include gathering metadata from various sources into common tables, comparing to existing data to identify changes, and using dynamic options in the ODI integration to specify which application and dimension to load.
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Though sometimes used loosely, partly due to a lack of formal definition, the best interpretation is that it is a large body of information that cannot be comprehended when used in small amounts only.
Are you a young professional who just got out of college and unsure which career path to follow? Are you thinking about changing your career to something completely new and looking for options? Either way, this webinar is the right one for you. It’s the first in a series that the new ODTUG Career Track Community will bring you to show what Oracle careers look like and where/how to start with them.
During this webinar, we will talk about what an ETL developer career looks like, what the expectations are, challenges, rewards, and which steps are needed to be successful. We will discuss a wide range of topics, such as tools used on the job, certification paths, the importance of social media, user groups, and more. This webinar will be presented by Rodrigo Radtke de Souza, who has been working in the Oracle ETL world for quite some time now and has achieved great accomplishments as an ETL developer, such as Oracle ACE nomination, frequent Kscope speaker, ODTUG Leadership Program participant, and a successful career at Dell.
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation - Inside Analysis
The Briefing Room with Dr. Robin Bloor and Teradata
Live Webcast on May 20, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=f09e84f88e4ca6e0a9179c9a9e930b82
Traditional data warehouses have been the backbone of corporate decision making for over three decades. With the emergence of Big Data and popular technologies like open-source Apache™ Hadoop®, some analysts question the lifespan of the data warehouse and the future role it will play in enterprise information management. But it’s not practical to believe that emerging technologies provide a wholesale replacement of existing technologies and corporate investments in data management. Rather, a better approach is for new innovations and technologies to complement and build upon existing solutions.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains where tomorrow’s data warehouse fits in the information landscape. He’ll be briefed by Imad Birouty of Teradata, who will highlight the ways in which his company is evolving to meet the challenges presented by different types of data and applications. He will also tout Teradata’s recently-announced Teradata® Database 15 and Teradata® QueryGrid™, an analytics platform that enables data processing across the enterprise.
Visit InsideAnalysis.com for more information.
Data Warehouse:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support.
Extraction:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as little resources as possible.
Data Transformation:
Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse.
Data Loading:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. To make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
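As a small, hedged example of that load-step advice, the sketch below lists Oracle-style statements to disable a constraint and mark an index unusable before a bulk load, then rebuild and re-enable them afterwards; the object names are hypothetical, and other databases use different statements for the same idea.

```python
# Hypothetical fact table, constraint, and index names.
pre_load = [
    "ALTER TABLE dw.fact_sales DISABLE CONSTRAINT fk_fact_sales_customer",
    "ALTER INDEX dw.ix_fact_sales_date UNUSABLE",
]
post_load = [
    "ALTER INDEX dw.ix_fact_sales_date REBUILD",
    "ALTER TABLE dw.fact_sales ENABLE CONSTRAINT fk_fact_sales_customer",
]

if __name__ == "__main__":
    # The ETL tool runs the pre-load list, performs the bulk load, then runs
    # the post-load list so integrity is re-checked only once, at the end.
    for stmt in pre_load + post_load:
        print(stmt + ";")
```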
Data Migration Done Right for Microsoft Dynamics 365/CRM - Daniel Cai
This is the deck that I presented in CRMUG Summit 2017 and Collaborate Canada across the country in November 2017 in Montreal, Toronto, Calgary and Vancouver. This presentation is about how to get legacy data into Microsoft Dynamics 365/CRM efficiently with the help of right tools and right strategies. Getting the migration done can be challenging due to many platform reasons. In this presentation I discuss some typical pitfalls in CRM migration projects, I also discuss some best practices that can be used in CRM migration or integration projects. Hope this is helpful.
ETL is a process that extracts data from multiple sources, transforms it to fit operational needs, and loads it into a data warehouse or other destination system. It migrates, converts, and transforms data to make it accessible for business analysis. The ETL process extracts raw data, transforms it by cleaning, consolidating, and formatting the data, and loads the transformed data into the target data warehouse or data marts.
Want to know more about the Common Data Model and Common Data Service? Do you need to understand the difference between CDS for Apps and CDS for Analytics? Feel free to use these slides and send me your feedback.
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Transform your DBMS to drive engagement innovation with Big Data - Ashnikbiz
This document discusses how organizations can save money on database management systems (DBMS) by moving from expensive commercial DBMS to more affordable open-source options like PostgreSQL. It notes that PostgreSQL has matured and can now handle mission critical workloads. The document recommends partnering with EnterpriseDB to take advantage of their commercial support and features for PostgreSQL. It highlights how customers have seen cost savings of 35-80% by switching to PostgreSQL and been able to reallocate funds to new business initiatives.
Key Methodologies for Migrating from Oracle to Postgres - EDB
This presentation reviews the key methodologies that all members of your team should consider, before planning a migration from Oracle to Postgres including:
• Prioritizing the right application or project for your first Oracle migration
• Planning a well-defined, phased migration process to minimize risk and increase time to value
• Handling common concerns and pitfalls related to a migration project
• Leveraging resources before, during, and after your migration
• Becoming independent from an Oracle database – without sacrificing performance
With EDB Postgres’ database compatibility for Oracle, it is easy to migrate from your existing Oracle databases. The compatibility feature set includes compatibility for PL/SQL, Oracle’s SQL syntax, and built in SQL functions. This means that many applications can be easily migrated over to EDB Postgres. It also allows you to continue using your existing Oracle skills.
For more information please contact us at sales@enterprisedb.com
Designing dashboards for performance shridhar wip 040613 - Mrunal Shridhar
Session from EUTCC13 London on Designing Dashboards for Performance. The session provides tips, tricks, and skills on how to improve performance of your visual analysis.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
State of Artificial Intelligence Report 2023 - kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
5. EPM Tools
• The architecture of EPM applications is very similar, so for simplicity this presentation uses Planning/Essbase as the example
• Three main processes that an EPM application may have:
• Metadata process: syncs the metadata between the source system and the EPM applications
• Data Load process: loads the data into the EPM applications
• Data Extract process: extracts data from the EPM applications
• Normally these processes are done manually, or with a script that loads a text file or SQL query with all the data/metadata into the EPM application
6. Why is this not so good?
• Manual processes are always error prone
• Tons of files to load/manage
• Not centralized
• Not scalable for big environments
• Not change friendly
• Data quality issues
• Harder to achieve audit standards
• Not feasible for huge volumes of data
All of this can be fixed by creating a supporting Data Warehouse (DW)
7. Traditional Data Warehouse
• The DW should be implemented in a relational database (RDBMS), since relational databases are more suitable for the central Data Warehouse role than multidimensional databases (OLAP servers)
• The data model for the DW should be based on a dimensional design (Star Schema, Snowflake or Hybrid) to facilitate integration and scalability and to provide greater performance for analytical processing
• No matter whether it is Star Schema, Snowflake or Hybrid, all models are based on dimension tables that join a fact table through PKs and FKs, as in the sketch below
• The DW can then provide data directly to other systems such as EPM applications
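As a quick illustration of that join pattern only, here is a minimal sketch with hypothetical table and column names (not taken from the presentation):

-- Hypothetical star schema: dimension tables join the fact table through PKs/FKs
SELECT a.account_name,
       e.entity_name,
       SUM(f.amount) AS total_amount
  FROM fact_balance f
  JOIN dim_account  a ON a.account_key = f.account_key   -- FK on the fact -> PK on the dimension
  JOIN dim_entity   e ON e.entity_key  = f.entity_key
 GROUP BY a.account_name, e.entity_name;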
8. DW for EPM Applications
Traditional DW
• The data is spread over numerous tables
• The data is related between tables by PKs and FKs
• We can have different data in different tables that have no direct relationship
• We can query any table to get any data
• The metadata inside the tables has no meaning for the database (it's just data)
EPM Applications
• The data is confined inside a cube
• The data is directly related to the members of the dimensions
• It's impossible to have data that is not related to all dimensions
• To query, we must specify at least one member of each dimension
• The metadata has a parent/child relationship, has a specific order, and each member behaves according to its dimension type
9. DW for EPM Applications
• The problem is: a DW for EPM applications should be totally different from a traditional DW
• EPM is already a "DW" since it has all the dimensions in it and stores all the data inside the cubes
• We don't need a Star Schema, Snowflake or Hybrid model to manage dimensions inside EPM
• We can manage dimensions more efficiently using a "metadata repository"
• The relationship between the EPM apps and the outside systems is the members' POV, and this information is already inside our data table
• We don't need any PKs or FKs in our "metadata repository"
• We need to model our DW thinking about EPM concepts/needs
11. Metadata Process
• The first process needs to be the Metadata Process, since without the members in the EPM application we cannot load data into the cubes
• Depending on the EPM application and the dimension we want to load, we will have different properties and their values
• But for the whole EPM suite we always have the basic member information such as its parent, storage type, consolidation sign and more
• To create a good metadata process we need to design our table in the most efficient way, and for that we need to know what each EPM application requires
12. Metadata Process: Dimensions
• Planning/Essbase has 4 different types of dimensions:
• Account
• Entity
• User Defined Dimension
• Attribute Dimension
• Each dimension has its own properties, but most of them are the same
Account Dimension Entity Dimension User Defined Dimension Attribute Dimension
Member Member Member Member
Parent Parent Parent Parent
Alias: Default Alias: Default Alias: Default Alias: Default
Operation Operation Operation Operation
Valid For Consolidations Valid For Consolidations Valid For Consolidations
Data Storage Data Storage Data Storage
Two Pass Calculation Two Pass Calculation Two Pass Calculation
Description Description Description
Formula Formula Formula
UDA UDA UDA
Smart List Smart List Smart List
Data Type Data Type Data Type
Aggregation Aggregation Aggregation
Plan Type Plan Type Plan Type
Account Type
Time Balance
Skip Value
Exchange Rate Type
Variance Reporting
Source Plan Type
Base Currency
13. Metadata Process: Generic Table
• One table to "rule" them all (a DDL sketch follows the property table below)
• Instead of having one table per dimension, a generic table has one unique column for each source property (white)
• One extra column to identify where that member belongs (yellow)
• Any other useful information (orange)
Account Dimension Entity Dimension User Defined Dimension Attribute Dimension Metadata Table
Account Entity Products Prod_Attrib MEMBER
Parent Parent Parent Parent PARENT
Alias: Default Alias: Default Alias: Default Alias: Default ALIAS
Operation Operation Operation Operation OPERATION
Valid For Consolidations Valid For Consolidations Valid For Consolidations VALID_FOR_CONSOL
Data Storage Data Storage Data Storage DATASTORAGE
Two Pass Calculation Two Pass Calculation Two Pass Calculation TWOPASS_CALC
Description Description Description DESCRIPTION
Formula Formula Formula FORMULA
UDA UDA UDA UDA
Smart List Smart List Smart List SMARTLIST
Data Type Data Type Data Type DATA_TYPE
Aggregation Aggregation Aggregation CONS_PLAN_TYPE1
Plan Type Plan Type Plan Type PLAN_TYPE1
Account Type ACCOUNT_TYPE
Time Balance TIME_BALANCE
Skip Value SKIP_VALUE
Exchange Rate Type EXC_RATE
Variance Reporting VARIANCE_REP
Source Plan Type SRC_PLAN_TYPE
Base Currency CURRENCY
APP_NAME
DIM_TYPE
HIER_NAME
GENERATION
HAS_CHILDREN
POSITION
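A rough DDL sketch of this generic metadata table: the column names follow the slide, while data types, sizes and the partitioning clause are assumptions rather than the presenters' exact design.

-- Hypothetical generic metadata repository: one row per member, for any application/dimension
CREATE TABLE dw_epm_metadata (
  app_name         VARCHAR2(80)  NOT NULL,  -- which EPM application the member belongs to
  dim_type         VARCHAR2(30)  NOT NULL,  -- Account, Entity, User Defined, Attribute
  hier_name        VARCHAR2(80)  NOT NULL,  -- dimension/hierarchy the member sits in
  member           VARCHAR2(80)  NOT NULL,
  parent           VARCHAR2(80),
  alias            VARCHAR2(80),
  operation        VARCHAR2(30),
  valid_for_consol VARCHAR2(255),
  datastorage      VARCHAR2(30),
  twopass_calc     VARCHAR2(10),
  description      VARCHAR2(255),
  formula          CLOB,
  uda              VARCHAR2(255),
  smartlist        VARCHAR2(80),
  data_type        VARCHAR2(30),
  cons_plan_type1  VARCHAR2(10),
  plan_type1       VARCHAR2(10),
  account_type     VARCHAR2(30),
  time_balance     VARCHAR2(30),
  skip_value       VARCHAR2(30),
  exc_rate         VARCHAR2(30),
  variance_rep     VARCHAR2(30),
  src_plan_type    VARCHAR2(30),
  currency         VARCHAR2(10),
  generation       NUMBER,
  has_children     VARCHAR2(1),
  position         NUMBER
)
PARTITION BY LIST (app_name) (
  PARTITION p_planapp VALUES ('PLANAPP')   -- one partition per application (PLANAPP is a hypothetical name)
);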
15. Metadata Process: Generic Table Benefits
• Centralized: a metadata repository that contains all metadata for all EPM applications
• Scalable: an architecture that can hold any amount of metadata without the need for changes
• Dynamic: we can use generic objects to load any number of EPM applications
• Accessible: all metadata from all EPM applications is easily available if needed (data quality, queries, as metadata to other systems…)
• Performance: the table can be partitioned by Application and/or Hierarchy
17. DW Powered by ODI: Metadata Process
• ODI can read the EPM application repositories to understand the structure and configuration of that application
• Based on the repository, ODI can create dynamic code (see the hierarchical-query sketch below)
• ODI can tie out metadata from the source based on the application repository
• The metadata load becomes more efficient and powerful, allowing better management of moved members, attribute member movement, reordering of sibling members, and deleted or moved shared members
• No extra code to add new applications/dimensions
• Complete details at https://devepm.com/2014/12/18/otnarchbeat-publication/
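As one example of the kind of dynamic code this enables, here is a minimal sketch of a hierarchical read of the metadata repository, assuming the hypothetical dw_epm_metadata table sketched earlier:

-- Walk the parent/child hierarchy of one dimension of one application, keeping outline order
SELECT LPAD(' ', 2 * (LEVEL - 1)) || member AS indented_member,
       parent,
       LEVEL AS generation
  FROM (SELECT member, parent, position
          FROM dw_epm_metadata
         WHERE app_name  = 'PLANAPP'        -- hypothetical application name
           AND hier_name = 'Entity')
 START WITH parent IS NULL
CONNECT BY PRIOR member = parent
 ORDER SIBLINGS BY position;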
18. Data Load Processes
• To load data into any EPM application we must specify one member for each dimension and the value we want to load
• Depending on the application we can have more or fewer dimensions, but by default there are some standard dimensions that exist in all apps:
• Accounts
• Entity
• Years
• Periods
• Scenario
• Version
• Currency
19. Data Load Processes: Inbound Tables
• We can create a single generic inbound table (fact table) that contains one column for each Planning dimension (the distinct set of all dimensions from all applications) to build a centralized structure to hold all data, as in the sketch below
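A minimal sketch of what such a generic inbound table could look like (names and types are assumptions):

-- Hypothetical generic inbound (fact) table: one column per dimension plus the data value
CREATE TABLE dw_epm_inbound (
  app_name VARCHAR2(80) NOT NULL,   -- target EPM application
  account  VARCHAR2(80) NOT NULL,
  entity   VARCHAR2(80) NOT NULL,
  years    VARCHAR2(10) NOT NULL,
  period   VARCHAR2(10) NOT NULL,
  scenario VARCHAR2(30) NOT NULL,
  version  VARCHAR2(30) NOT NULL,
  currency VARCHAR2(10) NOT NULL,
  custom1  VARCHAR2(80),            -- extra columns for application-specific dimensions
  custom2  VARCHAR2(80),
  data     NUMBER
);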
20. Data Load Processes: Multi-Periods
• We can go further in the inbound design and create one column for each period (a sketch follows this list)
• Smaller table (fewer rows) and faster to query
• Load performance is greatly improved (one row holds the entire year's information)
• In either case we have:
• A centralized repository of data (easy to add new applications)
• Scalable to all EPM applications
• Data is reusable (no data replication)
• Generic objects (to load, error handling, email sending…)
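A matching multi-period sketch (again with hypothetical names; one column per period instead of a single PERIOD/DATA pair):

-- Hypothetical multi-period inbound table: one row per POV, one column per period
CREATE TABLE dw_epm_inbound_mp (
  app_name VARCHAR2(80) NOT NULL,
  account  VARCHAR2(80) NOT NULL,
  entity   VARCHAR2(80) NOT NULL,
  years    VARCHAR2(10) NOT NULL,
  scenario VARCHAR2(30) NOT NULL,
  version  VARCHAR2(30) NOT NULL,
  currency VARCHAR2(10) NOT NULL,
  p_jan NUMBER, p_feb NUMBER, p_mar NUMBER, p_apr NUMBER,
  p_may NUMBER, p_jun NUMBER, p_jul NUMBER, p_aug NUMBER,
  p_sep NUMBER, p_oct NUMBER, p_nov NUMBER, p_dec NUMBER
);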
21. Data Load Processes: Pivot/Unpivot
• To use the multi-period architecture we need the ability to pivot and unpivot data
• Most source systems cannot provide data in multi-period format, nor receive it in that format
• For this we can use the PIVOT/UNPIVOT commands of the Oracle database
• The PIVOT operator takes data in separate rows, aggregates it and converts it into columns
• The UNPIVOT operator converts column-based data into separate rows
22. Data Load Processes: Pivot
1. Define the columns to be pivoted
2. Use an aggregation function on the data column
1. SUM, AVG, MIN, MAX, COUNT…
2. Specify the data to be pivoted
3. The data MUST be a constant in the "IN" clause
3. Data is pivoted
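A hedged example of those steps, using the hypothetical tables sketched earlier (the PIVOT syntax is standard Oracle; the table and column names are ours):

-- Turn period rows from the generic inbound table into one multi-period row per POV
SELECT *
  FROM (SELECT account, entity, years, scenario, version, currency, period, data
          FROM dw_epm_inbound)
 PIVOT (SUM(data)                                  -- aggregation function on the data column
        FOR period IN ('Jan' AS p_jan, 'Feb' AS p_feb, 'Mar' AS p_mar,
                       'Apr' AS p_apr, 'May' AS p_may, 'Jun' AS p_jun,
                       'Jul' AS p_jul, 'Aug' AS p_aug, 'Sep' AS p_sep,
                       'Oct' AS p_oct, 'Nov' AS p_nov, 'Dec' AS p_dec));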
23. Data Load Processes: Unpivot
1. Define the columns to be unpivoted
2. Select a name for the data column and for the member column
1. Specify the data to be unpivoted
2. The data MUST be a constant in the "IN" clause
3. Data is unpivoted
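And the reverse direction, again as a sketch over the hypothetical multi-period table:

-- Turn the multi-period columns back into one row per period
SELECT app_name, account, entity, years, scenario, version, currency, period, data
  FROM dw_epm_inbound_mp
 UNPIVOT (data                                     -- name chosen for the value column
          FOR period IN (p_jan AS 'Jan', p_feb AS 'Feb', p_mar AS 'Mar',
                         p_apr AS 'Apr', p_may AS 'May', p_jun AS 'Jun',
                         p_jul AS 'Jul', p_aug AS 'Aug', p_sep AS 'Sep',
                         p_oct AS 'Oct', p_nov AS 'Nov', p_dec AS 'Dec'));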
24. Data Load Processes: Data Quality
• EPM applications do not like bad data
• For example, if we try to load an invalid member into Essbase using ODI, it switches to cell mode, greatly impacting load performance
• Having just one metadata table and one inbound table makes the data quality process much simpler (see the sketch below):
• All metadata is stored in a single place
• All data is stored in a single place
• The data quality check can be done for all applications in a single process
• Error handling/email processes are easy to create since everything is gathered in the same place
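As an illustration, a single-query sketch of one such check against the hypothetical tables from earlier (in practice there would be one check per dimension, generated dynamically):

-- Flag inbound rows whose Entity member does not exist in the metadata repository
SELECT i.*
  FROM dw_epm_inbound i
 WHERE NOT EXISTS (SELECT 1
                     FROM dw_epm_metadata m
                    WHERE m.app_name  = i.app_name
                      AND m.hier_name = 'Entity'
                      AND m.member    = i.entity);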
25. DW Powered by ODI: Data Quality
• With only one generic inbound table, we need only one generic E$ table
• It stores the full POV and the data that fails validation
• ODI_Cons_name, Interface_Name, App_Name, Cube and ODI_Sess_NO identify what the error was, which package it came from, and which application the record should have been loaded into
26. Data Load Processes: Overview
[Flow diagram: data from the sources (Oracle, SQL Server, Teradata, Excel, XML) is staged into intermediate tables (Table 1..N), loaded into the generic inbound table, and then pushed to the EPM applications (App 1..N); generic components (E$ tables for inbound errors, error handling, send email) support every step.]
27. Data Extract Processes
• The structure of the outbound table is the same as the inbound table, and the benefits are almost the same:
• Faster to export (mainly if it is a one-year export from a BSO cube)
• Smaller table (fewer rows) and faster to query
• Centralized repository of data (easy to add new applications)
• Scalable to all EPM applications
• Data is reusable (no data replication)
• Create views for the target systems to access the data (a sketch follows this list)
• In the same way that we have multi-period columns in the inbound table, we can have them in the outbound table
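A small sketch of such a view, assuming a dw_epm_outbound table shaped like the inbound one (all names are hypothetical):

-- Expose one application's extracted data to a downstream system
CREATE OR REPLACE VIEW v_epm_outbound_planapp AS
SELECT account, entity, years, period, scenario, version, currency, data
  FROM dw_epm_outbound
 WHERE app_name = 'PLANAPP';   -- hypothetical application name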
29. Oracle Partition
• Partitioning enhances the performance, manageability, and availability of a wide variety of applications and helps reduce the total cost of ownership for storing large amounts of data
• Partitioning allows tables, indexes, and index-organized tables to be subdivided into smaller pieces, enabling these database objects to be managed and accessed at a finer level of granularity
• Oracle provides a rich variety of partitioning strategies and extensions to address every business requirement
• Since it is entirely transparent, partitioning can be applied to almost any application without the need for potentially expensive and time-consuming application changes
31. Oracle Sub Partition: Types
• Composite Partitioning
• Range-Range
• Range-Hash
• Range-List
• List-Range
• List-Hash
• List-List
[Example diagram of a List-Range composite partition: the table is list-partitioned by Scenario (Actual, Forecast, Budget) and range sub-partitioned by Period (Jan to Mar, Apr to Jun, Jul to Sep, Oct to Dec).]
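A hedged DDL sketch of that List-Range example, assuming a numeric month column so the Period ranges can be expressed (all names are ours, not the presenters'):

-- Composite partitioning: list by Scenario, range sub-partition by Period
CREATE TABLE dw_epm_fact_part (
  scenario   VARCHAR2(30) NOT NULL,
  period_num NUMBER(2)    NOT NULL,   -- 1..12
  account    VARCHAR2(80),
  entity     VARCHAR2(80),
  data       NUMBER
)
PARTITION BY LIST (scenario)
SUBPARTITION BY RANGE (period_num)
SUBPARTITION TEMPLATE (
  SUBPARTITION q1 VALUES LESS THAN (4),    -- Jan to Mar
  SUBPARTITION q2 VALUES LESS THAN (7),    -- Apr to Jun
  SUBPARTITION q3 VALUES LESS THAN (10),   -- Jul to Sep
  SUBPARTITION q4 VALUES LESS THAN (13)    -- Oct to Dec
)
(
  PARTITION p_actual   VALUES ('Actual'),
  PARTITION p_forecast VALUES ('Forecast'),
  PARTITION p_budget   VALUES ('Budget')
);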
32. DW Powered by ODI: Partitioning
• ODI can be used to manage table partitions (a sketch follows this list)
• Using a command on source to query ALL_TAB_PARTITIONS and verify whether the partition exists
• Using a command on target to truncate/drop/create partitions
• ODI can also manage sub-partitions
• Harder to maintain
• Better to use a composite key
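A hedged sketch of the SQL such an ODI procedure step could run; the command-on-source/command-on-target split is ODI's standard pattern, while the owner, table and partition names are assumptions tied to the hypothetical tables above:

-- Command on source: does the partition for this application already exist?
SELECT COUNT(*) AS part_exists
  FROM all_tab_partitions
 WHERE table_owner    = 'DW'                -- hypothetical schema
   AND table_name     = 'DW_EPM_METADATA'
   AND partition_name = 'P_FINAPP';

-- Command on target: create the partition if missing, or clean it before reloading
ALTER TABLE dw.dw_epm_metadata ADD PARTITION p_finapp VALUES ('FINAPP');
ALTER TABLE dw.dw_epm_metadata TRUNCATE PARTITION p_finapp;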
33. Overview of our environment
• 10,000+ users around the world
• 24x7 operation
• 10+ source systems
• 18 billion+ inserts/month
• 50 million+ updates/month
• 60 million+ deletes/month
• 14 thousand+ ODI sessions/month