This document discusses an end-to-end example of using Hadoop, OBIEE, ODI and Oracle Big Data Discovery to analyze big data from various sources. It describes ingesting website log data and Twitter data into a Hadoop cluster, processing and transforming the data using tools like Hive and Spark, and using the results for reporting in OBIEE and data discovery in Oracle Big Data Discovery. ODI is used to automate the data integration process.
ODI12c as your Big Data Integration Hub - Mark Rittman
Presentation from the recent Oracle OTN Virtual Technology Summit, on using Oracle Data Integrator 12c to ingest, transform and process data on a Hadoop cluster.
Leveraging Hadoop with OBIEE 11g and ODI 11g - UKOUG Tech'13 - Mark Rittman
The latest releases of OBIEE and ODI come with the ability to connect to Hadoop data sources, using MapReduce to integrate data from clusters of "big data" servers, complementing traditional BI data sources. In this presentation, we will look at how these two tools connect to Apache Hadoop and access "big data" sources, and share tips and tricks on making it all work smoothly.
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we'll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We'll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop's schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a "data factory" to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
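The simplest ingestion route mentioned in the abstract above, file copying with the Hadoop FS shell, is often wrapped in a script. A minimal sketch of how such a wrapper might build the standard `hadoop fs -put` command; the file and directory paths are hypothetical examples:

```python
# Sketch of file-copy ingestion with the Hadoop FS shell.
# "hadoop fs -put <src> <dest>" is the standard copy syntax;
# the paths used below are hypothetical examples.
def hdfs_put_command(local_path, hdfs_dir):
    """Build the 'hadoop fs -put' command list for simple file-copy ingestion."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]

# e.g. pass the result to subprocess.run() on a host with the Hadoop client installed:
# hdfs_put_command("access_log.gz", "/data/raw/weblogs/")
```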
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ... - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
In this presentation we cover some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle's products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).
What is Big Data Discovery, and how it complements traditional business anal... - Mark Rittman
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Part 4 - Hadoop Data Output and Reporting using OBIEE11g - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it's usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle Big Data Discovery and Big Data SQL can be used to visualise and publish Hadoop data. In this final session we'll look at what's involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets.
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar... - Mark Rittman
Presentation from the Rittman Mead BI Forum 2015 masterclass, pt.2 of a two-part session that also covered creating the Discovery Lab. Goes through setting up Flume log and Twitter feeds into CDH5 Hadoop using the ODI12c Advanced Big Data Option, then looks at the use of OBIEE11g with Hive, Impala and Big Data SQL, before finally using Oracle Big Data Discovery for faceted search and data mashup on top of Hadoop.
Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors - Mark Rittman
Presented at Oracle OpenWorld 2014 - a look at the ETL process within a Hadoop cluster, how data gets in, out and around, and how ODI12c and Oracle's Big Data Connectors can be used for this process.
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c - Mark Rittman
Slides from my 2-hour session at the UKOUG Tech'14 Super Sunday event, covering Hadoop basics and use of Oracle Data Integrator 12c for ETL on the Hadoop platform. Also some coverage of Oracle big data product announcements from OOW2014.
Innovation in the Data Warehouse - StampedeCon 2016 - StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we'll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we'll share some lessons learned about the pieces of our architecture that worked well, and rant about those which didn't. No deep Hadoop knowledge is necessary; the session is aimed at the architect or executive level.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... - StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves the quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than in almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover the advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and the creation and maintenance of Hive databases and tables become much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
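The Hive-based format conversion mentioned in the abstract above is typically a single CTAS (create-table-as-select) statement. A sketch of a helper that generates such a statement; the table names are hypothetical, while `STORED AS ORC` / `STORED AS PARQUET` is standard HiveQL:

```python
# Sketch of the Hive CTAS pattern for rewriting an existing table
# into a columnar format such as ORC or Parquet. Table names below
# are hypothetical examples.
def hive_ctas(source_table, target_table, storage_format="ORC"):
    """Generate a HiveQL CTAS statement that rewrites source_table in a columnar format."""
    return (f"CREATE TABLE {target_table} "
            f"STORED AS {storage_format} "
            f"AS SELECT * FROM {source_table}")

# e.g. hive_ctas("weblogs_raw", "weblogs_orc") produces the statement
# you would run in the Hive CLI or via JDBC.
```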
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics - Mark Rittman
This is a session for Oracle DBAs and devs that looks at cutting-edge big data technologies like Spark, Kafka etc., and through demos shows how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling.
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the... - Mark Rittman
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill, along with proprietary solutions from closed-source vendors, have extended SQL-on-Hadoop's capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery, at massive scale and with compelling economics.
In this session we'll focus on the technical foundations of SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work, along with the more specialised underlying storage each now works best with. We'll also take a look to the future to see how SQL querying, data integration and analytics are likely to come together in the next five years to make Hadoop the default platform for running mixed old-world/new-world analytics workloads.
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh... - Mark Rittman
As presented at OGh SQL Celebration Day in June 2016, NL. Covers new features in Big Data SQL including storage indexes, storage handlers and the ability to install and license it on commodity hardware.
Introduction to Kudu - StampedeCon 2016 - StampedeCon
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
Using Oracle Big Data Discovery as a Data Scientist's Toolkit - Mark Rittman
As delivered at Trivadis Tech Event 2016 - how Big Data Discovery, along with Python and pySpark, was used to build predictive analytics models against wearables and smart home data.
What's New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016 - StampedeCon
Spark 2.0 includes many exciting new features including Structured Streaming, and the unification of Datasets (new in 1.6) with DataFrames. Structured Streaming allows one to define recurrent queries on a stream of data that is handled as an infinite DataFrame. This query is incrementally updated with new data. This allows for code reuse between batch and streaming and an easier logical model to reason about. Datasets, an extension of DataFrames, were added as an experimental feature in Spark 1.6. They allow us to manipulate collections of objects in a type-safe fashion. In Spark 2.0 the two abstractions have been unified and now DataFrame = Dataset[Row]. We will discuss both of these new features and look at practical real world examples.
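The incremental-update model described above can be sketched in plain Python, independently of Spark itself: each new micro-batch is folded into a running result rather than recomputing the query over all data seen so far. The class and method names here are hypothetical illustrations, not Spark APIs:

```python
from collections import Counter

# Plain-Python sketch of Structured Streaming's incremental model:
# each micro-batch updates a running aggregate instead of recomputing
# the query over the full (conceptually infinite) input.
class IncrementalWordCount:
    def __init__(self):
        self.counts = Counter()

    def update(self, batch):
        """Fold one micro-batch of words into the running result and return it."""
        self.counts.update(batch)
        return dict(self.counts)
```

The same aggregation logic serves both "batch" (one big update) and "streaming" (many small updates) use, which is the code-reuse benefit the abstract describes.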
How To Leverage OBIEE Within A Big Data Architecture - Kevin McGinley
If you've invested in OBIEE and want to start exploring the use of Big Data technology, this presentation talks about how and why you might want to use OBIEE as the common visualization layer across both traditional and big data sources.
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach - Kent Graziano
First we interview the users, then we design a reporting model based on those interviews. We follow that up with mounds of ETL development to load the new model, basically keeping the user community in the dark during all that development. Does this sound familiar?
This presentation will demonstrate an alternative approach using the Data Vault Data Modeling technique to build a flexible, easily-extensible "Foundation" layer in our data warehouse with an Agile, iterative methodology. Relying on the Business Model and Mapping (BMM) functionality of OBIEE, we can rapidly virtualize a dimensional reporting model using the pattern-based Data Vault Foundation layer to decrease the time, and money, it takes to get BI content in front of end users. Attendees will see a sample Data Vault model designed iteratively and deployed to the semantic model of OBIEE.
Big Data Discovery + Analytics = Datengetriebene Innovation! - Harald Erb
Talk from the DOAG 2015 conference: the delivery of data projects need not be left solely to so-called data scientists. Data and tool complexity in handling big data are no longer insurmountable hurdles for the teams that are already responsible for building and operating the data warehouse and for managing and evolving the business intelligence platform within the company. In an interdisciplinary team, business users and business analysts contribute their domain knowledge to the data project from the start, alongside the technical roles.
In this presentation at DAMA New York, Joe started by asking a key question: why are we doing this? Why analyze and share all these massive amounts of data? Basically, it comes down to the belief that in any organization, in any situation, if we can get the data and make it correct and timely, insights from it will become instantly actionable, allowing companies to function more nimbly and successfully. Enabling the use of data can be a world-changing, world-improving activity, and this session presents the steps necessary to get you there. Joe explained the concept of the "data lake" and also emphasized the role of a strong data governance strategy that incorporates the seven components needed for a successful program.
For more information on this presentation or Caserta Concepts, visit our website at http://casertaconcepts.com/.
Oracle Big Data Discovery working together with Cloudera Hadoop is the fastest way to ingest and understand data. Powerful data transformation capabilities mean that data can quickly be prepared for consumption by the extended organisation.
Informatica to ODI Migration - What, Why and How | Informatica to Oracle Dat... - Jade Global
Learn about the first and only automated solution for Informatica to Oracle Data Integrator (ODI) conversion.
Do you want to know:
'What' is Informatica vs ODI?
'Why' do you need to move to ODI?
'How' is the migration from Informatica to ODI possible?
Learn how you can achieve up to 90% automated conversion, up to 90% reduced implementation time, up to 50% cost savings and up to 5X productivity gain.
To know more, please visit: http://informaticatoodi.jadeglobal.com/
BIG Data & Hadoop Applications in Retail - Skillspeed
Explore the Applications of BIG Data & Hadoop in Retail Industry via Skillspeed.
BIG Data & Hadoop in Retail is a key differentiator, especially in terms of generating memorable customer experiences. They are used for brand sentiment analysis, consumer insights, optimizing store layouts and inventory-demand cycles.
To get more details regarding BIG Data & Hadoop, please visit - www.SkillSpeed.com
Oracle's Big Data solutions consist of a number of new products and solutions to support customers looking to gain maximum business value from data sets such as weblogs, social media feeds, smart meters, sensors and other devices that generate massive volumes of data (commonly defined as "Big Data") that isn't readily accessible in enterprise data warehouses and business intelligence applications today.
Deploying OBIEE in the Cloud - Oracle Openworld 2014 - Mark Rittman
Introduction to Oracle BI Cloud Service (BICS) including administration, data upload, creating the repository and creating dashboards and reports. Also includes a short case-study around Salesforce.com reporting created for the BICS beta program.
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors - Michael Rainey
Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.
Presented at the Oracle Technology Network Virtual Technology Summit February/March 2015.
TimesTen - Beyond the Summary Advisor (ODTUG KScope'14)Mark Rittman
Presentation from ODTUG KScope'14, Seattle, on using TimesTen as a standalone analytic database, and going beyond the use of the Exalytics Summary Advisor.
Using Endeca with Oracle Exalytics - Oracle France BI Customer Event, October...Mark Rittman
Short presentation at the Oracle France BI Customer event in Paris, October 2013, on the advantage of running Endeca Information Discovery on Oracle Exalytics In-Memory Machine.
2-in-1 : RPD Magic and Hyperion Planning "Adapter"Gianni Ceresa
The best way to explain the powerful modeling capabilities of OBIEE is with a real-use case, and for this session it will be the relational database of Hyperion Planning applications, not a model designed for reporting but for application needs. Use the advanced modeling capabilities of OBIEE to transform the relational schema of Hyperion Planning with what we call some "RPD magic." The target is to fill the gap when reporting against Planning, using only Essbase and missing the content of the relational database, to end with a federation of Essbase and the relational source without ETL, just with OBIEE and some magic.
As 11.1.1.9 with a native Planning driver was released just a few weeks before the presentation, it was changed to cover this new driver and then completed with some proper RPD magic to still keep the dual content.
Oracle Data Integrator is the strategic Data Integration tool replacing Oracle Warehouse Builder, offering more flexibility and supporting more technologies. Based on the story of Eurocontrol (the European Organisation for the Safety of Air Navigation), we will review a migration from Oracle Warehouse Builder to Oracle Data Integrator using the migration utility and custom scripting. After looking at the roadmap and the architecture used, we will see how to automatically migrate supported components, how to handle the remaining ones, and what needs to be fine-tuned. Finally we will talk about the testing, the challenges, the risks and the lessons learnt, so you will be ready to successfully achieve such a migration for your company.
Presentation by Mark Rittman, Technical Director, Rittman Mead, on ODI 11g features that support enterprise deployment and usage. Delivered at BIWA Summit 2013, January 2013.
GoldenGate and Oracle Data Integrator - A Perfect Match...Michael Rainey
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Following Oracle's Next Generation Reference Data Warehouse Architecture, this discussion will provide best practices on how to configure, implement, and process data in real-time using ODI and GoldenGate. Attendees will see common real-time challenges solved, including parent-child relationships within micro-batch ETL.
Presented at Rittman Mead BI Forum 2013 Masterclass.
Oracle Warehouse Builder (OWB) and Oracle Data Integrator (ODI) are both Oracle products used for data integration. ODI is the strategic tool Oracle chose for the future, and therefore no further versions of OWB will be released. So the question is: how can OWB developers make the switch to ODI?
This talk aims at introducing the product with a particular focus on Oracle Warehouse Builder developers. It covers key aspects of the product while similarities and differences with its predecessor are highlighted. The big question is of course covered: how to migrate from Oracle Warehouse Builder to Oracle Data Integrator?
After this discovery, the OWB developer can serenely start their journey.
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Caserta
Caserta Concepts' implementation team presented a solution that performs big data analytics on active trade data in real-time. They presented the core components: Storm for the real-time ingest, Cassandra (a NoSQL database), and others. For more information on future events, please check out http://www.casertaconcepts.com/.
Presentation from the Rittman Mead BI Forum 2013 on ODI11g's Hadoop connectivity. Provides a background to Hadoop, HDFS and Hive, and talks about how ODI11g, and OBIEE 11.1.1.7+, uses Hive to connect to "big data" sources.
KScope14 - Real-Time Data Warehouse Upgrade - Success StoriesMichael Rainey
Providing real-time data to its global customers is a necessity for IFPI (International Federation of the Phonographic Industry), a not-for-profit organization with a mission to safeguard the rights of record producers and promote the value of recorded music. Using Oracle Streams and Oracle Warehouse Builder (OWB) for real-time data replication and integration, meeting this goal was becoming a challenge. The solution was difficult to maintain and overall throughput was degrading as data volume increased. The need for greater stability and performance led IFPI to implement Oracle GoldenGate and Oracle Data Integrator. This session will describe the innovative approach taken to complete the migration from a Streams and OWB implementation to a more robust, maintainable, and performant GoldenGate and ODI integrated solution.
OBIEE, Endeca, Hadoop and ORE Development (on Exalytics) (ODTUG 2013)Mark Rittman
A presentation from ODTUG 2013 on tools other than OBIEE for Exalytics, focusing on analysis of non-traditional data via Endeca, "big data" via Hadoop and statistical analysis / predictive modeling through Oracle R Enterprise, and the benefits of running these tools on Oracle Exalytics
GoldenGate and ODI - A Perfect Match for Real-Time Data WarehousingMichael Rainey
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Following Oracle's Next Generation Reference Data Warehouse Architecture, this discussion will provide best practices on how to configure, implement, and process data in real-time using ODI and GoldenGate. Attendees will see common real-time challenges solved, including parent-child relationships within micro-batch ETL.
Presented at RMOUG Training Days 2013 & KScope13.
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015Mark Rittman
Slides from a two-day OBIEE11g seminar in Dubai, February 2015, at the Oracle University Expert Summit. Covers the following topics:
1. OBIEE 11g Overview & New Features
2. Adding Exalytics and In-Memory Analytics to OBIEE 11g
3. Source Control and Concurrent Development for OBIEE
4. No Silver Bullets - OBIEE 11g Performance in the Real World
5. Oracle BI Cloud Service Overview, Tips and Techniques
6. Moving to Oracle BI Applications 11g + ODI
7. Oracle Essbase and Oracle BI EE 11g Integration Tips and Techniques
8. OBIEE 11g and Predictive Analytics, Hadoop & Big Data
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Software
You can watch the replay for this webcast in the IDERA Resource Center: http://ow.ly/QHaG50A58ZB
Many information technology professionals may not recognize it, but the bulk of their work has been, and continues to be, nothing more than database migrations. In the old days it was sharing files across systems, then moving files into relational databases, then loading data warehouses, and now we're moving to NoSQL and the cloud. In this presentation we'll delve into the ever-growing and increasingly complex world of database migrations, including what issues must be planned for and overcome, what problems are likely to occur, and what types of tools exist.
Database expert Bert Scalzo will cover these and many other database migration concerns.
About Bert: Bert Scalzo is an Oracle ACE, author, speaker, consultant, and a major contributor to many popular database tools used by millions of people worldwide. He has 30+ years of database experience and has worked for several major database vendors. He holds BS, MS and Ph.D. degrees in computer science plus an MBA. He has presented at numerous events and webcasts. His areas of key interest include data modeling, database benchmarking, database tuning, SQL optimization, "star schema" data warehousing, running databases on Linux or VMware, and using NVMe flash-based technology to speed up database performance.
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we'll look at the options that are available, compare database vendor solutions with their open-source alternatives, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete "data fabric" solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Mark Rittman
As presented at OGh SQL Celebration Day 2016, including new content on why NoSQL and Hadoop are a better solution for social network analysis than the Oracle Database (for now...)
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products such as Oracle Big Data SQL on the Oracle Big Data Appliance along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we'll look at the options that are available, compare database vendor solutions with their open-source alternative, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete "data fabric" solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsMark Rittman
Presented at the UKOUG Business Analytics SIG Meeting in April 2016; addresses the question of whether enterprise BI tools such as OBIEE12c are relevant in the world of Gartner's bimodal (Mode 1 + Mode 2) analytics and hybrid cloud/on-premise deployments.
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
This talk focuses on what a data reservoir is, how it relates to the RDBMS data warehouse, and how Big Data Discovery provides access to it for business and BI users.
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...Mark Rittman
OBIEE12c comes with an updated version of Essbase that, in this release, focuses entirely on the query-acceleration use-case. This presentation looks at the new release and explains how the new BI Accelerator Wizard manages the creation of Essbase cubes to accelerate OBIEE query performance.
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Mark Rittman
Presentation given at Oracle Openworld 2015 on moving an existing OBIEE11g BI platform to Oracle Public Cloud, including accompanying DW database and continuing the ETL process. Explores migration process and what's now possible in Oracle Cloud for hosting full OBIEE platforms, and looks at what the benefits of such a migration might be for customers and end-users.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) avoids duplicate computations, and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
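As a minimal, illustrative sketch of the first idea above (skipping work for vertices whose ranks have stopped changing), here is a plain-Python power-iteration PageRank that recomputes a vertex only when one of its in-neighbours moved in the previous pass. The graph, damping factor and tolerance are invented for the example; this is not the STICD implementation.

```python
def pagerank(out_links, damping=0.85, tol=1e-6, max_iter=200):
    """Power-iteration PageRank that skips vertices whose in-neighbours
    have all converged (a per-iteration work reduction)."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    in_links = {v: [] for v in nodes}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    changed = set(nodes)                 # vertices whose rank moved last pass
    for _ in range(max_iter):
        if not changed:                  # every vertex has converged
            break
        next_rank = dict(rank)
        for v in nodes:
            # The rank of v depends only on its in-neighbours, so it is
            # recomputed only when at least one of them changed.
            if any(u in changed for u in in_links[v]):
                incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
                next_rank[v] = (1.0 - damping) / n + damping * incoming
        changed = {v for v in nodes if abs(next_rank[v] - rank[v]) >= tol}
        rank = next_rank
    return rank

# Tiny illustrative graph with no dangling nodes.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Note that a vertex is never frozen permanently: if an in-neighbour starts moving again, the vertex is recomputed, which keeps the result close to the true fixed point.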
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group ("MCG") expects demand to grow and supply to keep evolving, driven by institutional investment rotating out of offices and into work from home ("WFH"), and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to be over 3.6x larger by value in 2026, will likely help propel data-center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: "Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand."
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
End-to-End Hadoop Development using OBIEE, ODI, Oracle Big Data SQL and Oracle Big Data Discovery
1. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : info@rittmanmead.com
W : www.rittmanmead.com
End-to-End Hadoop Development using OBIEE, ODI and Oracle Big Data Discovery
Mark Rittman, CTO, Rittman Mead
March 2015
About the Speaker
•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI & DW
•14 years' experience with Oracle technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
•Writer for the Rittman Mead blog: http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
About Rittman Mead
•Oracle BI and DW Gold partner
•Winner of five UKOUG Partner of the Year awards in 2013 - including BI
•World-leading specialist partner for technical excellence, solutions delivery and innovation in Oracle BI
•Approximately 80 consultants worldwide
•All expert in Oracle BI and DW
•Offices in US (Atlanta), Europe, Australia and India
•Skills in broad range of supporting Oracle tools:
‣OBIEE, OBIA, ODI
‣Big Data, Hadoop, NoSQL & Big Data Discovery
‣Essbase, Oracle OLAP
‣GoldenGate
‣Endeca
End-to-End Oracle Big Data Example
•Rittman Mead want to understand drivers and audience for their website
‣What is our most popular content? Who are the most in-demand blog authors?
‣Who are the influencers? What do they read?
•Three data sources in scope:
‣RM Website Logs
‣Twitter Stream
‣Website Posts, Comments etc
Two Analysis Scenarios : Reporting, and Data Discovery
•Initial task will be to ingest data from webserver logs, Twitter firehose, site content + ref data
•Land in Hadoop cluster; basic transform, format, store; then analyse the data:
‣1. Combine with Oracle Big Data SQL for structured OBIEE dashboard analysis
-What pages are people visiting?
-Who is referring to us on Twitter?
-What content has the most reach?
‣2. Combine with site content, semantics, text enrichment; catalog and explore using Oracle Big Data Discovery
-Why is some content more popular?
-Does sentiment affect viewership?
-What content is popular, where?
Step 1
Ingesting Basic Dataset into Hadoop
Data Loading into Hadoop
•Default load type is real-time, streaming loads
‣Batch / bulk loads typically only used to seed the system
•Variety of sources including web log activity, event streams
•Target is typically HDFS (Hive) or HBase
•Data typically lands in "raw state"
‣Lots of files and events, need to be filtered/aggregated
‣Typically semi-structured (JSON, logs etc)
‣High volume, high velocity
-Which is why we use Hadoop rather than an RDBMS (speed vs. ACID trade-off)
‣Economics of Hadoop means it's often possible to archive all incoming data at detail level
[Diagram: the Loading stage, fed by real-time logs / events and file / unstructured imports]
Data Sources used for Project : Stage 1
[Architecture diagram: Ingest / Process / Publish. Flume streams log data, and MongoDB holds Twitter data, feeding a Cloudera CDH5.3 BDA Hadoop cluster running HDFS, Hive and Spark; Big Data SQL executes SQL on the BDA, returning filtered and projected rows / columns plus dimension attributes to Exadata; OBIEE, TimesTen and the 12c In-Memory Option publish results via Exalytics]
Apache Flume : Distributed Transport for Log Activity
•Apache Flume is the standard way to transport log files from source through to target
•Initial use-case was webserver log files, but it can transport any file from A to B
•Does not do data transformation, but can send to multiple targets / target types
•Mechanisms and checks to ensure successful transport of entries
•Has a concept of "agents", "sinks" and "channels"
‣Agents collect and forward log data
‣Sinks store it in the final destination
‣Channels store log data en-route
•Simple configuration through INI-style files
•Handled outside of ODI12c
Flume Source / Target Configuration
•Conf file for source system agent
‣TCP port, channel size+type, source type
•Conf file for target system agent
‣TCP port, channel size+type, sink type
11. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or āØ
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : info@rittmanmead.com
W : www.rittmanmead.com
Also - Apache Kafka : Reliable, Message-Based
•Developed by LinkedIn, designed to address Flume issues around reliability, throughput
‣(though many of those issues have since been addressed)
•Designed for persistent messages as the common use case
‣Website messages, events etc vs. log file entries
•Consumer (pull) rather than Producer (push) model
•Supports multiple consumers per message queue
•More complex to set up than Flume, and can use Flume as a consumer of messages
‣But gaining popularity, especially alongside Spark Streaming
Starting Flume Agents, Check Files Landing in HDFS Directory
•Start the Flume agents on source and target (BDA) servers
•Check that incoming file data starts appearing in HDFS
‣Note - files will be continuously written-to as entries are added to source log files
‣Channel size for the source and target agents determines the max number of events buffered
‣If the buffer is exceeded, new events are dropped until buffer < channel size
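The buffering behaviour described above can be pictured with a small, purely conceptual Python sketch (not Flume code): a fixed-capacity channel that drops new events once full, until the sink drains the buffer again.

```python
from collections import deque

# Conceptual model of a fixed-capacity Flume channel: events arriving
# while the buffer is full are dropped, as described on the slide.
class Channel:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.dropped = 0

    def put(self, event):
        if len(self.buffer) >= self.capacity:
            self.dropped += 1        # buffer full: the new event is lost
            return False
        self.buffer.append(event)
        return True

    def take(self):
        # The sink draining an event frees a slot for new arrivals.
        return self.buffer.popleft() if self.buffer else None

channel = Channel(capacity=3)
for i in range(5):                   # 5 events into a 3-slot channel
    channel.put(f"event-{i}")

print(len(channel.buffer), channel.dropped)   # prints: 3 2
```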
Adding Social Media Datasources to the Hadoop Dataset
•The log activity from the Rittman Mead website tells us what happened, but not "why"
•Common customer requirement now is to get a "360-degree view" of their activity
‣Understand what's being said about them
‣External drivers for interest, activity
‣Understand more about customer intent, opinions
•One example is to add details of social media mentions, likes, tweets and retweets etc to the transactional dataset
‣Correlate Twitter activity with sales increases, drops
‣Measure impact of social media strategy
‣Gather and include textual, sentiment and contextual data from surveys, media etc
Example : Supplement Webserver Log Activity with Twitter Data
•Datasift provide access to the Twitter "firehose" along with Facebook data, Tumblr etc
•Developer-friendly APIs and the ability to define search terms, keywords etc
•Pull (historical data) or Push (real-time) delivery using many formats / end-points
‣Most commonly-used consumption format is JSON, loaded into Redis, MongoDB etc
What is MongoDB?
•Open-source document-store NoSQL database
•Flexible data model, each document (record) can have its own JSON schema
•Highly scalable across multiple nodes (shards)
•MongoDB databases are made up of collections of documents
‣Add new attributes to a document just by using them
‣Single table (collection) design, no joins etc
‣Very useful for holding JSON output from web apps
-for example, Twitter data from Datasift
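To make the document model concrete, here is a small plain-Python sketch of extracting nested attributes from a Datasift-style tweet document. The field paths mirror those mapped into Hive later in the deck; the sample values themselves are invented.

```python
import json

# A Datasift-style tweet document as it might land in a MongoDB
# collection; the nested paths match the mongo.columns.mapping
# used in the Hive table definition later in the deck.
doc = json.loads("""
{
  "interactionId": "sample-id-001",
  "interaction": {
    "interaction": {
      "author": {"username": "markrittman"},
      "content": "New blog post on Big Data Discovery"
    },
    "twitter": {"user": {"followers_count": 12000}}
  }
}
""")

def extract(document, path):
    """Walk a dotted path such as 'interaction.twitter.user.followers_count'."""
    value = document
    for key in path.split("."):
        value = value[key]
    return value

username  = extract(doc, "interaction.interaction.author.username")
followers = extract(doc, "interaction.twitter.user.followers_count")
```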
Where Are We Now : Data Ingestion Phase
•Log data streamed into HDFS via Flume collector / sink
•Twitter data pushed into MongoDB collection
•Data is in raw form
•Now it needs to be refined, combined, transformed and processed into useful data...
Step 2
Processing, Enhancing & Transforming Data
Core Apache Hadoop Tools
•Apache Hadoop, including MapReduce and HDFS
‣Scalable, fault-tolerant file storage for Hadoop
‣Parallel programming framework for Hadoop
•Apache Hive
‣SQL abstraction layer over HDFS
‣Perform set-based ETL within Hadoop
•Apache Pig, Spark
‣Dataflow-type languages over HDFS, Hive etc
‣Extensible through UDFs, streaming etc
•Apache Flume, Apache Sqoop, Apache Kafka
‣Real-time and batch loading into HDFS
‣Modular, fault-tolerant, wide source/target coverage
Other Tools Typically Used...
•Python, Scala, Java and other programming languages
‣For more complex and procedural transformations
•Shell scripts, sed, awk, regexes etc
•R and R-on-Hadoop
‣Typically at the "discovery" phase
•And down the line - ETL tools to automate the process
‣ODI, Pentaho Data Integrator etc
ODI on Hadoop - Enterprise Data Integration for Big Data
•ODI provides an excellent framework for loading and transforming Hadoop data
‣ELT approach pushes transformations down to Hadoop - leveraging the power of the cluster
•Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
‣Whilst still preserving RDBMS push-down
‣Extensible to cover Pig, Spark etc
•Process orchestration
•Data quality / error handling
•Metadata and model-driven
Oracle Big Data Connectors
•Oracle-licensed utilities to connect Hadoop to the Oracle RDBMS
‣Bulk-extract data from Hadoop to Oracle, or expose HDFS / Hive data as external tables
‣Run R analysis and processing on Hadoop
‣Leverage Hadoop compute resources to offload ETL and other work from the Oracle RDBMS
‣Enable Oracle SQL to access and load Hadoop data
Hive SerDes & Storage Handlers
•Plug-in technologies that extend Hive to handle new data formats and semi-structured sources
•Typically distributed as JAR files, hosted on sites such as GitHub
•Can be used to parse log files, access data in NoSQL databases, Amazon S3 etc
CREATE EXTERNAL TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/root/logs';
Configuring ODI12c Topology and Models
•HDFS data servers (source) defined using the generic File technology
•Workaround to support IKM Hive Control Append
•Leave the JDBC driver blank, and put the HDFS URL in the JDBC URL field
Defining Topology and Model for Hive Sources
•Hive supported 'out-of-the-box' with ODI12c (but the KMs require the ODI Application Adapter for Hadoop (ODIAAH) licence)
•Most recent Hadoop distributions use HiveServer2 rather than HiveServer
•Need to ensure the JDBC drivers support your Hive version
•Use the correct JDBC URL format (e.g. jdbc:hive2://bigdatalite:10000/default)
HDFS & Hive Model and Datastore Definitions
•HDFS files for incoming log data, and any other input data
•Hive tables for ETL targets and downstream processing
•Use RKM Hive to reverse-engineer column definitions from Hive
Hive and MongoDB
•The MongoDB Hadoop connector provides a storage handler for Hive tables
•Rather than storing its data in HDFS, the Hive table uses MongoDB for storage
•Define in the SerDe properties the collection elements you want to access, using dot notation
•https://github.com/mongodb/mongo-hadoop
CREATE TABLE tweet_data(
interactionId string,
username string,
content string,
author_followers_count int)
ROW FORMAT SERDE
'com.mongodb.hadoop.hive.BSONSerDe'
STORED BY
'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
'mongo.columns.mapping'='{"interactionId":"interactionId",
"username":"interaction.interaction.author.username",
"content":"interaction.interaction.content",
"author_followers_count":"interaction.twitter.user.followers_count"}'
)
TBLPROPERTIES (
'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
)
Adding MongoDB Datasets into the ODI Repository
•Define the Hive table outside of ODI, using the MongoDB storage handler
•Select the document elements of interest, project into Hive columns
•Add the Hive source to the Topology if needed, then use RKM Hive to bring in the column metadata
Load Incoming Log Files into Hive Table
•First step in the process is to load the incoming log files into a Hive table
‣Also need to parse the log entries to extract request, date, IP address etc columns
‣Hive table can then easily be used in downstream transformations
•Use the IKM File to Hive (LOAD DATA) knowledge module
‣Source can be local files or HDFS
‣Either load the file into the Hive HDFS area, or leave as an external Hive table
‣Ability to use a SerDe to parse the file data
Specifying a SerDe to Parse Incoming Hive Data
•SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats
•Distributed as a JAR file, gives Hive the ability to parse semi-structured formats
•We can use the RegEx SerDe to parse the Apache Combined Log Format file into columns
•Enabled through the OVERRIDE_ROW_FORMAT option of the IKM File to Hive (LOAD DATA) KM
Join to Additional Hive Tables, Transform using HiveQL
•IKM Hive Control Append can be used to perform Hive table joins, filtering, aggregation etc.
•INSERT only, no DELETE, UPDATE etc
•Not all ODI12c mapping operators supported, but basic functionality works OK
•Use this KM to join to other Hive tables, adding more details on post, title etc
•Perform a DISTINCT on the join output, load into a summary Hive table
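The HiveQL generated by such a mapping would look roughly like this sketch; the table and column names are hypothetical, but the INSERT … SELECT DISTINCT … JOIN shape is the kind of statement the KM produces:

```sql
-- Join raw log entries to a posts lookup table, de-duplicate, and append to a summary table
INSERT INTO TABLE pageviews_summary
SELECT DISTINCT l.host, l.request, p.post_title, p.post_author
FROM apachelog l
JOIN posts p ON (l.request = p.post_url);
```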
Joining Hive Tables
•Only equi-joins supported
•Must use ANSI syntax
•More complex joins may not produce valid HiveQL (subqueries etc)
Filtering, Aggregating and Transforming Within Hive
•Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping operators can be added to the mapping to manipulate data
•Generates HiveQL functions, clauses etc
Loading Hive Tables - Streaming CDC using GoldenGate
•Hive tables in the Hadoop cluster can be loaded from file and other sources
•For streaming updates from RDBMS sources, Oracle GoldenGate for Big Data can be used
‣RDBMS to Hive
‣RDBMS to HDFS
‣RDBMS to Flume
•Can combine with Flume, Kafka etc for real-time Hive/HDFS data sync
Bring in Reference Data from Oracle Database
•In this third step, additional reference data from the Oracle Database needs to be added
•In theory, we should be able to add Oracle-sourced datastores to the mapping and join as usual
•But … the Oracle and generic JDBC LKMs don't work with Hive
Options for Importing Oracle / RDBMS Data into Hadoop
•Could export the RDBMS data to file, and load using IKM File to Hive
•Oracle Big Data Connectors only export to Oracle, not import to Hadoop
•Best option is to use Apache Sqoop, and the new IKM SQL to Hive-HBase-File knowledge module
•Hadoop-native, automatically runs in parallel
•Uses native JDBC drivers, or OraOop (for example)
•Bi-directional in-and-out of Hadoop to RDBMS
•Run from the OS command-line
Loading RDBMS Data into Hive using Sqoop
•First step is to stage the Oracle data into an equivalent Hive table
•Use the special LKM SQL Multi-Connect global load knowledge module for the Oracle source
‣Passes responsibility for the load (extract) to the following IKM
•Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table
Join Oracle-Sourced Hive Table to Existing Hive Table
•Oracle-sourced reference data in Hive can then be joined to the existing Hive table as normal
•Filters, aggregation operators etc can be added to the mapping if required
•Use IKM Hive Control Append as the integration KM
Adding the Twitter Data from MongoDB
•Previous steps exposed the Twitter data, in MongoDB, through a Hive table
•RM_RELATED_TWEETS Hive table now included in the ODI Topology + Model
hive> describe rm_related_tweets;
OK
id string from deserializer
interactionid string from deserializer
username string from deserializer
author_name string from deserializer
created_at string from deserializer
content string from deserializer
twitter_link string from deserializer
language string from deserializer
sentiment int from deserializer
author_tweet_count int from deserializer
author_followers_co int from deserializer
author_profile_img_u string from deserializer
author_timezone string from deserializer
normalized_url string from deserializer
mentions string from deserializer
hashtags string from deserializer
Time taken: 0.109 seconds, Fetched: 16 row(s)
Filter Log Entries to Only Leave Blog Post Views
•We're only interested in Twitter activity around blog posts
•Create an additional Hive table to contain just blog post views, to then join to tweets
Filter Tweets Down to Just Those Linking to RM Website
•Filter the list of tweets down to just those that link to the RM blog
Join Twitter Data to Page View Data
•Create a summary table showing Twitter activity per page on the blog
Create ODI Package for Processing Steps, and Execute
•Create an ODI Package or Load Plan to run the steps in sequence
‣With a load plan, can also add exceptions and recoverability
•Execute the package to load data into the final Hive tables
Summary : Data Processing Phase
•We've now processed the incoming data, filtering it and transforming it to the required state
•Joined ('mashed-up') datasets from website activity, and social media mentions
•Ingestion and the load/processing stages are now complete
•Now we want to make the Hadoop output available to a wider, non-technical audience…
Step 3
Publishing & Analyzing Data
New in OBIEE 11.1.1.7 : Hadoop Connectivity through Hive
•MapReduce jobs are typically written in Java, but Hive can make this simpler
•Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
•Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, and automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
•Approach used by ODI and OBIEE to gain access to Hadoop data
•Allows Hadoop data to be accessed just like any other data source
Importing Hadoop/Hive Metadata into RPD
•The HiveODBC driver has to be installed into the Windows environment, so that the BI Administration tool can connect to Hive and return table metadata
•Import as an ODBC datasource, change the physical DB type to Apache Hadoop afterwards
•Note that OBIEE queries cannot span more than one Hive schema (no table prefixes)
Set up ODBC Connection at the OBIEE Server
•OBIEE 11.1.1.7+ ships with HiveODBC drivers, but you need to use the 7.x versions (only Linux supported)
•Configure the ODBC connection in odbc.ini; the name needs to match the RPD ODBC name
•The BI Server should then be able to connect to the Hive server, and on to Hadoop/MapReduce
[ODBC Data Sources]
AnalyticsWeb=Oracle BI Server
Cluster=Oracle BI Server
SSL_Sample=Oracle BI Server
bigdatalite=Oracle 7.1 Apache Hive Wire Protocol

[bigdatalite]
Driver=/u01/app/Middleware/Oracle_BI1/common/ODBC/Merant/7.0.1/lib/ARhive27.so
Description=Oracle 7.1 Apache Hive Wire Protocol
ArraySize=16384
Database=default
DefaultLongDataBuffLen=1024
EnableLongDataBuffLen=1024
EnableDescribeParam=0
Hostname=bigdatalite
LoginTimeout=30
MaxVarcharSize=2000
PortNumber=10000
RemoveColumnQualifiers=0
StringDescribeType=12
TransactionMode=0
UseCurrentSchema=0
OBIEE 11.1.1.7 / HiveServer2 ODBC Driver Issue
•Most customers using BDAs are on CDH4 or CDH5 - which use HiveServer2
•OBIEE 11.1.1.7 only ships/supports HiveServer1 ODBC drivers
•But … OBIEE 11.1.1.7 on Windows can use the Cloudera HiveServer2 ODBC drivers
‣which aren't supported by Oracle
‣but work!
Using Big Data SQL with OBIEE11g
•Preferred Hadoop access path for customers with Oracle Big Data Appliance is Big Data SQL
•Oracle SQL access to both relational, and Hive/NoSQL data sources
•Exadata-type SmartScan against Hadoop datasets
•Response time equivalent to Impala or Hive on Tez
•No issues around HiveQL limitations
•Insulates end-users from differences between Oracle and Hive datasets
Using Oracle Big Data SQL to Federate Hive with Oracle
•Big Data SQL can also be used to add, on-the-fly, additional Oracle reference data to Hive
‣For example : geoip country lookup requiring a non-equi join
‣Additional dimensions not present in Hive
‣Calculated measures, derived attributes using Oracle SQL
(Diagram: Hive weblog activity table joined to Oracle dimension lookup tables, with the combined output shown in report form)
Big Data SQL Example : Add Geolookup Data to Hive
•Hive ACCESS_PER_POST_CATEGORIES table contains host IP addresses
•Ideally we would like to translate these into country, city etc
‣See readership by country
‣Match topics of interest to country, city
How GeoIP Geocoding Works
•Uses the free geocoding API and database from MaxMind
•Convert the IP address to an integer
•Find which integer range our IP address sits within
•But Hive can't use BETWEEN in a join…
•Solution : Expose ACCESS_PER_POST_CATEGORIES using Big Data SQL, then join
Create ORACLE_HIVE version of Hive Table, Join to GeoIP
access_per_post_categories.ip_integer BETWEEN geoip_country.start_ip_int AND geoip_country.end_ip_int
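A hedged sketch of the Big Data SQL external table and the resulting non-equi join; the column list is abbreviated and hypothetical, the access parameters assume the default Hadoop cluster, and geoip_country.country_name is an assumed column:

```sql
-- Oracle external table over the Hive table, queryable with full Oracle SQL
CREATE TABLE access_per_post_categories (
  host        VARCHAR2(100),
  post_title  VARCHAR2(500),
  ip_integer  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.access_per_post_categories)
);

-- The BETWEEN join that Hive could not perform natively
SELECT g.country_name, COUNT(*) AS page_views
FROM access_per_post_categories a
JOIN geoip_country g
  ON a.ip_integer BETWEEN g.start_ip_int AND g.end_ip_int
GROUP BY g.country_name;
```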
Import Oracle Tables, Create RPD joining Tables Together
•No need to use the Hive ODBC drivers - an Oracle OCI connection is used instead
•No issue around HiveServer1 vs HiveServer2; Big Data SQL also handles authentication with the Hadoop cluster (Kerberos etc) in the background
•Transparent to OBIEE - all appear as Oracle tables
•Join across schemas if required
Create Physical Data Model from Imported Table Metadata
•Join the ORACLE_HIVE external table containing the log data to the reference tables from the Oracle DB
Create Business Model and Presentation Layers
•Map the incoming physical tables into a star schema
•Add an aggregation method for the fact measures
•Add logical keys for the logical dimension tables
•Aggregate (Count) on the page ID fact column
•Remove columns from the fact table that aren't measures
Create Initial Analyses Against Combined Dataset
•Create analyses using full SQL features
•Access to Oracle RDBMS Advanced Analytics functions through EVALUATE, EVALUATE_AGGR etc
•Big Data SQL SmartScan feature provides fast, ad-hoc access to Hive data, avoiding MapReduce
Create Views over Fact Table To Support Time-Series Analysis
•Time field comes in from Hive as a string (date and time, plus timezone)
•Create a view over the base ORACLE_HIVE external table to break out date, time etc
•Import a time dimension table from Oracle to support date/time analysis and forward forecasting
Now Enables Time-Series Reporting Incl. Country Lookups
Use Exalytics In-Memory Aggregate Cache if Required
•If further query acceleration is required, the Exalytics In-Memory Cache can be used
•Enabled through the Summary Advisor; caches commonly-used aggregates in the in-memory cache
•Options for TimesTen or the Oracle Database 12c In-Memory Option
•Returns aggregated data 'at the speed of thought'
Step 4
Enabling for Data Discovery
Enable Incoming Site Activity Data for Data Discovery
•Another use-case for Hadoop data is 'data discovery'
‣Load data into the data reservoir
‣Catalog and understand separate datasets
‣Enrich data using graphical tools
‣Join separate datasets together
‣Present textual data alongside measures and key attributes
‣Explore and analyse using faceted search
(Diagram callout: combine with site content, semantics and text enrichment; catalog and explore using Oracle Big Data Discovery - Why is some content more popular? Does sentiment affect viewership? What content is popular, where?)
Oracle Big Data Discovery
•'The Visual Face of Hadoop' - cataloging, analysis and discovery for the data reservoir
•Runs on Cloudera CDH5.3+ (Hortonworks support coming soon)
•Combines Endeca Server + Studio technology with Hadoop-native (Spark) transformations
Data Sources used for Project : Stage 2
(Architecture diagram: a Cloudera CDH5.3 BDA Hadoop cluster, with each node running Spark, Hive, HDFS and BDD Data Processing; a separate BDD node hosts the BDD DGraph Gateway and the BDD Studio web UI, with Hive and HDFS clients. Semi-processed logs (1m rows) and processed Twitter activity are ingested into the DGraph; transformations are written back to the full datasets; site page and comment contents are uploaded, with uploaded DGraph content persisted in Hive / HDFS; data discovery is performed using the Studio web-based app.)
Oracle Big Data Discovery Architecture
•Adds additional nodes into the CDH5.3 cluster, for running DGraph and Studio
•DGraph engine based on Endeca Server technology, can also be clustered
•Hive (HCatalog) used for reading table metadata, mapping back to the underlying HDFS files
•Apache Spark then used to upload (ingest) data into the DGraph, typically a 1m row sample
•Data then held for online analysis in the DGraph
•Option to write back transformations to the underlying Hive/HDFS files using Apache Spark
Ingesting & Sampling Datasets for the DGraph Engine
•Datasets in Hive have to be ingested into the DGraph engine before analysis, transformation
•Can either define an automatic Hive table detector process, or manually upload
•Typically ingests a 1m row random sample
‣A 1m row sample provides > 99% confidence that the answer is within 2% of the value shown, no matter how big the full dataset (1m, 1b, 1q+)
‣Makes interactivity cheap - representative dataset
(Chart: cost rises steeply with the amount of data queried - the '100%' premium - for little gain in accuracy)
Ingesting Site Activity and Tweet Data into DGraph
•Two output datasets from the ODI process have to be ingested into the DGraph engine
•Upload triggered by a manual call to the BDD Data Processing CLI
‣Runs an Oozie job in the background to profile, enrich and then ingest the data into the DGraph
[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t pageviews
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t rm_linked_tweets
(Diagram: the pageviews Hive table (X rows) is sampled to 1m rows by Apache Spark, then profiled and enriched, before being ingested into the BDD DGraph)
{
  "@class" : "com.oracle.endeca.pdi.client.config.workflow.ProvisionDataSetFromHiveConfig",
  "hiveTableName" : "rm_linked_tweets",
  "hiveDatabaseName" : "default",
  "newCollectionName" : "edp_cli_edp_a5dbdb38-b065…",
  "runEnrichment" : true,
  "maxRecordsForNewDataSet" : 1000000,
  "languageOverride" : "unknown"
}
View Ingested Datasets, Create New Project
•Ingested datasets are now visible in Big Data Discovery Studio
•Create a new project from the first dataset, then add the second
Automatic Enrichment of Ingested Datasets
•The ingestion process has automatically geo-coded the host IP addresses
•Other automatic enrichments run after the initial discovery step, based on datatypes and content
Datatype Conversion Example : String to Date / Time
•Datatypes can be converted into other datatypes, with the data transformed if required
•Example : convert an Apache Combined Log Format date/time to a Java date/time
Other Transformation Types, Filtering and Grouping
•Group and bin attribute values; filter on attribute values, etc
•Use the Transformation Editor for custom transformations (Groovy, incl. enrichment functions)
Commit Transforms to DGraph, or Create New Hive Table
•Transformation changes have to be committed to the DGraph sample of the dataset
‣Project transformations are kept separate from other projects' copies of the dataset
•Transformations can also be applied to the full dataset, using Apache Spark
‣Creates a new Hive table of the complete dataset
Upload Additional Datasets
•Users can upload their own datasets into BDD, from an MS Excel or CSV file
•Uploaded data is first loaded into a Hive table, then sampled/ingested as normal
Join Datasets On Common Attributes
•Creates a left-outer join between the primary and secondary dataset
‣Add metrics (sum of page visits, for example) to an activity dataset (e.g. tweets)
‣Join more attributes to the main dataset
Create Discovery Pages for Dataset Analysis
•Select from a palette of visualisation components
•Select measures, attributes for display
Select From Multiple Visualisation Types
Faceted Search Across Project
•Search on attribute values, text in attributes across all datasets
•Extracted keywords, free text field search
Key BDD Studio Differentiator : Faceted Search Across Hadoop
•BDD Studio dashboards support faceted search across all attributes, refinements
•Auto-filter dashboard contents on selected attribute values - for data discovery
•Fast analysis and summarisation through Endeca Server technology
(Screenshot: 'Mark Rittman' selected from Post Authors; results filtered on the selected refinement)
(Screenshot, continued: further refinement on 'OBIEE' in post keywords; results now filtered on two refinements)
Summary
•Oracle Big Data, together with OBIEE, ODI and Oracle Big Data Discovery, provides a complete end-to-end solution with engineered hardware and Hadoop-native tooling
‣Combine with Oracle Big Data SQL for structured OBIEE dashboard analysis
‣Combine with site content, semantics and text enrichment; catalog and explore using Oracle Big Data Discovery
Thank You for Attending!
•Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
•Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com
•Look out for our book, 'Oracle Business Intelligence Developers Guide' - out now!
•Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)
End-to-End Hadoop Development using OBIEE, ODI and Oracle Big Data Discovery
Mark Rittman, CTO, Rittman Mead
March 2015