Presentation from the recent Oracle OTN Virtual Technology Summit, on using Oracle Data Integrator 12c to ingest, transform and process data on a Hadoop cluster.
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c (Mark Rittman)
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ... (Mark Rittman)
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
In this presentation we cover some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).
Leveraging Hadoop with OBIEE 11g and ODI 11g - UKOUG Tech'13 (Mark Rittman)
The latest releases of OBIEE and ODI come with the ability to connect to Hadoop data sources, using MapReduce to integrate data from clusters of "big data" servers complementing traditional BI data sources. In this presentation, we will look at how these two tools connect to Apache Hadoop and access "big data" sources, and share tips and tricks on making it all work smoothly.
Part 4 - Hadoop Data Output and Reporting using OBIEE11g (Mark Rittman)
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle Big Data Discovery and Big Data SQL can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets.
What is Big Data Discovery, and how it complements traditional business anal... (Mark Rittman)
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within their data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar... (Mark Rittman)
Presentation from the Rittman Mead BI Forum 2015 masterclass, pt.2 of a two-part session that also covered creating the Discovery Lab. Goes through setting up Flume log + Twitter feeds into CDH5 Hadoop using ODI12c Advanced Big Data Option, then looks at the use of OBIEE11g with Hive, Impala and Big Data SQL before finally using Oracle Big Data Discovery for faceted search and data mashup on top of Hadoop.
Innovation in the Data Warehouse - StampedeCon 2016 (StampedeCon)
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about the pieces of our architecture that worked well and rant about those that didn’t. No deep Hadoop knowledge is necessary; the talk is pitched at architect or executive level.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... (StampedeCon)
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables become much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Introduction to Kudu - StampedeCon 2016 (StampedeCon)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
2015 nov 27_thug_paytm_rt_ingest_brief_final (Adam Muise)
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system with SQOOP to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature creation template.
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh... (Mark Rittman)
As presented at OGh SQL Celebration Day in June 2016, NL. Covers new features in Big Data SQL including storage indexes, storage handlers and ability to install + license on commodity hardware.
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the... (Mark Rittman)
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened-up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill along with proprietary solutions from closed-source vendors have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.
In this session we’ll focus on the technical foundations of SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work, along with the more specialised underlying storage that each now works best with. We’ll also take a look to the future to see how SQL querying, data integration and analytics are likely to come together in the next five years to make Hadoop the default platform running mixed old-world/new-world analytics workloads.
Deep learning has become widespread as frameworks such as TensorFlow and PyTorch have made it easy to onboard machine learning applications. However, while it is easy to start developing with these frameworks on your local developer machine, scaling up a model to run on a cluster and train on huge datasets is still challenging. Code and dependencies have to be copied to every machine and defining the cluster configurations is tedious and error-prone. In addition, troubleshooting errors and aggregating logs is difficult. Ad-hoc solutions also lack resource guarantees, isolation from other jobs, and fault tolerance.
To solve these problems and make scaling deep learning easy, we have made several enhancements to Hadoop and built an open-source deep learning platform called TonY. In this talk, Anthony and Keqiu will discuss new Hadoop features useful for deep learning, such as GPU resource support, and deep dive into TonY, which lets you run deep learning programs natively on Hadoop. We will discuss TonY's architecture and how it allows users to manage their deep learning jobs, acting as a portal from which to launch notebooks, monitor jobs, and visualize training results.
Using Oracle Big Data Discovery as a Data Scientist's Toolkit (Mark Rittman)
As delivered at Trivadis Tech Event 2016 - how Big Data Discovery along with Python and pySpark was used to build predictive analytics models against wearables and smart home data.
Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors (Mark Rittman)
Presented at Oracle Openworld 2014 - a look at the ETL process within a Hadoop cluster, how data gets in, out and around, and how ODI12c and Oracle's Big Data Connectors can be used for this process.
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c (Mark Rittman)
Slides from my 2hr session at the UKOUG Tech'14 Super Sunday event, covering Hadoop basics and use of Oracle Data Integrator 12c for ETL on the Hadoop platform. Also some coverage of Oracle big data product announcements from OOW2014.
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors (Michael Rainey)
Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.
Presented at the Oracle Technology Network Virtual Technology Summit February/March 2015.
Presentation by Mark Rittman, Technical Director, Rittman Mead, on ODI 11g features that support enterprise deployment and usage. Delivered at BIWA Summit 2013, January 2013.
Deploying OBIEE in the Cloud - Oracle Openworld 2014 (Mark Rittman)
Introduction to Oracle BI Cloud Service (BICS) including administration, data upload, creating the repository and creating dashboards and reports. Also includes a short case-study around Salesforce.com reporting created for the BICS beta program.
Presentation from the Rittman Mead BI Forum 2013 on ODI11g's Hadoop connectivity. Provides a background to Hadoop, HDFS and Hive, and talks about how ODI11g, and OBIEE 11.1.1.7+, uses Hive to connect to "big data" sources.
Demystifying Data Warehouse as a Service (DWaaS) (Kent Graziano)
This is from the talk I gave at the 30th Anniversary NoCOUG meeting in San Jose, CA.
We all know that data warehouses and best practices for them are changing dramatically today. As organizations build new data warehouses and modernize established ones, they are turning to Data Warehousing as a Service (DWaaS) in hopes of taking advantage of the performance, concurrency, simplicity, and lower cost of a SaaS solution or simply to reduce their data center footprint (and the maintenance that goes with that).
But what is a DWaaS really? How is it different from traditional on-premises data warehousing?
In this talk I will:
• Demystify DWaaS by defining it and its goals
• Discuss the real-world benefits of DWaaS
• Discuss some of the coolest features in a DWaaS solution as exemplified by the Snowflake Elastic Data Warehouse.
TimesTen - Beyond the Summary Advisor (ODTUG KScope'14) (Mark Rittman)
Presentation from ODTUG KScope'14, Seattle, on using TimesTen as a standalone analytic database, and going beyond the use of the Exalytics Summary Advisor.
In this document, we will present a very brief introduction to BigData (what is BigData?), Hadoop (how does Hadoop fit into the picture?) and Cloudera Hadoop (what is the difference between Cloudera Hadoop and regular Hadoop?).
Please note that this document is for Hadoop beginners looking for a place to start.
Using Endeca with Oracle Exalytics - Oracle France BI Customer Event, October... (Mark Rittman)
Short presentation at the Oracle France BI Customer event in Paris, October 2013, on the advantage of running Endeca Information Discovery on Oracle Exalytics In-Memory Machine.
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3... (Stavros Papadopoulos)
Slides used in the webinar TileDB hosted with participation from Spire Maritime, describing the use and accessibility of massive time series maritime data on TileDB Cloud.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... (DATAVERSITY)
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future? (Mark Rittman)
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we’ll look at the options that are available, compare database vendor solutions with their open-source alternative, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete “data fabric” solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di... (Mark Rittman)
As presented at OGh SQL Celebration Day 2016 - including new content on why NoSQL and Hadoop are a better solution for social network analysis than the Oracle Database (for now...).
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products such as Oracle Big Data SQL on the Oracle Big Data Appliance along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we'll look at the options that are available, compare database vendor solutions with their open-source alternative, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete "data fabric" solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Mark Rittman
Presented at the UKOUG Business Analytics SIG Meeting in April 2016, this talk addresses the question of whether enterprise BI tools such as OBIEE12c are relevant in the world of Gartner BiModal Mode 1 + Mode 2 analytics and hybrid cloud/on-premise deployments.
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
Mark Rittman
This talk focuses on what a data reservoir is, how it relates to the RDBMS DW, and how Big Data Discovery gives business and BI users access to it.
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics
Mark Rittman
This is a session for Oracle DBAs and devs that looks at cutting-edge big data technologies such as Spark and Kafka and, through demos, shows how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling.
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
Mark Rittman
OBIEE12c comes with an updated version of Essbase that focuses entirely in this release on the query acceleration use-case. This presentation looks at this new release and explains how the new BI Accelerator Wizard manages the creation of Essbase cubes to accelerate OBIEE query performance
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Mark Rittman
Presentation given at Oracle Openworld 2015 on moving an existing OBIEE11g BI platform to Oracle Public Cloud, including accompanying DW database and continuing the ETL process. Explores migration process and what's now possible in Oracle Cloud for hosting full OBIEE platforms, and looks at what the benefits of such a migration might be for customers and end-users.
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
Mark Rittman
Slides from a two-day OBIEE11g seminar in Dubai, February 2015, at the Oracle University Expert Summit. Covers the following topics:
1. OBIEE 11g Overview & New Features
2. Adding Exalytics and In-Memory Analytics to OBIEE 11g
3. Source Control and Concurrent Development for OBIEE
4. No Silver Bullets - OBIEE 11g Performance in the Real World
5. Oracle BI Cloud Service Overview, Tips and Techniques
6. Moving to Oracle BI Applications 11g + ODI
7. Oracle Essbase and Oracle BI EE 11g Integration Tips and Techniques
8. OBIEE 11g and Predictive Analytics, Hadoop & Big Data
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
2. ODI12c as your Big Data Integration Hub
Mark Rittman, CTO, Rittman Mead
July 2014
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : info@rittmanmead.com
W : www.rittmanmead.com
3. About the Speaker
•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI&DW
•14 Years Experience with Oracle Technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
•Oracle Business Intelligence Developers Guide
•Oracle Exalytics Revealed
•Writer for Rittman Mead Blog :
http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
4. About Rittman Mead
•Oracle BI and DW Gold partner
•Winner of five UKOUG Partner of the Year awards in 2013 - including BI
•World leading specialist partner for technical excellence,
solutions delivery and innovation in Oracle BI
•Approximately 80 consultants worldwide
•All expert in Oracle BI and DW
•Offices in US (Atlanta), Europe, Australia and India
•Skills in broad range of supporting Oracle tools:
‣OBIEE, OBIA
‣ODIEE
‣Essbase, Oracle OLAP
‣GoldenGate
‣Endeca
5. Traditional Data Warehouse / BI Architectures
•Three-layer architecture - staging, foundation and access/performance
•All three layers stored in a relational database (Oracle)
•ETL used to move data from layer-to-layer
[Diagram: traditional structured data sources are loaded into a Traditional Relational Data Warehouse made up of Staging, Foundation / ODS and Performance / Dimensional layers, with ETL between the layers. A BI tool (OBIEE) with metadata layer reads the warehouse directly; an OLAP / in-memory tool loads data into its own database.]
6. Recent Innovations and Developments in DW Architecture
•The rise of “big data” and “Hadoop”
‣New ways to process, store and analyse data
‣New paradigm for TCO - low-cost servers, open-source software, cheap clustering
•Explosion in potential data-source types
‣Unstructured data
‣Social media feeds
‣Schema-less and schema-on-read databases
•New ways of hosting data warehouses
‣In the cloud
‣Do we even need an Oracle database or DW?
•Lots of opportunities for DW/BI developers - make our systems cheaper, wider range of data
7. Introduction of New Data Sources : Unstructured, Big Data
[Diagram: the Traditional Relational Data Warehouse architecture from the previous slide (Staging, Foundation / ODS and Performance / Dimensional layers, BI tool and OLAP / in-memory tool), now with three new kinds of source alongside the traditional structured data sources: schema-less / NoSQL data sources, unstructured / social / document data sources, and Hadoop / big data sources.]
8. Unstructured, Semi-Structured and Schema-Less Data
•Gaining access to the vast amounts of non-financial / application data out there
‣Data in documents, spreadsheets etc
-Warranty claims, supporting documents, notes etc
‣Data coming from the cloud / social media
‣Data for which we don’t yet have a structure
‣Data whose structure we’ll decide when we choose to access it (“schema-on-read”)
•All of the above could be useful information
to have in our DW and BI systems
‣But how do we load it in?
‣And what if we want to access it directly?
[Diagram: three source types: schema-less / NoSQL data sources, unstructured / social / document data sources, and Hadoop / big data sources.]
9. Hadoop, and the Big Data Ecosystem
•Apache Hadoop is one of the most well-known Big Data technologies
‣Family of open-source products used to store and analyze distributed datasets
‣Hadoop is the enabling framework, automatically parallelises and co-ordinates jobs
‣MapReduce is the programming framework
for filtering, sorting and aggregating data
‣Map : filter data and pass on to reducers
‣Reduce : sort, group and return results
‣MapReduce jobs can be written in any
language (Java etc), but it is complicated
•Can be used as an extension of the DW staging layer - cheap processing & storage
•And there may be data stored in Hadoop that our BI users might benefit from
10. HDFS: Low-Cost, Clustered, Fault-Tolerant Storage
•The filesystem behind Hadoop, used to store data for Hadoop analysis
‣Unix-like, uses commands such as ls, mkdir, chown, chmod
•Fault-tolerant, with rapid fault detection and recovery
•High-throughput, with streaming data access and large block sizes
•Designed for data-locality, placing data close to where it is processed
•Accessed from the command line, via hdfs:// URIs, GUI tools etc
[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle
Found 5 items
drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging
drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo
drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework
drwxrwxrwx - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff
drwxrwxrwx - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage
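To make the "large block sizes" and fault-tolerance points concrete, here is a rough sketch of how a file's footprint in HDFS can be estimated. The 128 MB block size and replication factor of 3 are assumptions (common defaults in recent Hadoop releases, controlled by `dfs.blocksize` and `dfs.replication`); older clusters often use 64 MB blocks, so check your own configuration.

```python
import math


def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    block_bytes and replication are assumed values, not read from a cluster.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_bytes / block_bytes) if file_bytes else 0
    # Every block is copied `replication` times; a partial block only
    # occupies as much disk as the data it actually holds.
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes


blocks, raw = hdfs_footprint(1024**3)  # a 1 GiB log file
print(blocks, "blocks,", raw // 1024**2, "MiB of raw storage")
```

Each block can be read by a separate task on a node that holds a replica, which is where the data-locality and throughput benefits come from.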
11. Hadoop & HDFS as a Low-Cost Pre-Staging Layer
[Diagram: the Traditional Relational Data Warehouse architecture (Staging, Foundation / ODS and Performance / Dimensional layers, BI tool and OLAP / in-memory tool) with Hadoop placed in front of it. Schema-less / NoSQL, unstructured / social / document and Hadoop / big data sources are loaded into Hadoop, which provides a low-cost file store (HDFS) and pre-ETL filtering & aggregation (MapReduce); the results are then loaded into the warehouse.]
12. Big Data and the Hadoop “Data Warehouse”
[Diagram: schema-less / NoSQL, unstructured / social / document, Hadoop / big data and cloud-based data sources are loaded into Hadoop, which provides a low-cost file store (HDFS), pre-ETL filtering & aggregation (MapReduce) and a Hadoop DW layer (Hive). A BI tool (OBIEE) with metadata layer reads from Hadoop directly.]
•Rather than load Hadoop data
into the DW, access it directly
•Hadoop has a “DW layer” called
Hive, which provides SQL access
•Could even be used instead of
a traditional DW or data mart
•Limited functionality now
•But products maturing
•and unbeatable TCO
13. Hive as the Hadoop “Data Warehouse”
•MapReduce jobs are typically written in Java, but Hive can make this simpler
•Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
•Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically
creates MapReduce jobs against data previously loaded into the Hive HDFS tables
•Approach used by ODI and OBIEE
to gain access to Hadoop data
•Allows Hadoop data to be accessed just like
any other data source (sort of...)
14. How Hive Provides SQL Access over Hadoop
•Hive uses an RDBMS metastore to hold table and column definitions in schemas
•Hive tables then map onto HDFS-stored files
‣Managed tables
‣External tables
•Oracle-like query optimizer, compiler, executor
•JDBC and ODBC drivers, plus CLI etc
[Diagram: the Hive driver (compile, optimize, execute) and the metastore sit over HDFS. Managed tables live under /user/hive/warehouse/: HDFS or local files are loaded into the Hive HDFS area, using the HiveQL CREATE TABLE command. External tables (for example /user/oracle/ and /user/movies/data/) are HDFS files loaded into HDFS using an external process, then mapped into Hive using the CREATE EXTERNAL TABLE command.]
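The difference between the two table types shows up in the DDL. The sketch below is a small, hypothetical Python helper (not part of Hive or ODI) that builds both kinds of statement; the table and column names are made up for illustration, and only the /user/movies/data/ path comes from the slide.

```python
def create_table_ddl(name, columns, location=None):
    """Build a HiveQL CREATE TABLE statement as a string.

    Without a location the table is managed, and Hive keeps its files
    under its own warehouse directory. With a location the table is
    external and simply points at files already in HDFS.
    """
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    if location is None:
        return f"CREATE TABLE {name} ({cols})"
    return f"CREATE EXTERNAL TABLE {name} ({cols}) LOCATION '{location}'"


# Hypothetical tables and columns, for illustration only.
print(create_table_ddl("ratings", [("movie_id", "INT"), ("rating", "INT")]))
print(create_table_ddl("movies_ext", [("movie_id", "INT"), ("title", "STRING")],
                       "/user/movies/data/"))
```

One practical consequence of the choice: dropping a managed table removes its data files, whereas dropping an external table removes only the metastore definition and leaves the files in place.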
15. Transforming HiveQL Queries into MapReduce Jobs
•HiveQL queries are automatically translated into Java MapReduce jobs
•Selection and filtering part becomes Map tasks
•Aggregation part becomes the Reduce tasks
SELECT a, sum(b)
FROM myTable
WHERE a<100
GROUP BY a
[Diagram: the query runs as three Map tasks feeding two Reduce tasks, which produce the result.]
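As a rough illustration of that split, the following sketch simulates the query in plain Python: the WHERE clause is applied in the map phase, rows are grouped by key in a shuffle step, and the sum is computed in the reduce phase. It is a single-process toy under those assumptions, not the code Hive generates; the function names and sample rows are invented.

```python
from collections import defaultdict


def map_task(rows):
    """Map: apply the WHERE a < 100 filter and emit (a, b) pairs."""
    for a, b in rows:
        if a < 100:
            yield a, b


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: compute sum(b) for one value of a."""
    return key, sum(values)


def run_query(splits):
    """Run one map task per input split, then reduce each group."""
    mapped = [pair for split in splits for pair in map_task(split)]
    grouped = shuffle(mapped)
    return sorted(reduce_task(key, values) for key, values in grouped.items())


# Two input splits of (a, b) rows, standing in for blocks of myTable.
splits = [[(1, 10), (150, 5)], [(1, 7), (2, 3)]]
print(run_query(splits))  # [(1, 17), (2, 3)]
```

On a real cluster each split is processed on a different node and the shuffle moves data across the network, which is why even small queries carry a noticeable start-up cost.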
16. An example Hive Query Session: Connect and Display Table List
[oracle@bigdatalite ~]$ hive
Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt
hive> show tables;
OK
dwh_customer
dwh_customer_tmp
i_dwh_customer
ratings
src_customer
src_sales_person
weblog
weblog_preprocessed
weblog_sessionized
Time taken: 2.925 seconds
Hive Server lists out all
“tables” that have been
defined within the Hive
environment
17. An example Hive Query Session: Display Table Row Count
hive> select count(*) from src_customer;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds
Request count(*) from table
Hive server generates MapReduce job to “map” table key/value pairs, and then reduce the results to table count
MapReduce job automatically run by Hive Server
Results returned to user
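The three settings printed in the session output control how many reduce tasks Hive asks for. A count(*) with no GROUP BY needs a single reducer, as shown above, but for grouped queries Hive estimates the number from the input size. The sketch below shows that estimate as I understand it; treat it as an approximation, since the exact behaviour and the default values differ between Hive versions. The function and parameter names are my own.

```python
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers,
                      fixed_reducers=-1):
    """Approximate how many reduce tasks Hive would request.

    bytes_per_reducer -> hive.exec.reducers.bytes.per.reducer
    max_reducers      -> hive.exec.reducers.max
    fixed_reducers    -> mapred.reduce.tasks (a positive value overrides
                         the estimate; -1 leaves the decision to Hive)
    """
    if fixed_reducers > 0:
        return fixed_reducers
    # Ceiling division: one reducer per bytes_per_reducer of input.
    reducers = -(-input_bytes // bytes_per_reducer)
    # Always at least one reducer, never more than the configured maximum.
    return min(max_reducers, max(1, reducers))


# 10 GiB of input with 1 GiB per reducer and a cap of 999 (assumed values).
print(estimate_reducers(10 * 1024**3, 1024**3, 999))
```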
18. Demonstration of Hive and HiveQL
19. DW 2013: The Mixed Architecture with Federated Queries
•Where many organisations are going:
•Traditional DW at core of strategy
•Making increasing use of low-cost,
cloud/big data tech for storage /
pre-processing
•Access to non-traditional data sources,
usually via ETL in to the DW
•Federated data access through
OBIEE connectivity & metadata layer
20. Oracle’s Big Data Products
•Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing
‣Cloudera Distribution of Hadoop
‣Cloudera Manager
‣Open-source R
‣Oracle NoSQL Database Community Edition
‣Oracle Enterprise Linux + Oracle JVM
•Oracle Big Data Connectors
‣Oracle Loader for Hadoop (Hadoop > Oracle RDBMS)
‣Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS)
‣Oracle Data Integration Adapter for Hadoop
‣Oracle R Connector for Hadoop
‣Oracle NoSQL Database (column/key-store DB based on BerkeleyDB)
21. Oracle Loader for Hadoop
•Oracle technology for accessing Hadoop data, and loading it into an Oracle database
•Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce
•Direct-path loads into Oracle Database, partitioned and non-partitioned
•Online and offline loads
•Key technology for fast load of
Hadoop results into Oracle DB
22. Oracle Direct Connector for HDFS
•Enables HDFS as a data-source for Oracle Database external tables
•Effectively provides Oracle SQL access over HDFS
•Supports data query, or import into Oracle DB
•Treat HDFS-stored files in the same way as regular files
‣But with HDFS’s low cost
‣… and fault-tolerance
23. Oracle Data Integration Adapter for Hadoop
•ODI 11g/12c Application Adapter (pay-extra option) for Hadoop connectivity
•Works for both Windows and Linux installs of ODI Studio
‣Need to source HiveJDBC drivers and JARs from separate Hadoop install
•Provides six new knowledge modules
‣IKM File to Hive (Load Data)
‣IKM Hive Control Append
‣IKM Hive Transform
‣IKM File-Hive to Oracle (OLH)
‣CKM Hive
‣RKM Hive
24. How ODI Accesses Hadoop Data
•ODI accesses data in Hadoop clusters through Apache Hive
‣Metadata and query layer over MapReduce
‣Provides SQL-like language (HiveQL) and a data dictionary
‣Provides a means to define “tables”, into
which file data is loaded, and then queried
via MapReduce
‣Accessed via Hive JDBC driver (separate Hadoop install required on ODI server, for client libs)
•Additional access through
Oracle Direct Connector for HDFS
and Oracle Loader for Hadoop
[Diagram: ODI 11g sends HiveQL to the Hive Server on the Hadoop cluster, which runs MapReduce. Data reaches the Oracle RDBMS through direct-path loads using Oracle Loader for Hadoop, with transformation logic in MapReduce.]
25. ODI as Part of Oracle’s Big Data Strategy
•ODI is the data integration tool for extracting data from Hadoop/MapReduce, and loading
into Oracle Big Data Appliance, Oracle Exadata and Oracle Exalytics
•Oracle Application Adapter for Hadoop provides required data adapters
‣Load data into Hadoop from local filesystem,
or HDFS (Hadoop clustered FS)
‣Read data from Hadoop/MapReduce using
Apache Hive (JDBC) and HiveQL, load
into Oracle RDBMS using
Oracle Loader for Hadoop
•Supported by Oracle’s Engineered Systems
‣Exadata
‣Exalytics
‣Big Data Appliance
26. Support for Heterogeneous Sources and Targets
•ODI12c isn’t just a big data ETL tool though
•Technology adapters for most RDBMSs, file types, OBIEE, application sources
•Multidimensional servers such as Oracle Essbase, and associated EPM apps
•XML sources, and JMS queues
•SOA environments, using messaging
and service buses, typically in real-time
•All enabled through “knowledge module”
approach - ODI acts as orchestrator and
code generator, uses E-L-T approach
27. The Key to ODI Extensibility - Knowledge Modules
•Divides the ETL process into separate steps - extract (load), integrate, check constraints etc
•ODI generates native code for each platform, taking a template for each step + adding
table names, column names, join conditions etc
‣Easy to extend
‣Easy to read the code
‣Makes it possible for ODI to support Spark, Pig etc in future
-Hadoop-native ETL
‣Uses the power of the target platform for integration tasks
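The template-plus-metadata idea can be sketched in a few lines. This is a deliberately simplified stand-in written in Python: real knowledge modules are written against ODI's own substitution methods rather than Python's string.Template, and the template text, table names and columns below are hypothetical.

```python
from string import Template

# Hypothetical, heavily simplified stand-in for one integration step of a KM.
IKM_STEP = Template(
    "INSERT INTO TABLE $target\n"
    "SELECT $columns\n"
    "FROM $source\n"
    "WHERE $filter"
)


def generate_code(template, mapping):
    """Fill a step template with table names, columns and filter conditions."""
    return template.substitute(
        target=mapping["target"],
        columns=", ".join(mapping["columns"]),
        source=mapping["source"],
        filter=mapping["filter"],
    )


# Metadata that a mapping would normally supply (invented for illustration).
mapping = {
    "target": "weblog_preprocessed",
    "columns": ["host", "request"],
    "source": "weblog",
    "filter": "status = 200",
}
print(generate_code(IKM_STEP, mapping))
```

Supporting a new platform then means writing new templates rather than changing the tool itself; the generated statement runs on the target platform.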
28. Part of the Wider Oracle Data Integration Platform
•Oracle Data Integrator for large-scale data integration across heterogeneous sources and targets
•Oracle GoldenGate for heterogeneous data replication and changed data capture
•Oracle Enterprise Data Quality for data profiling and cleansing
•Oracle Data Services Integrator
for SOA message-based
data federation
29. ODI and Big Data Integration Example
•In this example, we’ll show an end-to-end ETL process on Hadoop using ODI12c & BDA
•Scenario: load webserver log data into Hadoop; process, enhance and aggregate it; then load the final summary table into Oracle Database 12c
‣Process using Hadoop framework
‣Leverage Big Data Connectors
‣Metadata-based ETL development
using ODI12c
‣Real-world example
30. ETL & Data Flow through BDA System
•Five-step process to load, transform, aggregate and filter incoming log data
•Leverage ODI’s capabilities where possible
•Make use of Hadoop power + scalability
[Diagram: Apache HTTP Server log files are sent by Flume agents (Flume messaging, TCP port 4545 in the example) to log files in HDFS, then pass through five numbered steps: 1 IKM File to Hive using RegEx SerDe; 2 and 3 IKM Hive Control Append (Hive table join & load into target Hive table); 4 IKM Hive Transform (Hive streaming through Python script); 5 IKM File / Hive to Oracle (bulk unload to Oracle DB). Hive tables shown: hive_raw_apache_access_log, posts, categories_sql_extract (Sqoop extract), log_entries_and_post_detail, and the geocoding IP > country list.]
31. Five-Step ETL Process
1. Take the incoming log files (via Flume) and load into a structured Hive table
2. Enhance data from that table to include details on authors, posts from other Hive tables
3. Join to some additional ref. data held in an Oracle database, to add author details
4. Geocode the log data, so that we have the country for each calling IP address
5. Output the data in summary form to an Oracle database
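Step 4 runs the log rows through a Python script using Hive streaming. The script used in the demo is not shown here, so the following is only a sketch of what such a script might look like: Hive passes rows to a streaming script as tab-separated lines and reads tab-separated lines back. The lookup table, the country labels and the assumption that the IP address is the first column are all invented for illustration (the ranges are reserved documentation addresses).

```python
import bisect
import ipaddress

# Hypothetical lookup table: (first IP, last IP, country), as sorted integers.
RANGES = sorted(
    (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), country)
    for first, last, country in [
        ("192.0.2.0", "192.0.2.255", "GB"),
        ("198.51.100.0", "198.51.100.255", "US"),
    ]
)
STARTS = [start for start, _, _ in RANGES]


def country_for(ip):
    """Return the country whose range contains ip, or 'Unknown'."""
    try:
        value = int(ipaddress.ip_address(ip))
    except ValueError:
        return "Unknown"
    # Find the last range that starts at or before this address.
    i = bisect.bisect_right(STARTS, value) - 1
    if i >= 0 and RANGES[i][0] <= value <= RANGES[i][1]:
        return RANGES[i][2]
    return "Unknown"


def transform(lines):
    """Append a country column to each tab-separated row (IP assumed first)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields + [country_for(fields[0])])


# Stand-in for the rows Hive would stream to the script.
for row in transform(["192.0.2.10\t/index.html\n", "203.0.113.5\t/about\n"]):
    print(row)
```

When run as an actual streaming script, the same transform function would be fed sys.stdin and its output printed to standard output, one row per line.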
32. Using Flume to Transport Log Files to BDA
•Apache Flume is the standard way to transport log files from source through to target
•Initial use-case was webserver log files, but can transport any file from A>B
•Does not do data transformation, but can send to multiple targets / target types
•Mechanisms and checks to ensure successful transport of entries
•Has a concept of “agents”, “sinks” and “channels”
•Agents collect and forward log data
•Sinks store it in final destination
•Channels store log data en-route
•Simple configuration through INI files
•Handled outside of ODI12c
33. GoldenGate for Continuous Streaming to Hadoop
•Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop
•Leverages GoldenGate & HDFS / Hive Java APIs
•Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive)
•Likely to be formal part of GoldenGate in future release - but usable now
34. Using Flume for Distributed Log Capture to Single Target
•Multiple agents can be used to capture logs from many sources, combine into one output
•Needs at least one source agent, and a target agent
•Agents can be multi-step, handing-off data across the topology
•Channels store data in files, or in RAM, as a buffer between steps
•Log files being continuously written to have
contents trickle-fed across to source
•Sink types for Hive, HBase and many others
•Free software, part of Hadoop platform
35. Configuring Flume for Log Transport to the BDA
•Conf file for source system agent
•TCP port, channel size+type, source type
•Conf file for target system agent
•TCP port, channel size+type, sink type
36. Starting the Agents, Check Files Landing in HDFS Directory
•Start the Flume agents on source and target (BDA) servers
•Check that incoming file data starts appearing in HDFS
‣Note - files will be continuously written-to as
entries added to source log files
‣Channel size for source, target agents
determines max no. of events buffered
‣If buffer exceeded, new events dropped
until buffer < channel size
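The buffering behaviour described above can be pictured with a toy model. This is only a sketch in Python 3 under stated assumptions, not Flume's implementation: the class `BoundedChannel` and its `put` / `take` methods are hypothetical names, and the model simply refuses new events while the buffer is at capacity.

```python
from collections import deque


class BoundedChannel:
    """Toy stand-in for a fixed-capacity channel (hypothetical, not Flume code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def put(self, event):
        """Buffer the event if there is room; otherwise count it as dropped."""
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def take(self):
        """Hand the oldest buffered event to the sink, or None if empty."""
        return self.events.popleft() if self.events else None


channel = BoundedChannel(capacity=2)
# The third event arrives while the buffer is full, so it is dropped.
accepted = [channel.put(line) for line in ["GET /a", "GET /b", "GET /c"]]
channel.take()  # the sink drains one event, freeing a slot
accepted.append(channel.put("GET /d"))
```

Once the sink has drained an event the buffer is below capacity again, so the next event is accepted.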
37. Load Incoming Log Files into Hive Table
•First step in process is to load the incoming log files into a Hive table
‣Also need to parse the log entries to extract request, date, IP address etc columns
‣Hive table can then easily be used in
downstream transformations
•Use IKM File to Hive (LOAD DATA) KM
‣Source can be local files or HDFS
‣Either load file into Hive HDFS area,
or leave as external Hive table
‣Ability to use SerDe to parse file data
38. First Though … Need to Set Up Topology and Models
•HDFS data servers (source) defined using generic File technology
•Workaround to support IKM Hive Control Append
•Leave JDBC driver blank, put HDFS URL in JDBC URL field
39. Defining Physical Schema and Model for HDFS Directory
•Hadoop processes typically access a whole directory of files in HDFS, rather than single one
•Hive, Pig etc aggregate all files in that directory and treat as single file
•ODI Models usually point to a single file though -
how do you set up access correctly?
40. Defining Physical Schema and Model for HDFS Directory
•ODI appends file name to Physical Schema name for Hive access
•To access a directory, set physical
schema to parent directory
•Set model Resource Name to
directory you want to use as source
•Note - need to manually enter file/
resource names, and “Test” button
does not work for HDFS sources
41. Defining Topology and Model for Hive Sources
•Hive supported “out-of-the-box” with ODI12c (but requires ODIAAH license for KMs)
•Most recent Hadoop distributions use HiveServer2 rather than HiveServer
•Need to ensure JDBC drivers support Hive version
•Use correct JDBC URL format (jdbc:hive2://…)
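As a small illustration of the HiveServer2 URL shape, the helper below assembles one in Python 3. The function name `hive2_jdbc_url` is hypothetical, and the defaults are assumptions about the cluster: port 10000 is the usual HiveServer2 port and `default` the usual database, but yours may differ.

```python
def hive2_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database."""
    return "jdbc:hive2://{0}:{1}/{2}".format(host, port, database)


# "bdanode1" is used purely as an example host name.
url = hive2_jdbc_url("bdanode1")
```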
42. Hive Tables and Underlying HDFS Storage Permissions
•Hadoop by default has quite loose security
•Files in HDFS organized into directories, using Unix-like permissions
•Hive tables can be created by any user, over directories they have read-access to
‣But that user might not have write permissions on the underlying directory
‣Causes mapping execution failures in ODI if directory read-only
•Therefore ensure you have read/write access to the directories used by Hive,
and create tables as the HDFS user you’ll connect with through JDBC
‣Simplest approach - create Hue user for “oracle”, create Hive tables under that user
43. Final Model and Datastore Definitions
•HDFS files for incoming log data, and any other input data
•Hive tables for ETL targets and downstream processing
•Use RKM Hive to reverse-engineer column definition from Hive
44. Using IKM File to Hive to Load Web Log File Data into Hive
•Create mapping to load file source (single column for weblog entries) into Hive table
•Target Hive table should have column for incoming log row, and parsed columns
45. Specifying a SerDe to Parse Incoming Hive Data
•SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats
•Distributed as JAR file, gives Hive ability to parse semi-structured formats
•We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns
•Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option
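To show what regex-based parsing of a CombinedLogFormat line involves, here is a minimal Python 3 sketch. The pattern is an illustration written for this note, not the expression used with the RegEx SerDe in the presentation, and the sample log line and group names are made up.

```python
import re

# One named group per field of the Apache combined log format.
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


# Made-up sample entry, for illustration only.
sample = ('192.0.2.10 - - [10/Oct/2014:13:55:36 +0000] '
          '"GET /blog/index.html HTTP/1.1" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/5.0"')
row = parse_line(sample)
```

Each named group corresponds to one column that a regex-driven row format would expose in the Hive table.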
46. Distributing SerDe JAR Files for Hive across Cluster
•Hive SerDe functionality typically requires additional JARs to be made available to Hive
•Following steps must be performed across ALL BDA nodes:
‣Add JAR reference to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive.env.sh
export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-contrib-0.12.0-cdh5.0.1.jar:$(echo $HIVE_AUX_JARS_PATH…
‣Add JAR file to /usr/lib/hadoop
[root@bdanode1 hadoop]# ls /usr/lib/hadoop/hive-*
/usr/lib/hadoop/hive-contrib-0.12.0-cdh5.0.1.jar
‣Restart YARN / MR1 TaskTrackers across cluster
47. Executing First ODI12c Mapping
•EXTERNAL_TABLE option chosen in IKM File to Hive (LOAD DATA), as Flume will continue
writing to the file until the source log rotates
•View results of data load in ODI Studio
48. Join to Additional Hive Tables, Transform using HiveQL
•IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, agg. etc.
•INSERT only, no DELETE, UPDATE etc
•Not all ODI12c mapping operators supported, but basic functionality works OK
•Use this KM to join to other Hive tables,
adding more details on post, title etc
•Perform DISTINCT on join output, load
into summary Hive table
49. Joining Hive Tables
•Only equi-joins supported
•Must use ANSI syntax
•More complex joins may not produce
valid HiveQL (subqueries etc)
50. Filtering, Aggregating and Transforming Within Hive
•Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping
operators can be added to mapping to manipulate data
•Generates HiveQL functions, clauses etc
51. Executing Second Mapping
•ODI IKM Hive to Hive Control Append generates HiveQL to perform data loading
•In the background, Hive on BDA creates MapReduce job(s) to load and transform HDFS data
•Automatically runs across the cluster, in parallel and with fault tolerance, HA
52. Bring in Reference Data from Oracle Database
•In this third step, additional reference data from Oracle Database needs to be added
•In theory, should be able to add Oracle-sourced datastores to mapping and join as usual
•But … Oracle / JDBC-generic LKMs don’t yet work with Hive
53. Options for Importing Oracle / RDBMS Data into Hadoop
•Using ODI, only KM option currently is IKM File to Hive (LOAD DATA)
•But this involves an unnecessary export to file before loading
•One option is to use Apache Sqoop, and call from an ODI Procedure
•Hadoop-native, automatically runs in parallel
•Uses native JDBC drivers, or OraOop (for example)
•Bi-directional in-and-out of Hadoop to RDBMS
•Run from OS command-line
54. Creating an ODI Procedure to Invoke Sqoop
•Create an OS task, can then reference whole Oracle tables, or an SQL SELECT
55. Sqoop Command-Line Parameters
sqoop import \
  --connect jdbc:oracle:thin:@centraldb11gr2.rittmandev.com:1521/ctrl11g.rittmandev.com \
  --username blog_refdata --password password \
  --query 'SELECT p.post_id, c.cat_name from post_one_cat p, categories c where p.cat_id = c.cat_id and $CONDITIONS' \
  --target-dir /user/oracle/post_categories \
  --hive-import --hive-overwrite --hive-table post_categories \
  --split-by p.post_id
•--username, --password : database account username and password
•--query : SELECT statement to retrieve data (can use --table instead, for single table)
•$CONDITIONS, --split-by : placeholder and column that let Sqoop split the query across parallel MapReduce tasks
•--hive-import, --hive-overwrite, --hive-table : name and load mode for Hive table
•--target-dir : target HDFS directory to land data in initially (required for SELECT)
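When the Sqoop call is generated by a script rather than typed by hand, building it as an argument list avoids shell-quoting mistakes around the query. The sketch below is Python 3 and only assembles the command; the function name `build_sqoop_import` is hypothetical, and the option names are the standard Sqoop import options listed above (`--target-dir` written with a hyphen).

```python
import shlex


def build_sqoop_import(jdbc_url, username, password, query,
                       target_dir, hive_table, split_by):
    """Return the argument list for a free-form-query Sqoop import into Hive."""
    # Sqoop substitutes its own split predicate for the $CONDITIONS token.
    if "$CONDITIONS" not in query:
        raise ValueError("free-form query must contain $CONDITIONS")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password", password,
        "--query", query,
        "--target-dir", target_dir,
        "--hive-import", "--hive-overwrite",
        "--hive-table", hive_table,
        "--split-by", split_by,
    ]


args = build_sqoop_import(
    jdbc_url="jdbc:oracle:thin:@centraldb11gr2.rittmandev.com:1521/ctrl11g.rittmandev.com",
    username="blog_refdata",
    password="password",
    query=("SELECT p.post_id, c.cat_name FROM post_one_cat p, categories c "
           "WHERE p.cat_id = c.cat_id AND $CONDITIONS"),
    target_dir="/user/oracle/post_categories",
    hive_table="post_categories",
    split_by="p.post_id",
)
# Single-quote anything the shell would otherwise interpret, such as $CONDITIONS.
command_line = " ".join(shlex.quote(arg) for arg in args)
```

The list form can be passed directly to `subprocess.run`, while `command_line` is the equivalent string for pasting into an OS command step.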
56. Initial Sqoop Invocation to Create Hive Target Table
•Run Sqoop once (from command-line, or from ODI Procedure) to create the target Hive table
•Can then reverse-engineer the table metadata using RKM Hive, to add to Model
•Thereafter, run as part of Package or Load Plan
57. Join Oracle-Sourced Hive Table to Existing Hive Table
•Oracle-sourced reference data in Hive can then be joined to existing Hive table as normal
•Filters, aggregation operators etc can be added to mapping if required
•Use IKM Hive Control Append as integration KM
58. Note - New in ODI12c 12.1.3 - Sqoop KM and HBase KMs
•At the time of writing (May 2014) there was no
official Sqoop support in ODI12c
•ODI 12.1.3 (released July 2014) added a
number of new KMs including
‣IKM SQL to Hive-HBase-File (Sqoop)
‣LKM HBase to Hive
‣IKM Hive to HBase
‣RKM HBase
‣IKM File-Hive to SQL (Sqoop)
•See http://www.ateam-oracle.com/importing-data-from-sql-databases-into-hadoop-with-sqoop-and-oracle-data-integrator-odi/ for details
59. ODI Static and Flow Control : Data Quality and Error Handling
•CKM Hive can be used with IKM Hive to Hive Control Append to filter out erroneous data
•Static controls can be used to create “data firewalls”
•Flow control used in Physical mapping view to handle errors, exceptions
•Example: Filter out rows where IP address is from a test harness
60. Enabling Flow Control in IKM Hive to Hive Control Append
•Check the ENABLE_FLOW_CONTROL option in KM settings
•Select CKM Hive as the check knowledge module
•Erroneous rows will get moved to E_ table in Hive, not loaded into target Hive table
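Conceptually, flow control partitions incoming rows into those that satisfy a constraint and those that do not. The Python 3 sketch below illustrates only that idea; it is not what CKM Hive generates. The helper names and the rule that 192.0.2.x addresses belong to a test harness are hypothetical.

```python
def split_by_constraint(rows, is_valid):
    """Partition rows into (accepted, rejected) using a row-level check."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if is_valid(row) else rejected).append(row)
    return accepted, rejected


def not_test_harness(row):
    """Hypothetical rule: 192.0.2.x addresses belong to a test harness."""
    return not row["hostname"].startswith("192.0.2.")


rows = [
    {"hostname": "198.51.100.7", "post_id": "101"},
    {"hostname": "192.0.2.15", "post_id": "102"},
]
# target_rows stands in for the target table, error_rows for the E_ table.
target_rows, error_rows = split_by_constraint(rows, not_test_harness)
```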
61. Using Hive Streaming and Python for Geocoding Data
•Another requirement we have is to “geocode” the webserver log entries
•Allows us to aggregate page views by country
•Based on the fact that IP ranges can usually be attributed to specific countries
•Not functionality normally found in Hive etc, but can be done with add-on APIs
62. How GeoIP Geocoding Works
•Uses free Geocoding API and database from Maxmind
•Convert IP address to an integer
•Find which integer range our IP address sits within
•But Hive can’t use BETWEEN in a join…
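The integer conversion and range lookup described above can be sketched in Python 3 as follows. The range table is hypothetical (documentation-only address blocks with placeholder country names) and does not reflect the MaxMind database format; `ip_to_int` and `country_for` are illustrative names.

```python
import bisect
import ipaddress


def ip_to_int(address):
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


# Hypothetical range table, sorted by range start: (start, end, country).
RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "Country A"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "Country B"),
]
STARTS = [start for start, _end, _country in RANGES]


def country_for(address):
    """Return the country whose range contains the address, or None."""
    value = ip_to_int(address)
    # Find the last range starting at or before the address, then check its end.
    index = bisect.bisect_right(STARTS, value) - 1
    if index >= 0 and RANGES[index][0] <= value <= RANGES[index][1]:
        return RANGES[index][2]
    return None
```

This is a "between" lookup rather than an equality match, which is why it cannot be expressed as a Hive equi-join and is handled in a script instead.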
63. Solution : IKM Hive Transform
•IKM Hive Transform can pass the output of a Hive SELECT statement through
a perl, python, shell etc script to transform content
•Uses Hive TRANSFORM … USING … AS functionality
hive> add file file:///tmp/add_countries.py;
Added resource: file:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
> using 'add_countries.py'
> as (hostname,request_date,post_id,title,author,category,country)
> from access_per_post_categories;
64. Creating the Python Script for Hive Streaming
•Solution requires a Python API to be installed on all Hadoop nodes, along with geocode DB
wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip
•Python script then parses incoming stdin lines using tab-separation of fields, outputs same
(but with extra field for the country)
#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

# Open the GeoIP country database copied to each node
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')

# Hive streams each row to stdin as one tab-separated line
for line in sys.stdin:
    line = line.rstrip()
    hostname, request_date, post_id, title, author, category = line.split('\t')
    country = gi.country_name_by_addr(hostname)
    # Output order is ..., author, country, category; Hive assigns the
    # column names in the AS clause to these fields by position
    print(hostname + '\t' + request_date + '\t' + post_id + '\t' + title + '\t' + author
          + '\t' + country + '\t' + category)
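Because a streaming script only reads tab-separated lines and writes tab-separated lines, its logic can be tested without Hive or the GeoIP database. The Python 3 sketch below is a variant written for illustration: `add_country` and `fake_lookup` are hypothetical names, the lookup is a stub rather than pygeoip, and unlike the script above it appends the country as the last field.

```python
import io


def add_country(lines, lookup_country):
    """Yield each tab-separated input row with a country field appended."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        hostname = fields[0]
        yield "\t".join(fields + [lookup_country(hostname)])


def fake_lookup(hostname):
    """Stub lookup used only for this illustration (hypothetical data)."""
    return {"192.0.2.10": "Country A"}.get(hostname, "Unknown")


# Simulate the rows Hive would stream to the script's stdin.
stdin = io.StringIO("192.0.2.10\t2014-07-01\t101\tTitle\tAuthor\tBI\n")
output = list(add_country(stdin, fake_lookup))
```

In a real script the generator would be fed `sys.stdin` and each yielded line printed to stdout.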
65. Setting up the Mapping
•Map source Hive table to target table, which includes an extra “country” column
•Copy script + GeoIP.dat file to every node’s /tmp directory
•Ensure all Python APIs and libraries are installed on each Hadoop node
66. Configuring IKM Hive Transform
•TRANSFORM_SCRIPT_NAME specifies name of
script, and path to script
•TRANSFORM_SCRIPT has issues with parsing;
do not use, leave blank and KM will use existing one
•Optional ability to specify sort and distribution
columns (can be compound)
•Leave other options at default
67. Executing the Mapping
•KM automatically registers the script with Hive (which caches it on all nodes)
•HiveQL output then runs the contents of the first Hive table through the script, outputting
results to target table
68. Bulk Unload Summary Data to Oracle Database
•Final requirement is to unload final Hive table contents to Oracle Database
•Several use-cases for this:
•Use Hadoop / BDA for ETL offloading
•Use analysis capabilities of BDA, but then output results to RDBMS data mart or DW
•Permit use of more advanced SQL query tools
•Share results with other applications
•Can use Sqoop for this, or use Oracle Big Data Connectors
•Fast bulk unload, or transparent Oracle access to Hive
69. Oracle Direct Connector for HDFS
•Enables HDFS as a data-source for Oracle Database external tables
•Effectively provides Oracle SQL access over HDFS
•Supports data query, or import into Oracle DB
•Treat HDFS-stored files in the same way as regular files
•But with HDFS’s low cost
•… and fault-tolerance
70. Oracle Loader for Hadoop (OLH)
•Oracle technology for accessing Hadoop data, and loading it into an Oracle database
•Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce
•Direct-path loads into Oracle Database, partitioned and non-partitioned
•Online and offline loads
•Load from HDFS or Hive tables
•Key technology for fast load of
Hadoop results into Oracle DB
71. IKM File/Hive to Oracle (OLH/ODCH)
•KM for accessing HDFS/Hive data from Oracle
•Either sets up ODCH connectivity, or bulk-unloads via OLH
•Map from HDFS or Hive source to Oracle tables (via Oracle technology in Topology)
72. Environment Variable Requirements
•Hardest part in setting up OLH / IKM File/Hive to Oracle is getting environment variables
correct - OLH needs to be able to see correct JARs, configuration files
•Set in /home/oracle/.bashrc - see example below
export HIVE_HOME=/usr/lib/hive
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf:$HIVE_HOME/lib/hive-metastore-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/libthrift.jar:$HIVE_HOME/lib/libfb303-0.9.0.jar:$HIVE_HOME/lib/hive-common-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/hive-exec-0.12.0-cdh5.0.1.jar
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/java/jdk1.7.0_60
export ODI_HIVE_SESSION_JARS=/usr/lib/hive/lib/hive-contrib.jar
export ODI_OLH_JARS=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/ojdbc6.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-utility.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-mapping.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-collation.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraclepki.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_cert.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_core.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/commons-math-2.2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-core-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-mapper-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-1.7.3.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-mapred-1.7.3-hadoop2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraloader.jar,/usr/lib/hive/lib/hive-metastore.jar,/usr/lib/hive/lib/libthrift-0.9.0.cloudera.2.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar,/usr/lib/hive/lib/hive-exec.jar
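Since a single wrong path in these variables is enough to break the load, a quick check that every entry resolves to something on disk can save time. The Python 3 sketch below is a hypothetical helper, not part of ODI or OLH; it assumes a POSIX system, colon-separated `HADOOP_CLASSPATH` and comma-separated `ODI_OLH_JARS` as in the example above, and treats wildcard entries such as `jlib/*` as glob patterns.

```python
import glob
import os


def missing_entries(value, separator):
    """Return the entries of a path-list string that match nothing on disk."""
    missing = []
    for entry in value.split(separator):
        entry = entry.strip()
        # glob.glob handles both plain paths and wildcard entries like jlib/*
        if entry and not glob.glob(entry):
            missing.append(entry)
    return missing


def check_environment(environ=os.environ):
    """Map each variable name to its list of missing path entries."""
    separators = {"HADOOP_CLASSPATH": ":", "ODI_OLH_JARS": ","}
    return {name: missing_entries(environ.get(name, ""), separator)
            for name, separator in separators.items()}
```

Running `check_environment()` in the same shell session that launches the ODI agent reports the entries that need fixing; an empty list for each variable means every path resolved.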
73. Configuring the KM Physical Settings
•For the access table in Physical view, change LKM to LKM SQL Multi-Connect
•Delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect
IKM such as IKM File/Hive to Oracle
74. Configuring the KM Physical Settings
•For the target table, select IKM File/Hive to Oracle
•Only becomes available to select once
LKM SQL Multi-Connect selected for access table
•Key option values to set are:
•OLH_OUTPUT_MODE (use JDBC initially, OCI
if Oracle Client installed on Hadoop client node)
•MAPRED_OUTPUT_BASE_DIR (set to directory
on HDFS that OS user running ODI can access)
75. Executing the Mapping
•Executing the mapping will invoke
OLH from the OS command line
•Hive table (or HDFS file) contents
copied to Oracle table
76. Create Package to Sequence ETL Steps
•Define package (or load plan) within ODI12c to orchestrate the process
•Call package / load plan execution from command-line, web service call, or schedule
77. Execute Overall Package
•Each step executed in sequence
•End-to-end ETL process, using ODI12c’s metadata-driven development process,
data quality handling, heterogeneous connectivity, but Hadoop-native processing
78. Conclusions
•Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture,
analysis and processing
•Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide means to process and
analyse data in parallel, using languages + approach familiar to Oracle developers
•ODI12c provides several benefits when working with ETL and data loading on Hadoop
‣Metadata-driven design; data quality handling; KMs to handle technical complexity
•Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources
•In this presentation, we’ve seen an end-to-end example of big data ETL using ODI
‣The power of Hadoop and BDA, with the ETL orchestration of ODI12c
79. Thank You for Attending!
•Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
•Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com
•Look out for our book, “Oracle Business Intelligence Developers Guide” out now!
•Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)
80. ODI12c as your Big Data Integration Hub
Mark Rittman, CTO, Rittman Mead
July 2014