As presented at OGh SQL Celebration Day in June 2016 in the Netherlands. Covers new features in Big Data SQL, including storage indexes, storage handlers, and the ability to install and license it on commodity hardware
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics - Mark Rittman
This is a session for Oracle DBAs and devs that looks at cutting-edge big data techs like Spark, Kafka etc, and through demos shows how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the... - Mark Rittman
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill along with proprietary solutions from closed-source vendors have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.
In this session we’ll focus on the technical foundations of SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work, along with the more specialised underlying storage each now works best with. We’ll also look to the future to see how SQL querying, data integration and analytics are likely to come together over the next five years to make Hadoop the default platform for running mixed old-world/new-world analytics workloads.
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics - Mark Rittman
Presented at the UKOUG Business Analytics SIG Meeting in April 2016; addresses the question of whether enterprise BI tools such as OBIEE12c are still relevant in the world of Gartner's bimodal (Mode 1 + Mode 2) analytics and hybrid cloud/on-premises deployments
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future? - Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we’ll look at the options that are available, compare database vendor solutions with their open-source alternatives, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete “data fabric” solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Using Oracle Big Data Discovery as a Data Scientist's Toolkit - Mark Rittman
As delivered at Trivadis Tech Event 2016 - how Big Data Discovery along with Python and pySpark was used to build predictive analytics models against wearables and smart home data
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di... - Mark Rittman
As presented at OGh SQL Celebration Day 2016 - including new content on why NoSQL and Hadoop is a better solution for social network analysis than the Oracle Database (for now...)
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products such as Oracle Big Data SQL on the Oracle Big Data Appliance along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we'll look at the options that are available, compare database vendor solutions with their open-source alternatives, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete "data fabric" solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
This talk focuses on what a data reservoir is, how it relates to the RDBMS DW, and how Big Data Discovery provides business and BI users with access to it
A series of tweets I posted about my 11hr struggle to make a cup of tea with my WiFi kettle ended up going viral, got picked up by the national and then international press, and led to thousands of retweets, comments and references in the media. In this session we’ll take the data I recorded on this Twitter activity over the period and use Oracle Big Data Graph and Spatial to understand what caused the breakout and the tweet going viral, who the key influencers and connectors were, and how the tweet spread over time and geography from my original series of posts in Hove, England.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle... - Rittman Analytics
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves, and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform, and why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use... - Mark Rittman
OBIEE12c comes with an updated version of Essbase that focuses entirely in this release on the query acceleration use-case. This presentation looks at this new release and explains how the new BI Accelerator Wizard manages the creation of Essbase cubes to accelerate OBIEE query performance
Innovation in the Data Warehouse - StampedeCon 2016 - StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about which pieces of our architecture worked well, and rant about those which didn’t. No deep Hadoop knowledge is necessary; the talk is pitched at an architect or executive level.
Big data architectures and the data lake - James Serra
With so many new technologies it can be confusing to work out the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs bottom-up approaches to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... - StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Turn Data Into Actionable Insights - StampedeCon 2016 - StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled a collaborative discovery analytics environment for data science teams to perform model development; provisioned data through APIs and streams; deployed models to production through our auto-scaling big data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics at scale; and integrated analytics with our core product platforms to turn data into actionable insights.
Big Data Architecture Workshop - Vahid Amiri - datastack
Big Data Architecture Workshop
This slide deck is about big data tools, technologies and layers that can be used in enterprise solutions.
TopHPC Conference, 2019
Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both wor... - Kolja Manuel Rödel
Looking at the IT landscape of big and medium-sized companies, Hadoop data lakes are no longer a rarity. Classical data warehouses remain on the map as well, so we usually have a hybrid landscape, historically grown and more or less loosely coupled. Gaining value from this setup requires a holistic, use-case-oriented approach. This session presents a best-practice architecture, illustrates the strengths and shortcomings of its components, and discusses which typical challenges can be tackled best by which part.
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
There is a fundamental shift underway in IT to include open, software-defined, distributed systems like Hadoop. As a result, every Oracle professional should strive to learn these new technologies or risk being left behind. This session is designed specifically for Oracle database professionals so they can better understand SQL on Hadoop and the benefits it brings to the enterprise. Attendees will see how SQL on Hadoop compares to Oracle in areas such as data storage, data ingestion, and SQL processing. Various live demos will provide attendees with a first-hand look at these new world technologies. Presented at Collaborate 18.
http://www.opitz-consulting.com
"Spark vs. PL/SQL" war das Thema unserer Experten Christopher Thomsen und Marian Strüby bei der DOAG 2015 Konferenz und Ausstellung.
With Hadoop 2.0, the big data platform opened up to new algorithms and technologies, becoming usable as a basis for in-memory computing, ad-hoc query and streaming applications. Apache Spark is currently establishing itself as the front-runner among Hadoop's general-purpose tools, and is used as the execution framework in a growing number of products, among them the Big Data Connector for Oracle Data Integrator and Oracle Big Data Discovery. This talk shows what Spark is, where and in what form it is used in the new Oracle products, and how ETL processes, tools and data integration in the data warehouse are changing as a result, illustrated by comparing example applications implemented in PL/SQL and in Spark.
About us:
As a leading project specialist for end-to-end IT solutions, we help increase the value of our customers' organisations and bring IT and business into alignment. With OPITZ CONSULTING as a reliable partner, our customers can concentrate on their core business and sustainably secure and extend their competitive advantages.
About our IT consulting: http://www.opitz-consulting.com
Our services: http://www.opitz-consulting.com
Careers at OPITZ CONSULTING: http://www.opitz-consulting.com
Realtime Analytical Query Processing and Predictive Model Building on High Di... - Spark Summit
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row-based columnar datasets through full scans, but they do not provide constructs for column indexing or time series analysis. For document datasets with timestamps, where features are represented as a variable number of columns in each document and use cases demand searching over columns and time to retrieve documents and generate learning models in real time, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a data frame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps in the document-term view for search and allows time filters. Lucene shards maintain the time-aware document-term view for search and a vector space representation for machine learning pipelines. We use Spark as our distributed query processing engine, where each query is represented as a boolean combination over terms with filters on time. LuceneDAO loads the shards onto Spark executors and powers sub-second distributed document retrieval for the queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, along with the latency of the APIs on a suite of queries generated from terms. Key takeaways will be a thorough understanding of how to make Lucene-powered, time-aware search a first-class citizen in Spark, to build interactive analytical query processing and time series prediction algorithms.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Presentation by Mark Rittman, Technical Director, Rittman Mead, on ODI 11g features that support enterprise deployment and usage. Delivered at BIWA Summit 2013, January 2013.
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from, so how do you select the correct tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
Spark, the ultra-fast, general purpose big data computing platform provides some very flexible options for processing and accessing data. In a previous meetup we covered PySpark and the Schema RDD. In this session we reviewed and expanded on this, with an in-depth exploration of Spark SQL.
- Overview of Spark in the Hadoop ecosystem
- Deep dive into Spark SQL, with step-by-step guidance on how to implement and use it
If you have questions about the presentation or want to learn more about our services, please visit our website: http://casertaconcepts.com/
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what Big Data and the Data Lake are, and what the most popular technologies used in the Big Data world are. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and their benefits.
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop - Caserta
In our most recent Big Data Warehousing Meetup, we learned about transitioning from Big Data 1.0, built on Hadoop 1.x and nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization and analytics with end-user access.
For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.
Big Data visualization with Apache Spark and Zeppelin - prajods
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (then an Apache incubator project). Zeppelin is an open source tool for data discovery, exploration and visualization. It supports REPLs for shell, SparkSQL, Spark (Scala), Python and Angular. This presentation was made on the Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform - Hortonworks
Find out how Hortonworks and IBM help you address these challenges and optimize your existing EDW environment.
https://hortonworks.com/webinar/modernize-existing-edw-ibm-big-sql-hortonworks-data-platform/
How Oracle has managed to separate out the SQL engine of its flagship database, which processes the queries, from the access drivers that make it possible to read data, both from files on the Hadoop Distributed File System and from the data warehousing tool, Hive.
In this slidecast, Alex Gorbachev from Pythian presents a Practical Introduction to Hadoop. This is a great primer for viewers who want to get the big picture on how Hadoop works with Big Data and how this approach differs from relational databases.
Watch the presentation: http://inside-bigdata.com/slidecast-a-practical-introduction-to-hadoop/
Download the audio:
Similar to Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Warehouse (20)
What is Big Data Discovery, and how it complements traditional business anal... - Mark Rittman
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015 - Mark Rittman
Presentation given at Oracle Openworld 2015 on moving an existing OBIEE11g BI platform to Oracle Public Cloud, including accompanying DW database and continuing the ETL process. Explores migration process and what's now possible in Oracle Cloud for hosting full OBIEE platforms, and looks at what the benefits of such a migration might be for customers and end-users.
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar... - Mark Rittman
Presentation from the Rittman Mead BI Forum 2015 masterclass, pt.2 of a two-part session that also covered creating the Discovery Lab. Goes through setting up Flume log + Twitter feeds into CDH5 Hadoop using the ODI12c Advanced Big Data Option, then looks at the use of OBIEE11g with Hive, Impala and Big Data SQL before finally using Oracle Big Data Discovery for faceted search and data mashup on top of Hadoop
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015 - Mark Rittman
Slides from a two-day OBIEE11g seminar in Dubai, February 2015, at the Oracle University Expert Summit. Covers the following topics:
1. OBIEE 11g Overview & New Features
2. Adding Exalytics and In-Memory Analytics to OBIEE 11g
3. Source Control and Concurrent Development for OBIEE
4. No Silver Bullets - OBIEE 11g Performance in the Real World
5. Oracle BI Cloud Service Overview, Tips and Techniques
6. Moving to Oracle BI Applications 11g + ODI
7. Oracle Essbase and Oracle BI EE 11g Integration Tips and Techniques
8. OBIEE 11g and Predictive Analytics, Hadoop & Big Data
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c - Mark Rittman
Slides from my 2hr session at the UKOUG Tech'14 Super Sunday event, covering Hadoop basics and use of Oracle Data Integrator 12c for ETL on the Hadoop platform. Also some coverage of Oracle big data product announcements from OOW2014.
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ... - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014
In this presentation we cover some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).
Part 4 - Hadoop Data Output and Reporting using OBIEE11g - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle Big Data Discovery and Big Data SQL can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cMark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos with a distributed data ownership model that assigns clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
2. info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Many customers and organisations are now running initiatives around “big data”
•Some are IT-led and are looking for cost-savings around data warehouse storage + ETL
•Others are “skunkworks” projects in the marketing department that are now scaling-up
•Projects now emerging from pilot exercises
•And design patterns starting to emerge
Many Organisations are Running Big Data Initiatives
3. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Gives us an ability to store more data, at more detail, for longer
•Provides a cost-effective way to analyse vast amounts of data
•Hadoop & NoSQL technologies can give us “schema-on-read” capabilities
•There’s vast amounts of innovation in this area we can harness
•And it’s very complementary to Oracle BI & DW
Why is Hadoop of Interest to Us?
4. info@rittmanmead.com www.rittmanmead.com @rittmanmead 4
•Mark Rittman, Co-Founder of Rittman Mead
‣Oracle ACE Director, specialising in Oracle BI&DW
‣14 Years Experience with Oracle Technology
‣Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
‣Writer for Rittman Mead Blog :
http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
About the Speaker
5. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Flexible Cheap Storage for Logs, Feeds + Social Data
[Diagram: a ~$50k-per-node Hadoop cluster provides flexible, cheap storage for raw data, including voice + chat transcripts, call center logs, chat logs, iBeacon logs, website logs, CRM data, transactions, social feeds and demographics, arriving via real-time feeds, batch and API; SQL-on-Hadoop then serves Customer 360 apps, predictive models and business analytics on top.]
8. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Oracle Big Data Appliance - Engineered System for running Hadoop alongside Exadata
•Oracle Big Data Connectors - utilities from Oracle for feeding Hadoop data into Oracle
•Oracle Data Integrator EE Big Data Option - Add Spark, Pig data transforms to Oracle ODI
•Oracle BI Enterprise Edition - can connect to Hive, Impala for federated queries
•Oracle Big Data Discovery - data wrangling + visualization tool for Hadoop data reservoirs
•Oracle Big Data SQL - extend Oracle SQL
language + processing to Hadoop
Oracle Software Initiatives around Big Data
14. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Where Can SQL Processing Be Useful with Hadoop?
•Hadoop is not a cheap substitute for enterprise DW platforms - don’t use it like this
•But adding SQL processing and abstraction can help in many scenarios:
• Query access to data stored in Hadoop as an archive
• Aggregating, sorting, filtering and transforming data
• Set-based transformation capabilities for other frameworks (e.g. Spark)
• Ad-hoc analysis and data discovery in real time
• Providing tabular abstractions over complex datatypes
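To make the set-based transformation bullet concrete, here is a minimal HiveQL sketch (table and column names are hypothetical) of the kind of batch aggregation step Hive handles well:

-- hypothetical source table of raw web log rows: raw_weblogs(host, request_date, status)
CREATE TABLE IF NOT EXISTS weblog_summary (
  request_date STRING,
  hits         BIGINT,
  error_hits   BIGINT);

-- one set-based statement aggregates the whole dataset
INSERT OVERWRITE TABLE weblog_summary
SELECT request_date,
       COUNT(*)                                       AS hits,
       SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS error_hits
FROM   raw_weblogs
GROUP  BY request_date;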
[Slide aside: "SQL! Though SQL isn't actually relational (according to Chris Date SQL is just mappings; Ted Codd used predicate calculus, and there's never been a truly mainstream relational DBMS), it is the standard language for RDBMSs and it's great for set-based transforms and queries. So: Yes SQL!"]
15. info@rittmanmead.com www.rittmanmead.com @rittmanmead 15
•Originally developed at Facebook, now foundational within the Hadoop project
•Allows users to query Hadoop data using a SQL-like language
•Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON)
•Generates MapReduce code to return required data
•Extensible through SerDes and Storage Handlers
•JDBC and ODBC drivers for most platforms/tools
•Perfect for set-based access + batch ETL work
Apache Hive : SQL Metadata + Engine over Hadoop
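As a minimal illustration of that metadata layer (the HDFS path and column names here are hypothetical), a Hive external table can be declared over delimited files already sitting in HDFS and then queried with ordinary SQL, which Hive compiles into MapReduce jobs:

-- tabular metadata layered over raw files; files are left in place in HDFS
CREATE EXTERNAL TABLE apache_access_log (
  host         STRING,
  request_date STRING,
  request      STRING,
  status       INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/demo/weblogs/';

-- compiled by Hive into a MapReduce job, as in the console output shown later
SELECT status, COUNT(*) FROM apache_access_log GROUP BY status;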
16. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive uses an RDBMS metastore to hold table and column definitions in schemas
•Hive tables then map onto HDFS-stored files
‣Managed tables
‣External tables
•Oracle-like query optimizer, compiler, executor
•JDBC and ODBC drivers, plus CLI etc
How Does Hive Translate SQL into MapReduce?
[Architecture diagram: clients (CLI, Hue, and JDBC/ODBC via the Hive Thrift Server) submit queries to Hive's parser and planner; the execution engine consults the metastore and runs MapReduce jobs against HDFS.]
17. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive uses an RDBMS metastore to hold table and column definitions in schemas
•Hive tables then map onto HDFS-stored files
‣Managed tables
‣External tables
•Oracle-like query optimizer, compiler, executor
•JDBC and ODBC drivers, plus CLI etc
How Does Hive Translate SQL into MapReduce?
hive> select count(*) from src_customer;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp…
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds
[Flow: HiveQL query submitted, executed as a MapReduce job, results returned to the client]
19. info@rittmanmead.com www.rittmanmead.com @rittmanmead 19
•Cloudera’s answer to Hive query response time issues
•MPP SQL query engine running on Hadoop, bypasses MapReduce for
direct data access
•Mostly in-memory, but spills to disk if required
•Uses Hive metastore to access Hive table metadata
•Similar SQL dialect to Hive - not as rich though and no support for Hive
SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
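Because Impala shares the Hive metastore, any existing Hive table can be queried from impala-shell with no extra DDL; a quick sketch, reusing the hypothetical apache_access_log table from the Hive example above:

-- run from impala-shell; executes in the Impala daemons, no MapReduce job is launched
INVALIDATE METADATA apache_access_log;  -- pick up metastore changes made outside Impala
SELECT host, COUNT(*) AS hits
FROM   apache_access_log
WHERE  status = 404
GROUP  BY host
ORDER  BY hits DESC
LIMIT  10;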
20. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Apache Drill is another SQL-on-Hadoop project, one that focuses on schema-free data discovery
•Inspired by Google Dremel; the innovation is querying raw data with schema optional
•Automatically infers and detects schema from semi-structured datasets and NoSQL DBs
•Join across different silos of data e.g. JSON records, Hive tables and HBase database
•Aimed at different use-cases than Hive -
low-latency queries, discovery
(think Endeca vs OBIEE)
Apache Drill - SQL for Schema-Free Data Discovery
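The schema-optional model is easiest to see in a query: Drill can be pointed straight at raw files with no metastore registration at all. A minimal sketch (the path and field name are hypothetical) using Drill's dfs storage plugin:

-- no CREATE TABLE needed; Drill infers the schema from the JSON records at read time
SELECT t.userId, COUNT(*) AS events
FROM   dfs.`/data/events/2016/*.json` t
GROUP  BY t.userId;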
21. info@rittmanmead.com www.rittmanmead.com @rittmanmead 21
•A replacement for Hive, but uses Hive concepts and
data dictionary (metastore)
•MPP (Massively Parallel Processing) query engine
that runs within Hadoop
‣Uses same file formats, security,
resource management as Hadoop
•Processes queries in-memory
•Accesses standard HDFS file data
•Option to use Apache AVRO, RCFile,
LZO or Parquet (column-store)
•Designed for interactive, real-time
SQL-like access to Hadoop
How Impala Works
[Architecture diagram: the BI Server and Presentation Server connect through the Cloudera Impala ODBC driver to Impala daemons running alongside HDFS on every node of a multi-node Hadoop cluster.]
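The column-store option is worth a concrete example: in Impala a text-format table can be rewritten as Parquet with a single CTAS statement, after which scans read only the referenced columns. A sketch, again using the hypothetical apache_access_log table:

-- create a columnar Parquet copy of the text-format table in one statement
CREATE TABLE apache_access_log_parquet
STORED AS PARQUET
AS SELECT * FROM apache_access_log;

-- subsequent queries scan the Parquet copy instead of the raw text files
SELECT request_date, COUNT(*) FROM apache_access_log_parquet GROUP BY request_date;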
24. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Originally part of Oracle Big Data 4.0 (BDA-only)
‣Also required Oracle Database 12c, Oracle Exadata Database Machine
•Extends Oracle Data Dictionary to cover Hive
•Extends Oracle SQL and SmartScan to Hadoop
•Extends Oracle Security Model over Hadoop
‣Fine-grained access control
‣Data redaction, data masking
‣Uses fast C-based readers where possible (vs. Hive MapReduce generation)
‣Map Hadoop parallelism to Oracle PQ
‣Big Data SQL engine works on top of YARN
‣Like Spark, Tez, MR2
Oracle Big Data SQL
[Architecture diagram: SQL queries arrive at the Exadata Database Server; SmartScan runs both on the Exadata Storage Servers and, via Oracle Big Data SQL, on the Hadoop cluster.]
25. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•As with other next-gen SQL access layers, uses common Hive metastore table metadata
•Leverages standard Hadoop APIs for HDFS file access, metadata integration etc
Leverages Hive Metastore and Hadoop file access APIs
26. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Brings query-offloading features of Exadata
to Oracle Big Data Appliance
•Query across both Oracle and Hadoop sources
•Intelligent query optimisation applies SmartScan
close to ALL data
•Use same SQL dialect across both sources
•Apply same security rules, policies,
user access rights across both sources
Extending SmartScan, and Oracle SQL, Across All Data
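"Same security rules across both sources" means Oracle's existing mechanisms simply apply to the Hive-backed external tables; for instance, a redaction policy can be added with DBMS_REDACT just as for an ordinary table. A sketch, with hypothetical schema, table and column names:

BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'BDA',
    object_name   => 'ACCESS_PER_POST_CATEGORIES',  -- an ORACLE_HIVE external table
    policy_name   => 'redact_hostnames',
    column_name   => 'HOSTNAME',
    function_type => DBMS_REDACT.FULL,              -- return the default full mask
    expression    => '1=1');                        -- apply to every query
END;
/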
27. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Read data from HDFS Data Node
‣Direct-path reads
‣C-based readers when possible
‣Use native Hadoop classes otherwise
•Translate bytes to Oracle
•Apply SmartScan to Oracle bytes
‣Apply filters
‣Project columns
‣Parse JSON/XML
‣Score models
How Big Data SQL Accesses Hadoop (HDFS) Data
[Diagram: on each Hadoop Data Node, the Big Data SQL Server's External Table Services read blocks from disk through a RecordReader and SerDe (1), translate the bytes to Oracle format (2), and Smart Scan then filters and projects the result (3).]
28. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•“Query Franchising – dispatch of query processing to self-similar compute agents on
disparate systems without loss of operational fidelity”
•Contrast with OBIEE which provides a query federation capability over Hadoop
•Sends sub-queries to each data source
•Relies on each data source’s native query engine, and resource management
•Query franchising using Big Data SQL ensures consistent resource management
•And contrast with SQL translation tools (i.e. Oracle SQL to Impala)
•Either limits Oracle SQL to the subset that Hive, Impala supports
•Or translation engine has to transform each Oracle feature into Hive, Impala SQL
Query Franchising vs. SQL Translation / Federation
29. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata
‣Linked by Exadata configuration steps to one or more BDA clusters
•DBA_HIVE_TABLES and USER_HIVE_TABLES expose Hive metadata
•Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
View Hive Table Metadata in the Oracle Data Dictionary
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;
DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log
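Column-level metadata is exposed the same way; a follow-on sketch (view and column names as documented for Big Data SQL, but worth checking against your release) shows how Hive types map to Oracle types:

SQL> select table_name, column_name, hive_column_type, oracle_column_type
2 from dba_hive_columns
3 where database_name = 'default';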
30. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Big Data SQL accesses Hive tables through external table mechanism
‣ORACLE_HIVE external table type imports Hive metastore metadata
‣ORACLE_HDFS requires metadata to be specified
•Access parameters cluster and tablename specify Hive table source and BDA cluster
Hive Access through Oracle External Tables + Hive Driver
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
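For comparison, the ORACLE_HDFS driver points directly at HDFS files, so the column metadata that ORACLE_HIVE imports from the Hive metastore must be declared explicitly. A minimal sketch with a hypothetical HDFS path, assuming default delimited-text settings:

CREATE TABLE weblogs_hdfs (
hostname varchar2(100),
request_date varchar2(100),
status number)
organization external
(type oracle_hdfs
default directory default_dir
location ('/user/demo/weblogs/'));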
31. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Run normal Oracle SQL from the Oracle Database server
•Big Data SQL query franchising then uses agents on Hadoop nodes to query and return
data independent of YARN scheduling; Oracle Database combines and returns full results
Running Oracle SQL on Hadoop Data Nodes
SELECT w.sess_id,w.cust_id,c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND c.customer_id = w.cust_id
32. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•OBIEE can access Hadoop data via Hive, but it’s slow
•(Impala only has subset of Oracle SQL capabilities)
•Big Data SQL presents all data to OBIEE as Oracle data, with full advanced analytic
capabilities across both platforms
Example : Combining Hadoop + Oracle Data for BI
[Report example: a Hive weblog activity table joined to Oracle dimension lookup tables, with the combined output shown in report form.]
33. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Not all functions can be offloaded to Hadoop tier
•Even for non-offloadable operations Big Data SQL will perform column pruning and
datatype conversion (which saves a lot of resources)
•Other operations (non-offloadable) will be done on the database side
•Requires Oracle Database 12.1.0.2 + patchset, and per-disk licensing for Big Data SQL
•You need an Oracle Big Data Appliance, and Oracle Exadata, to use Big Data SQL
Restrictions when using Oracle Big Data SQL
-- lists the SQL functions whose processing can be offloaded to the Hadoop tier
SELECT NAME FROM v$sqlfn_metadata WHERE offloadable = 'YES';
34. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•From Big Data SQL 3.0, commodity hardware can be used instead of BDA and Exadata
•Oracle Database 12.1.0.2 on x86_64 with Jan/Apr Proactive Bundle Patches
•Cloudera CDH 5.5 or Hortonworks HDP 2.3 on RHEL/OEL6
•See MOS Doc ID 2119369.1 - note cannot mix Engineered/Non-Engineered platforms
Running Big Data SQL on Commodity Hardware
35. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•No functional differences when running Big Data SQL on commodity hardware
•External table capability lives with the database, and the performance functionality with
the BDS cell software.
•All BDS features (SmartScan, offloading, storage indexes etc still available)
•But hardware can be a factor now, as we’re pushing processing down and data up the wire
•1Gb Ethernet can be too slow, 10Gb is a minimum (there's no InfiniBand, unlike the engineered systems)
•If you run on an undersized system you may see bottlenecks on the DB side.
Big Data SQL on Commodity Hardware Considerations
38. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Subsequent releases of Big Data SQL have extended its Hadoop capabilities
•Support for Hive storage handlers (HBase, MongoDB etc)
•Hive partition elimination
•Better, more efficient access to Hadoop data
•Storage Indexes
•Predicate Push-Down for Parquet, ORC, HBase, Oracle NoSQL
•Bloom Filters
•Coming with Oracle Database 12.2
•Big Data-aware optimizer
•Dense Bloom Filters
•Oracle managed Big Data partitions
Going beyond Fast Unified Query Access to HDFS Data
39. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive Storage handlers give Hive the ability
to access data from non-HDFS sources
•MongoDB
•HBase
•Oracle NoSQL database
•Run HiveQL queries against NoSQL DBs
•From BDS 1.1, Hive storage handlers can
be used with Big Data SQL
•Only MongoDB, HBase and NoSQL
currently “supported”
•Others should work but not tested
Big Data SQL and Hive Storage Handlers
40. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Create Hive table over HBase database as normal
•Typically done to add INSERT and DELETE capabilities to Hive, for DW dimension ETL
•Create Oracle external table as normal, using ORACLE_HIVE driver
Use of Hive Storage Handlers Transparent to BDS
CREATE EXTERNAL TABLE tablename (colname coltype[, colname coltype, ...])
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  -- uses the HBaseSerDe internally
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'hbase.columns.mapping' = ':key,value:key,value:...');
CREATE TABLE tablename(colname colType[, colname colType...])
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(access parameters)
)
REJECT LIMIT UNLIMITED;
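Filling the two templates in with hypothetical names makes the flow clearer: a Hive table is declared over an HBase table, and Big Data SQL then sees it through ORACLE_HIVE like any other Hive table. A sketch:

-- Hive side: map an HBase table (row key plus one column family) to a Hive table
CREATE EXTERNAL TABLE customer_kv (cust_id STRING, cust_name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name')
TBLPROPERTIES ('hbase.table.name' = 'customers');

-- Oracle side: identical ORACLE_HIVE DDL to an HDFS-backed Hive table
CREATE TABLE customer_kv (cust_id VARCHAR2(50), cust_name VARCHAR2(200))
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.customer_kv))
REJECT LIMIT UNLIMITED;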
41. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•From Big Data SQL 2.0, Storage Indexes
are automatically created in Big Data SQL
agents
•Check index before reading blocks – Skip
unnecessary I/Os
•An average of 65% faster than BDS 1.x
•Up to 100x faster for highly selective
queries
•Columns in SQL are mapped to fields in
the HDFS file via External Table Definitions
•Min / max value is recorded for each
HDFS Block in a storage index
Big Data SQL Storage Indexes
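The effect is easiest to see on a highly selective predicate; in the sketch below (hypothetical table and column), each Big Data SQL agent compares the literal against the min/max recorded per HDFS block in the storage index and skips every block whose range cannot contain it:

-- only HDFS blocks whose [min, max] range for CUST_ID covers 1234567 are read
SELECT * FROM web_logs WHERE cust_id = 1234567;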
42. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hadoop supports predicate push-down through several mechanisms (filetypes, Hive
partition pruning etc)
•Original BDS 1.0 supported Hive predicate push-down as part of SmartScan
•BDS 3.0 extends this by pushing SARGable (Search ARGument ABLE) predicates
•Into Parquet and ORCFile to reduce I/O when
reading files from disk
•Into HBase and Oracle NoSQL Database
to drive subscans of data from remote DB
•Oracle Database 12.2 will add more optimisations
•Columnar-caching
•Big Data-aware query optimizer
•Managed Hadoop partitions
•Dense Bloom Filters
Extending Predicate Push-Down Beyond Hive
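A short sketch of what "SARGable" means in practice (hypothetical table): a plain comparison on a column can be evaluated inside the Parquet or ORC reader, so non-matching row groups are never read, whereas wrapping the column in a function defeats the push-down:

-- pushed down into the Parquet reader: only matching row groups are scanned
SELECT * FROM movie_ratings WHERE rating >= 4;

-- not SARGable: the function on the column forces reading every row
SELECT * FROM movie_ratings WHERE ROUND(rating) >= 4;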
43. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Typically a one-way street - queries run in Hadoop but results delivered through Oracle
•What if you want to load data into Hadoop, update data, do Hadoop>Hadoop transforms?
•Still requires formal Hive metadata, whereas the direction of travel is towards Drill & schema-free queries
•What if you have other RDBMSs as well as Oracle RDBMS?
•Trend is towards moving all high-end analytic workloads into Hadoop - BDS is Oracle-only
•Requires Oracle 12c database, no 11g support
•And cost … BDS is $3k/Hadoop disk drive
•Can cost more than an Oracle BDA
•A high-end, high-cost, Oracle-centric solution, of course!
… So What’s the Catch?