There is a fundamental shift underway in IT to include open, software defined, distributed systems like Hadoop. As a result, every Oracle professional should strive to learn these new technologies or risk being left behind. This session is designed specifically for Oracle database professionals so they can better understand SQL on Hadoop and the benefits it brings to the enterprise. Attendees will see how SQL on Hadoop compares to Oracle in areas such as data storage, data ingestion, and SQL processing. Various live demos will provide attendees with a first-hand look at these new world technologies. Presented at Collaborate 18.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats aim to solve problems that have long stood in the traditional data lake, with declared features such as ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
Data Engineer's Lunch #55: Get Started in Data Engineering (Anant Corporation)
In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.
Accompanying Blog: Coming Soon!
Accompanying YouTube: Coming Soon!
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
Ingesting Data at Blazing Speed Using Apache ORC (DataWorks Summit)
Big SQL is a SQL engine for Hadoop that excels at performance and scalability at high concurrency. Big SQL complements and integrates with Apache Hive for both data and metadata. An architecture that separates compute from storage allows Big SQL to support multiple open data formats natively. Until recently, Parquet provided a significant performance advantage over other data formats for SQL on Hadoop. The landscape changed when ORC became a top level Apache project independent from Hive. Gone were the days of reading ORC files using slow, single-row-at-a-time Hive SerDes. The new vectorized APIs in the Apache ORC libraries make it possible to ingest ORC data at blazing speed. This talk is about the journey leading to ORC taking the crown of best performing data format for Big SQL away from Parquet. We'll have a look under the hood at the architecture of Big SQL ORC readers, and how to tune them. We'll share lessons learned in walking the fine line between maximizing performance at scale and avoiding dreaded Java OOMs. You'll learn the techniques that SQL engines use for fast data ingestion, so that you can leverage the full potential of Apache ORC in any application.
Speaker:
Gustavo Arocena, Big Data Architect, IBM
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
Gluent New World #02 - SQL-on-Hadoop: A bit of History, Current State-of-the... (Mark Rittman)
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill, along with proprietary solutions from closed-source vendors, have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.
In this session we’ll focus on the technical foundations of SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work, along with the more specialised underlying storage that each now works best with. We’ll also take a look to the future to see how SQL querying, data integration and analytics are likely to come together over the next five years to make Hadoop the default platform for running mixed old-world/new-world analytics workloads.
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016 (StampedeCon)
Spark 2.0 includes many exciting new features including Structured Streaming, and the unification of Datasets (new in 1.6) with DataFrames. Structured Streaming allows one to define recurrent queries on a stream of data that is handled as an infinite DataFrame. This query is incrementally updated with new data. This allows for code reuse between batch and streaming and an easier logical model to reason about. Datasets, an extension of DataFrames, were added as an experimental feature in Spark 1.6. They allow us to manipulate collections of objects in a type-safe fashion. In Spark 2.0 the two abstractions have been unified and now DataFrame = Dataset[Row]. We will discuss both of these new features and look at practical real world examples.
Building the Data Lake with Azure Data Factory and Data Lake Analytics (Khalid Salama)
In essence, a data lake is a commodity distributed file system that acts as a repository to hold raw data file extracts of all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides the means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then we discuss how to build an Azure Data Factory pipeline to ingest data into the data lake. After that, we move into big data processing using Data Lake Analytics, and we delve into U-SQL.
Innovation in the Data Warehouse - StampedeCon 2016 (StampedeCon)
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about which pieces of our architecture worked well and rant about those which didn’t. No deep Hadoop knowledge is necessary; the talk is pitched at an architect or executive level.
RDX Insights Presentation - Microsoft Business Intelligence (Christopher Foot)
May's RDX Insights Series Presentation focuses on Microsoft's BI products. We begin with an overview of Power BI, SSIS, SSAS and SSRS and how the products integrate with each other. The webinar continues with a detailed discussion on how to use Power BI to capture, model, transform, analyze and visualize key business metrics. We’ll finish with a Power BI demo highlighting some of its most beneficial and interesting features.
These are the slides for my talk "An intro to Azure Data Lake" at Azure Lowlands 2019. The session was held on Friday January 25th from 14:20 - 15:05 in room Santander.
Details:
• DevOps and Business Intelligence?
• CI/CD Pipelines: What are they?
• Database Deployments: State based vs Migration based
• Snowflake features for CI/CD
• Azure DevOps: Build and Release Pipelines
• Putting it all together: End to End solution
• Demo
Top Trends in Building Data Lakes for Machine Learning and AI (Holden Ackerman)
Presentation by Ashish Thusoo, Co-Founder & CEO at Qubole, exploring the big data industry trends in moving from data warehouses to cloud-based data lakes. This presentation will cover how companies today are seeing a significant rise in the success of their big data projects by moving to the cloud to iteratively build more cost-effective data pipelines and new products with ML and AI.
Uncovering how services like AWS, Google, Oracle, and Microsoft Azure provide the storage and compute infrastructure to build self-service data platforms that can enable all teams and new products to scale iteratively.
A presentation on the upcoming Oracle Database 18c, which incorporates the best of Oracle's technologies, thus shaping an autonomous database.
Spark is a fast and general engine for large-scale data processing which can solve all of your problems.
… Or can it?
This talk will cover real-world issues encountered during migration of the existing product to Spark infrastructure.
Aimed at software engineers who have just started to evaluate Spark, or those who are already using it.
This deck leans toward Hadoop/Hive installation experience and ecosystem concepts. The content of these slides is derived from an as-yet-unpublished book, Fundamentals of Big Data.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Big Data and New Challenges for DBAs (Michael Naumov, LivePerson)
Hadoop has become a popular platform for managing large datasets of structured and unstructured data. It does not replace existing infrastructures, but instead augments them. Most companies will still use relational databases for transactional processing and low-latency queries, but can benefit from Hadoop for reporting, machine learning or ETL. This session will cover:
What is Hadoop and why do I care?
What do people do with Hadoop?
How can SQL Server DBAs add Hadoop to their architecture?
Data Warehouse - Incremental Migration to the Cloud (Michael Rainey)
A data warehouse (DW) migration is no small undertaking, especially when moving from on-premises to the cloud. A typical data warehouse has numerous data sources connecting and loading data into the DW, ETL tools and data integration scripts performing transformations, and reporting, advanced analytics, or ad-hoc query tools accessing the data for insights and analysis. That’s a lot to coordinate and the data warehouse cannot be migrated all at once. Using a data replication technology such as Oracle GoldenGate, the data warehouse migration can be performed incrementally by keeping the data in-sync between the original DW and the new, cloud DW. This session will dive into the steps necessary for this incremental migration approach and walk through a customer use case scenario, leaving attendees with an understanding of how to perform a data warehouse migration to the cloud.
Presented at RMOUG Training Days 2019
Continuous Data Replication into Cloud Storage with Oracle GoldenGate (Michael Rainey)
Continuous flow. Streaming. Near real-time. These are all terms used to identify the business’s need for quick access to data. It’s a common request, even if the data must flow from on-premises to the cloud. Oracle GoldenGate is the data replication solution built for fast data. In this session, we’ll look at how GoldenGate can be configured to extract transactions from the Oracle database and load them into a cloud object store, such as Amazon S3. There are many different use cases for this type of continuous load of data into the cloud. We’ll explore these solutions and the various tools that can be used to access and analyze the data from the cloud object store, leaving attendees with ideas for implementing a full source-to-cloud data replication solution.
Presented at ITOUG Tech Days 2019
Going Serverless - an Introduction to AWS Glue (Michael Rainey)
Going "serverless" is the latest technology trend for enterprises moving their processing to the cloud, including data integration and ETL tools. But what does that mean and when should I use serverless ETL? In this session, we'll dive into the world of Amazon's fully managed data processing service called AWS Glue. With no server to provision or resources to allocate, and an easy to populate metadata catalog, AWS Glue allows the data engineer to focus on his or her craft; building data transformations and pipelines. Gaining an understanding of the similarities and differences between traditional ETL tools, such as Oracle Data Integrator, and Glue will prepare attendees for the new world of data integration. Presented at Collaborate 18.
Offload, Transform, and Present - the New World of Data Integration (Michael Rainey)
How much time and effort (and budget) do organizations spend moving data around the enterprise? Unfortunately, quite a lot. These days, ETL developers are tasked with performing the Extract (E) and Load (L), and spending less time on their craft, building Transformations (T). This changes in the new world of data integration. By offloading data from the RDBMS to Hadoop, with the ability to present it back to the relational database, data can be seamlessly integrated between different source and target systems. Transformations occur on data offloaded to Hadoop, using the latest ETL technologies, or in the target database, with a standard ETL-on-RDBMS tool. In this session, we’ll discuss how the new world of data integration will provide focus on transforming data into insightful information by simplifying the data movement process.
Presented at Enkitec E4 2017.
Oracle Data Integrator 12c - Getting Started (Michael Rainey)
I think it’s time for a fresh look at Oracle Data Integrator 12c. What is ODI? How has it evolved over the years and where is it going? And, of course, how do you get started with Oracle Data Integrator? I plan to share what I love about ODI, how to get started building your first ODI project, and what makes Oracle Data Integrator 12c the premier ETL and data warehousing tool on the market. It’s time to get back to the basics!
Presented at UTOUG Training Days 2017.
As a data integration professional, it’s almost a guarantee that you’ve heard of real-time stream processing of Big Data. The usual players in the open source world are Apache Kafka, used to move data in real-time, and Spark Streaming, built for in-flight transformations. But what about relational data? Quite often we forget that products incubated in the Apache Foundation can also serve a purpose for “standard” relational databases as well. But how? Well, let’s introduce Oracle GoldenGate and Oracle Data Integrator for Big Data. GoldenGate can extract relational data in real time and produce Kafka messages, ensuring relational data is a part of the enterprise data bus. These messages can then be ingested via ODI through a Spark Streaming process, integrating with additional data sources, such as other relational tables, flat files, etc, as needed. Finally, the output can be sent to multiple locations: on through to a data warehouse for analytical reporting, back to Kafka for additional targets to consume, or any number of targets. Attendees will walk away with a framework on which they can build their data streaming projects, combining relational data with big data and using a common, structured approach via the Oracle Data Integration product stack.
Presented at BIWA Summit 2017.
Oracle Data Integrator 12c - Getting Started (Michael Rainey)
I think it’s time for a fresh look at Oracle Data Integrator 12c. What is ODI? How has it evolved over the years and where is it going? And, of course, how do you get started with Oracle Data Integrator? I plan to share what I love about ODI, how to get started building your first ODI project, and what makes Oracle Data Integrator 12c the premier ETL and data warehousing tool on the market. It’s time to get back to the basics!
Presented at BIWA Summit 2017
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration (Michael Rainey)
Big Data integration is an excellent feature in the Oracle Data Integration product suite (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality). But not all analytics require big data technologies, such as labor cost, revenue, or expense reporting. Ralph Kimball, an original architect of the dimensional model in data warehousing, spent much of his career working to build an enterprise data warehouse methodology that can meet these reporting needs. His book, "The Data Warehouse ETL Toolkit", is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories; Extracting, Cleaning & Conforming, Delivering, and Managing, describing how the Oracle Data Integration products are perfectly suited for the Kimball approach. Presented at KScope16.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution. Presented at KScope16.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data. Some of this data comes in the form of business transactions and is stored in a relational database. This relational data is often combined with other non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for us as data integration professionals is to then combine this data and transform it into something useful. Not just that, but we must also do it in near real-time and using a big data target system such as Hadoop. The topic of this session, real-time data streaming, provides us a great solution for that challenging task. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution. This session will walk through the implementation of GoldenGate and Kafka.
Presented at Collaborate16 in Las Vegas.
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration - Coll... (Michael Rainey)
Big Data integration is an excellent feature in the Oracle Data Integration product suite (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality). But not all analytics require big data technologies, such as labor cost, revenue, or expense reporting. Ralph Kimball, an original architect of the dimensional model in data warehousing, spent much of his career working to build an enterprise data warehouse methodology that can meet these reporting needs. His book, "The Data Warehouse ETL Toolkit", is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories; Extracting, Cleaning & Conforming, Delivering, and Managing, describing how the Oracle Data Integration products are perfectly suited for the Kimball approach.
Presented at Collaborate16 in Las Vegas.
Practical Tips for Oracle Business Intelligence Applications 11g Implementations (Michael Rainey)
It's time to move to the Oracle Data Integrator version of Oracle Business Intelligence Applications! This has been the theme since it was recently announced that Oracle BI Applications 11g will move forward without updating functionality in the Informatica version of the product. Implementing Oracle BI Applications can be quite a challenge, specifically with Oracle Data Integrator being the “new” ETL tool. This session will provide attendees with practical tips, based on real-world experience, to help them get started with their implementation. How and why to use Oracle GoldenGate, high availability considerations, disaster recovery setup, and other functional and design factors will be covered, enhancing the attendee's ability to make the best design decisions for their BI Applications project.
Presented at KScope15 & NWOUG 2015.
GoldenGate and Oracle Data Integrator - A Perfect Match - Upgrade to 12c (Michael Rainey)
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Over the last few years, many real-time data integration solutions have been developed using the 11g version of the products. Now that 12c is available, it’s time to upgrade. This discussion will provide best practices on how to prepare, perform, and validate your data warehouse upgrade to GoldenGate and ODI 12c. Help kickstart your move to 12c with tips, tricks, and lessons learned from successful migrations.
Presented at KScope15.
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors (Michael Rainey)
Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.
Presented at the Oracle Technology Network Virtual Technology Summit February/March 2015.
Tame Big Data with Oracle Data Integration (Michael Rainey)
In this session, Oracle Product Management covers how Oracle Data Integrator and Oracle GoldenGate are vital to big data initiatives across the enterprise, providing the movement, translation, and transformation of information and data not only heterogeneously but also in big data environments. Through a metadata-focused approach for cataloging, defining, and reusing big data technologies such as Hive, Hadoop Distributed File System (HDFS), HBase, Sqoop, Pig, Oracle Loader for Hadoop, Oracle SQL Connector for Hadoop Distributed File System, and additional big data projects, Oracle Data Integrator bridges the gap in the ability to unify data across these systems and helps deliver timely and trusted data to analytic and decision support platforms.
Co-presented with Alex Kotopoulis at Oracle OpenWorld 2014.
Real-time Data Warehouse Upgrade – Success Stories (Michael Rainey)
Providing a real-time BI solution for its global customers and operations department is a necessity for IFPI, the International Federation of the Phonographic Industry, whose primary objective is to safeguard the rights of record producers through various anti-piracy strategies.
For the data warehousing team at IFPI, using Oracle Streams and Oracle Warehouse Builder (OWB) for real-time data replication and integration was becoming a challenge. The solution was difficult to maintain and overall throughput was degrading as data volumes increased. The need for greater stability and performance led IFPI to implement Oracle GoldenGate and Oracle Data Integrator.
Co-presented with Nick Hurt at Rittman Mead BI Forum 2014 and KScope14.
We all have business user requirements that drive our data warehouse and business intelligence system implementation. Delivered in various formats, these documents, emails, chats, or meeting minutes can be interpreted many different ways, depending on the reader. An effective approach to end the miscommunication...draw a picture. When properly constructed, an architecture diagram, data model, or process model will have only a single meaning. That is why a picture can replace a thousand words.
Presented as a TED style talk at Rittman Mead BI Forum 2014.
GoldenGate and Oracle Data Integrator - A Perfect Match... (Michael Rainey)
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Following Oracle’s Next Generation Reference Data Warehouse Architecture, this discussion will provide best practices on how to configure, implement, and process data in real-time using ODI and GoldenGate. Attendees will see common real-time challenges solved, including parent-child relationships within micro-batch ETL.
Presented at Rittman Mead BI Forum 2013 Masterclass.
GoldenGate and ODI - A Perfect Match for Real-Time Data Warehousing (Michael Rainey)
Oracle Data Integrator and Oracle GoldenGate excel as standalone products, but paired together they are the perfect match for real-time data warehousing. Following Oracle’s Next Generation Reference Data Warehouse Architecture, this discussion will provide best practices on how to configure, implement, and process data in real-time using ODI and GoldenGate. Attendees will see common real-time challenges solved, including parent-child relationships within micro-batch ETL.
Presented at RMOUG Training Days 2013 & KScope13.
6. RDBMS architecture
[Diagram: DB App 1, DB App 2 ... DB App N all read from a shared SAN storage array of disks; access to storage is the bottleneck]
RDBMS requires data to be “moved” to the application for processing.
Running out of space? Expand the SAN.
7. HDFS scalability
[Diagram: Hadoop Distributed File System (HDFS) with a NameNode (plus a second NameNode) holding the metadata, and DataNodes spread across Node 1, Node 2 ... Node N; the client asks the NameNode “Where is my data?” and then accesses the data directly on the DataNodes]
Running out of space? Add nodes.
8. HDFS scalability and performance
[Diagram: the same HDFS layout, with local disks attached to each DataNode; computation occurs where the data lives]
“Moving Computation is Cheaper than Moving Data”
9. How data storage works in HDFS
[Diagram: a data file is split into Block 1, Block 2, and Block 3, with copies of each block spread across Nodes 1-5]
The NameNode determines where blocks will be stored.
Blocks that become unavailable will be replicated from other nodes, ensuring at least 3 copies exist.
10. Scalable and open!
• Scalability in Software!
• Open Data Formats
• Future-proof!
• One Data, Many Engines! (Hive, Impala, Spark, Presto, Giraph)
SQL-on-Hadoop is only one of many applications of Hadoop:
- Graph Processing
- Full text search
- Image Processing
- Log processing
- Streams
11. Data processing examples
[Diagram: where each processing step runs in three architectures - RDBMS + SAN (App Server, Database Server, Storage Array), Oracle Exadata (App Server, Database Server, Storage Cell), and Hadoop + data processing framework (app code + Spark* on HDFS)]
Processing steps: Read Data from Disks; Filter Data ("where sal > 50000"); Simple Processing: Aggregate, Join, Sort, ...; Complex Processing: Business Logic, Analytics.
12. HDFS considerations
• HDFS is designed for "write once, read many" use cases
• Large sequential streaming reads (scans) of files
• Not optimal for random seeks and small IOs
• "Inserts" are appended to the end of the file
• No updates (by default)
• HDFS performs the replication ("mirroring") and is designed anticipating frequent DataNode failures
Various other applications (HBase, Kudu, etc.) allow for updates and random lookups, resolve some of the other “tradeoffs”, and should be considered.
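In SQL terms, on a plain HDFS-backed Hive or Impala table (hypothetical table names for illustration), new rows arrive as appended files, while in-place updates are generally unavailable unless you move to ACID tables or a storage layer such as Kudu:
-- Appends: each INSERT ... SELECT writes new files under the table's HDFS directory
INSERT INTO sales_archive
SELECT * FROM sales_staging;
-- In-place updates typically fail on a plain HDFS-backed table
-- (they require Hive ACID tables or Kudu-backed tables)
UPDATE sales_archive SET amount = 0 WHERE sale_id = 42;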
13. Data formats supported by Hadoop
• Standard file formats
  • Plain text (CSV, tab-delimited)
  • Structured text (XML, JSON)
  • Binary (images)
• Serialization formats
  • Avro
• Columnar formats
  • Parquet
  • ORC
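In Hive or Impala the file format is simply declared when the table is created; a small illustrative sketch (hypothetical table names) showing text versus columnar storage:
-- Plain text (CSV-style) table
CREATE TABLE web_logs_text (log_ts STRING, url STRING, bytes BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- The same data rewritten into a columnar format for analytic scans
CREATE TABLE web_logs_parquet STORED AS PARQUET
AS SELECT * FROM web_logs_text;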
15. Myth #1: Hadoop is MapReduce
• MapReduce is just one component of Hadoop
• In Hadoop v1, MapReduce was THE component for data retrieval and resource management
• Hadoop v2 includes YARN, and the open source community has built up MapReduce data access alternatives
• MapReduce is not the only way to access data
• You don’t need an army of Java developers to use Hadoop
• SQL engines on Hadoop make use of a familiar syntax for accessing data
• MapReduce is rarely used in practice or by other software (e.g., Hive on Spark)
Hadoop != MapReduce
16. Myth #2: SQL on Hadoop is basically Hive
• There are many more SQL engines on Hadoop that can provide SQL-like data access, just as Hive does
SQL on Hadoop is rapidly evolving!
17. Hadoop ecosystem
• Hadoop is not a single product, but an ecosystem
• Mostly (Apache) open source
• Every component/layer has multiple alternatives
• Different vendor distributions prefer different components
  • SQL-on-Hadoop: Hive, Impala, Drill, SparkSQL, Presto, Phoenix
  • Storage subsystems: HDFS, Kudu, S3, MapR-FS, EMC Isilon
  • Resource management: YARN, Mesos, Myriad, MapReduce v1
• Hadoop Ecosystem Table
  • A good reference for looking up component names & what they do
  • https://hadoopecosystemtable.github.io/
19. Example syntax: Impala SQL Language
[WITH name AS (select_expression) [, ...] ]
SELECT
  [ALL | DISTINCT]
  [STRAIGHT_JOIN]
  expression [, expression ...]
FROM table_reference [, table_reference ...]
  [[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER | [LEFT | RIGHT] SEMI | [LEFT | RIGHT] ANTI | CROSS]
    JOIN table_reference [ON join_equality_clauses | USING (col1[, col2 ...])] ...]
WHERE conditions
GROUP BY { column | expression [, ...] }
HAVING conditions
ORDER BY { column | expression [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
LIMIT expression [OFFSET expression]
[UNION [ALL] select_statement] ...
SQL-like syntax, very familiar to relational database developers!
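Put concretely, a query written against that syntax might look like the following (illustrative tables, consistent with the sales/customers examples on later slides):
SELECT c.state, SUM(s.amount) AS total_amount
FROM sales s
  JOIN customers c ON s.cust_id = c.cust_id
WHERE s.channel = 'Online'
GROUP BY c.state
HAVING SUM(s.amount) > 10000
ORDER BY total_amount DESC
LIMIT 10;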
20. SQL Execution on Hadoop: MPP Distributed processing (ideal)
SELECT state, SUM(amount)
FROM sales
WHERE channel = 'Online'
GROUP BY state
[Diagram: Node 1 through Node N each Scan, Filter, and Aggregate their local disk data in parallel (GB/s per node); each node sends its intermediate result set to the coordinator (“query master”) for re-aggregation, and a HiveServer2 Thrift protocol listener returns the final aggregate]
Apache Hive, Impala, and Spark all use the HiveServer2 Thrift protocol.
A global view of entire cluster data.
21. Scalability – Additive Aggregations
• SUM, COUNT, AVG, MIN, MAX
• Each cluster node can compute a SUM on its local data
• The coordinator performs a final global SUM (by summing together the local SUMs)
SELECT state, SUM(amount)
FROM sales
WHERE channel = 'Online'
GROUP BY state
[Diagram: local aggregation happens in parallel against each node's disks, producing intermediate results such as CA = 10, NY = 5, FL = 1 on one node and CA = 7, SC = 4, NY = 3 on another; the “Query Master” performs the serial global/final aggregation, producing the final results CA = 17, NY = 8, FL = 1, SC = 4]
22. SQL Execution on Hadoop: MPP Distributed processing (big queries)
SELECT state, COUNT(DISTINCT prod_id)
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
AND c.cust_age > 20
GROUP BY state
[Diagram: the same Node 1 through Node N Scan/Filter/Aggregate layout, with the “query master” and HiveServer2 Thrift protocol listener producing the final aggregate; this time the nodes need to exchange data (shuffle) to see the full picture]
A global view of entire cluster data.
23. Scalability – Broadcast Joins
• In a distributed system with “random” physical placement of data in shards, a join must be able to compare all data from both its sides (after any direct filters)
• If one dataset is small enough, it can be broadcasted and kept in memory on all nodes for node-local joining (broadcast join or replicated join)
[Diagram: a large fact table (logical layout) spread across Node 1 through Node 9; the small dimension table (logical layout) is broadcast to all nodes]
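To make the idea concrete: Impala normally chooses the join strategy from table statistics, but it can be nudged with a join hint. This is only an illustrative sketch with made-up table names, and the exact hint syntax should be checked against your Impala version:
-- Replicate the small dimension table to every node (broadcast/replicated join)
SELECT c.state, SUM(s.amount)
FROM sales s
  JOIN /* +BROADCAST */ customers c ON s.cust_id = c.cust_id
GROUP BY c.state;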
24. Scalability – Large Fact-to-Fact Joins
• If both sides of the join are too big for broadcasting & keeping in each node’s memory
[Diagram: a large fact table (logical layout) and another large table (logical layout), each spread across Node 1 through Node 9; non-overlapping subsets of data are hashed and shuffled to the other join side, using a hash function on the join column]
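When both sides are large, the partitioned (shuffle) join described above can likewise be requested with a hint; again a hedged sketch with hypothetical tables, to be verified against your Impala version:
-- Hash-partition both inputs on the join key and shuffle matching subsets to the same node
SELECT o.order_month, COUNT(*)
FROM orders o
  JOIN /* +SHUFFLE */ order_items i ON o.order_id = i.order_id
GROUP BY o.order_month;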
25. How to improve/combat/fix these issues
• Pre-join large fact tables (that are frequently joined)
  • Denormalization!
  • Hadoop storage is cheap – it’s affordable to keep multiple copies
• Columnar data formats
  • Queries access & scan only narrow slices of the wide table
  • De-normalized duplicate values compress well thanks to columnar compression
• Approximate Distinct operations
  • HyperLogLog –> COUNT(DISTINCT) vs NDV()
• Partition-wise joins help with memory usage
  • Hive calls it bucketed map join (SMB map join) – shuffling will still happen
  • On Impala, currently the SQL query must join individual partitions (union all)
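To illustrate the approximate-distinct bullet: in Impala, NDV() returns a HyperLogLog-style estimate of the number of distinct values and is typically far cheaper than an exact COUNT(DISTINCT) at scale (using the sales table from the earlier example slides):
-- Exact, but memory- and shuffle-intensive on large tables
SELECT state, COUNT(DISTINCT prod_id) FROM sales GROUP BY state;
-- Approximate distinct count via NDV(), usually much cheaper
SELECT state, NDV(prod_id) FROM sales GROUP BY state;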
26. Apache Impala
• Provides high-performance, low-latency SQL queries
• Enables interactive exploration and fine-tuning of analytic queries
• Multiple open data formats (Parquet, Avro, CSV, etc.) and storage systems (HDFS, Kudu, S3, etc.) can be accessed via Impala
• Uses the Hive metastore – the “data dictionary” for Hadoop
  • Stores table structure metadata (columns, datatypes) and file location in HDFS
  • Serves two purposes: data abstraction and data discovery
• Impala uses the same metadata (Hive metastore), SQL syntax, drivers, and UI (Hue) as Hive
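As a rough sketch of what the metastore holds, an external table can be declared over files that already sit in HDFS; the path and column list below are made up for illustration, and the same definition is then visible to both Hive and Impala:
CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  cust_id BIGINT,
  state   STRING,
  amount  DOUBLE
)
STORED AS PARQUET
LOCATION '/data/warehouse/sales';
-- The metastore now records the columns, datatypes, format, and HDFS location
SHOW CREATE TABLE sales;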
29. SQL on Hadoop processing engines compared
Data sources – Impala: HDFS, Kudu, cloud storage | Hive: HDFS, cloud storage, HBase | Drill: HDFS, NoSQL, cloud storage | Presto: HDFS, NoSQL, RDBMS
Data model – Impala: relational | Hive: relational | Drill: schema-free JSON | Presto: relational
Metadata – Impala: Hive Metastore | Hive: Hive Metastore | Drill: Hive Metastore (optional) | Presto: Hive Metastore (remote)
Deployment model – Impala: collocated with Hadoop | Hive: collocated with Hadoop | Drill: standalone or collocated with Hadoop | Presto: collocated with Hadoop
SQL – Impala: HiveQL syntax | Hive: HiveQL syntax | Drill: ANSI SQL | Presto: ANSI SQL
Use case – Impala: fast, interactive analytics; high concurrency; large datasets | Hive: data warehouse / reporting analytics; data-intensive, complex queries | Drill: self-service, SQL-based analytics over heterogeneous sources | Presto: interactive analytics on large datasets
30. Is Hadoop going to replace my database?
• Transactional systems (OLTP)
  • Simple - it’s possible
  • Complex (ERP) - no (not anytime soon!)
• Data warehouse / Reporting / Analytics
  • Traditional DW - maybe, but it will take time
  • Big Data - yes
• ETL and data integration
  • Probably!
  • Most ETL tools already support integration with Big Data transformation technologies
31. Where SQL on Hadoop can improve
• Large “fact to fact” table joins
• DML & Updates
• SQL engine sophistication & syntax richness
hive> SELECT SUM(duration)
    > FROM call_detail_records
    > WHERE
    >   type = 'INTERNET'
    >   OR phone_number IN ( SELECT phone_number
    >                        FROM customer_details
    >                        WHERE region = 'R04' );
FAILED: SemanticException [Error 10249]: Line 5:17 Unsupported SubQuery Expression 'phone_number': Only SubQuery expressions that are top level conjuncts are allowed
We RDBMS developers have been spoiled with very sophisticated SQL engines for years :)
Remember, Hadoop evolves very fast!
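One common workaround for that particular Hive limitation at the time was to rewrite the IN subquery as an outer join against a de-duplicated derived table; a sketch using the same tables from the slide (verify against your Hive version):
SELECT SUM(cdr.duration)
FROM call_detail_records cdr
LEFT OUTER JOIN (
  SELECT DISTINCT phone_number
  FROM customer_details
  WHERE region = 'R04'
) cd ON cdr.phone_number = cd.phone_number
WHERE cdr.type = 'INTERNET'
   OR cd.phone_number IS NOT NULL;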
34. Genomics cancer research at Arizona State University
• Challenges storing and processing genomics data constrained the set of possible research questions
• Data discovery was limited
  • Commonly referred to as “lamp-posting”: cancer researchers could only “look where the light was”, though the most interesting research questions had to do with discovering answers still “in the dark”
• Proprietary platforms limited scholarly collaboration and reduced research time
  • Data remained constrained in proprietary systems operated by individual researchers
  • This IT legacy of isolated data siloes hindered future collaboration
Source: https://hortonworks.com/customers/arizona-state-university
35. Store and access all data with Hadoop
• Hortonworks Data Platform with Hadoop stores genomic data, making it broadly accessible at an affordable cost, allowing ASU researchers to:
  • store and process huge amounts of data
  • make that data and tools accessible to others within and outside of the university
  • do it all at a cost that wouldn’t escalate as genomics data grew to petabytes in their cluster
“Your average database of 20 billion rows is simply unapproachable with traditional, standard technology,” said Dr. Kenneth Buetow. “With our Hadoop infrastructure, we can run data-intensive queries of these large-scale resources, and they return results in seconds. This is transformational.”
“My epiphany came when I ran a relatively complex SQL query against a table that had 20 billion rows in it. The query returned results in a minute or two. I was dumbfounded. Prior to using this environment, one never could comprehensively construct the networks. Basically, you couldn’t or didn’t ask those comprehensive questions.”
Source: https://hortonworks.com/customers/arizona-state-university
37. Securus Technologies
• Securus Technologies is a major provider of prison telecommunications and technology services in the United States
• Satellite Tracking of People (STOP) handles offender ankle bracelet device tracking
• Application users are parole/probation officers and law enforcement agencies
• Challenges
  • Database becoming increasingly overloaded, restricting application users
  • IoT geospatial data constantly streams into the overloaded RDBMS
  • Only 2 days of history available to query without extreme wait times
  • Cost to re-platform or expand the database too high
38. 41
• Approach
• Use Gluent Offload to sync 100% of application
data to Hadoop with no ETL
• Use Gluent Transparent Query to allow legacy
applications to access data without code rewrite
• Results and benefits
• Significant decrease in application query response
time
• 365 days of historical data returned in just minutes!
• Analysts began to rethink what’s possible with the
historical data
• Predictive and forensic type analytics, etc.
• No immediate need to re-platform application
• Fixed performance issues with zero code changes!
Securus Technologies
Dramatic performance improvements, huge cost savings, new business opportunities
…without rewriting legacy applications
[Chart annotations: DNF* (*did not finish) vs. using Impala - SQL on Hadoop!]
39. SQL on Hadoop for the Oracle Professional
Thank you for attending!
40. 43
• Email: mrainey@gluent.com
• Twitter: @mRainey
• Website: gluent.com
• Hadoop resources
• https://gluent.com/hadoop-resources
Reach out for more info!
41. 44
"All enterprise data is just a query away"
• Gluent Data Platform
• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise
• You decide when to…
• …write ETL code
• …rewrite applications
• …learn new technologies
Gluent Data Platform
42. 45
Gluent’s hybrid data world
No ETL Data Sync
Transparent Query
Gluent connects all data to all applications!
43. 46
• Selected as a Gartner 2017 Cool Vendor in Data Management
• https://gluent.com/cool-vendor-2017
• Recognized in “The 10 Coolest Big Data Startups of 2017” on CRN
• https://gluent.com/gluent-recognized-in-top-10-coolest-startups-of-2017-on-crn/
• Featured in 2017 Strata Startup Showcase
• https://gluent.com/gluent-strata-startup-showcase/
Industry recognition for Gluent
Editor's Notes
Hadoop is a “huge” topic. With one day, we had to choose what to explain. Where it’s awesome, and where it’s not-so-awesome.
Gluent is all about enterprise data sharing and transparent data virtualization.
Why are we excited about hadoop?
The evolution of big data. How did we get here and why does this training even exist?
10 yrs ago, what was the biggest DB? 5 TB
Today, that’s not “exciting”.
If you work on ERP systems, easy to think that all data lives in tables and is easy to store. Not always true, different businesses - different departments - have different needs.
Why not just load up storage array?
Data moved to computation for processing
Records must move from storage to process
Access to storage is the bottleneck
Running out of space? Expand SAN
Becomes expensive. SAN storage arrays are not cheap.
Two common problems in db performance work
Bad sql statements (optimizer issues)
Storage issues. Usually not a config issue, but the pipes between the database nodes and SAN storage are very small.
Describe NameNode & DataNodes at high-level
NameNode - where does my file - file blocks - live?
DataNode - allows access to locally stored data.
Data doesn’t move for computation. Processing occurs where the data lives
Running out of space? Add new nodes on commodity hardware.
Physically
Data nodes on commodity, inexpensive “pizza boxes”.
Looks like one big filesystem. Get deeper on HDFS later on.
Name node can be compared to ASM instance in Exadata. (TODO)
Not just distributed file system, but distributed computation. Code is distributed as well.
Data file is chunked into blocks
3 replicas of each block will be created
If a node goes down, NameNode ensures each block still has 3 replicas
Can also rebalance if needed.
Software scalability -
past, you had to buy many large, big iron hardware systems.
Now, open source software.
Hadoop ecosystem is organized around open data formats.
Makes your system future proof.
One data, many engines.
DW on Hadoop, not stuck with Hive SQL - you can use any engine you like.
Switching engines when using something like Oracle - export, load into new engine. Takes time.
Workload processing - SQL-like processing in an Oracle example.
RDBMS / enterprise database
Read data. Filter down to what you need. Perform simple processing. More complex processing.
Exadata (similar to Netezza)
Push filtering to storage level. Disk I/O (read) and filtering.
Less data moves from storage to DB. Still perform aggregations, etc in DB.
Hadoop
Not just SQL engine, but parallel processing framework that will push application code close to where the data lives.
All operations/logic occurs in the same place.
Exadata one step closer to hadoop, but not quite.
Load data once, read it many times
Best for large, sequential file scans. Not great for random I/O. MapR has its own file system, MapR-FS, with multi-level namenodes. (..more info needed..)
Kudu can solve this problem as well - doesn’t use HDFS
Cannot modify file, only insert on end. If could, uncompress file, update, recompress? Not good anyways.
You can really put any type of file there, but main formats are these.
Typically start with CSV for tutorials.
Sequence files, binary-like files
Avro - serializing rows and columns
Columnar formats - used when extracting data for analysis. Faster to extract/analyze. Stored in big chunks, organized in columns, often compressed.
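As a sketch of how the storage format is just a clause in the DDL (the table names, columns, and HDFS path below are made up for illustration):

-- An external table over CSV files that already sit in HDFS
CREATE EXTERNAL TABLE calls_csv (
  phone_number STRING,
  call_type    STRING,
  duration     BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/calls_csv';

-- Convert to a columnar format for analytics with a single CTAS
CREATE TABLE calls_parquet STORED AS PARQUET
AS SELECT * FROM calls_csv;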
10 yrs ago, yes. HDFS and MapReduce in v1
Now, MR rarely used in new dev.
Starting with Hadoop today, you run SQL against it, and it’s very good.
Hive, the first SQL engine developed at Facebook.
Converted SQL to MR under the hood to run against distributed filesystem.
Now, many engines! Even Hive evolved to use Spark or Tez rather than MR.
Hive will be described further later on.
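On distributions that ship Tez or Spark, the engine Hive uses under the hood is just a session property; a minimal sketch:

-- Show the current engine (mr, tez, or spark), then switch it for this session
SET hive.execution.engine;
SET hive.execution.engine=tez;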
Hadoop has a large ecosystem.
Similar to redhat linux. Different hadoop vendors.
Not vendor driven, community driven
Ecosystem table - open and show how many components
Show a little about how it works
Discuss some fundamentals and where Hadoop fits and doesn’t work as well.
The big data hype was that it would solve all of your problems and make lots of money.
Works for the folks like Facebook/Yahoo.
But we don’t want all of our SQL developers to learn Java and mapreduce, etc.
The fact you can actually run SQL on Hadoop - that’s the killer app for big data and Hadoop. For example, moving a lot of data from the expensive DW
Or using hadoop as a data sharing back-end.
Impala syntax is pretty powerful.
UDFs that can be written in C++
DEMO: show Impala, add files to HDFS, create tables, etc.
show Hue as well.
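A rough sketch of that demo flow (the HDFS path is made up, and it assumes the call_detail_records table from the earlier example already exists with a matching file format):

-- Load raw files that were copied into HDFS, then query them interactively
LOAD DATA INPATH '/user/demo/raw/cdrs' INTO TABLE call_detail_records;

SELECT phone_number, duration
  FROM call_detail_records
 WHERE type = 'INTERNET'
 LIMIT 10;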
On the front-end, it looks a lot like a database. SQL statements are similar, etc.
On the backend, it’s a bit different.
Every single node has its own local disk - there is no SAN storage.
Hadoop does sequential reads/writes, not random I/O - so I/O bottleneck is not there.
What about processing? I/O throughput is great, now the CPU might be the bottleneck.
So in the ideal MPP world
Run a select statement across all data
Filter and group by. Group by state, not massive amount of records
Thrift server picks one node to connect. The “query master”.
Query master takes SQL and sends it to all nodes on the cluster
Each node only reads local data (usually) - no network bottlenecks
Query master dishes out work so unique data is read on each node.
Each node does scanning and filtering on their own.
Total sum, for aggregate, can be pushed down to the nodes as well.
Each node result set is sent to query master for final aggregation and returned to Hive
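A sketch of the kind of statement being described, with the distributed steps noted as comments (the sales table, its columns, and the filter are hypothetical):

-- Each node scans and filters only its locally stored blocks of 'sales',
-- computes a partial SUM per state, and sends those small partial results
-- to the query master, which merges them and returns the final rows.
SELECT state, SUM(amount) AS total_sales
  FROM sales
 WHERE sale_date >= '2018-01-01'
 GROUP BY state;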
But, life is not always ideal
Scalability of aggregations
Additive - like a Sum, you can sum at lower levels (each node) then aggregate again at the query master.
Works similar to Oracle Parallel Query, at a different level. Each parallel slave can do a sum on its own, then later stage in the query plan it sums the sums.
But what about count distinct? Or Median?
Example - how many different products per customer transaction in a store?
Distincts are not additive. That’s where the scale-out “dream” breaks down. We’ll get back to aggregations
Now we’ve added distinct to the count
So nodes must exchange data to get to this final distinct count.
Each node can order the local dataset, in parallel. Final merge join in the query master, performed serially.
Nodes exchanging data with each other in order to see the full picture is called a ‘shuffle’.
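A sketch contrasting the two cases, using a hypothetical store_sales table:

-- Additive: each node sums its local rows; the query master just sums the sums
SELECT store_id, SUM(amount)
  FROM store_sales
 GROUP BY store_id;

-- Not additive: the same product_id can appear on several nodes, so rows have
-- to be exchanged (shuffled) so each distinct value is counted exactly once
SELECT store_id, COUNT(DISTINCT product_id)
  FROM store_sales
 GROUP BY store_id;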
Fundamentals about joins
Oracle - using shared storage, it doesn’t scale to a thousand nodes
The benefit, all nodes “see” the storage and can scan through all of the nodes to return the data.
Hadoop has a tradeoff. 9 nodes here, with each having a portion of the data. How do we do a join?
Cannot just “join” node1 to node4 and hope all of the data is represented.
First, read all data from small table and broadcast into memory on all of the nodes that will process the query.
Not usually a problem for smaller dimension tables broadcast to join a large fact. But, it could become a problem.
I relate this to the issue of change data capture in ETL.?? maybe
The biggest remaining scalability problem for Hadoop.
Big fact to fact joins - probably won’t run well.
For small table - optimizer will broadcast to each node. But for a large table, the optimizer will need to partition the join.
We can shuffle - read through the first node and run a hash function on the key. Then based on results, send each row to different nodes - selectively.
Partition join
Then all nodes will do the same.
Unfortunately, that’s only half of the story
Shuffling must happen on the other side table as well.
If right side is sales and left is customers, we need to do the same.
Then, the local join can occur.
Lots of CPU and network traffic.
If Hadoop worked like Teradata, this would not be needed. Teradata keeps rows with the same IDs on the same node, so big table joins run better that way. But you have to redistribute data when adding another node.
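As a sketch of the two strategies, Impala even lets you override the planner's choice with join hints (the table names here are made up):

-- Small dimension: ship a full copy of it to every node that holds fact data
SELECT SUM(f.amount)
  FROM sales_fact f
  JOIN /* +BROADCAST */ customer_dim d
    ON f.customer_id = d.customer_id
 WHERE d.region = 'R04';

-- Two large tables: hash-partition (shuffle) both sides on the join key instead
SELECT COUNT(*)
  FROM sales_fact f
  JOIN /* +SHUFFLE */ returns_fact r
    ON f.order_id = r.order_id;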
How do we workaround these issues?
We can pre-join large fact tables, during load or after load and store the copy. A form of denormalization.
Just scan through the columns you need in columnar formats
Approx count distinct - similar Oracle operation
Uses a bitmap for storing an approximate number of distinct values per customer or state or whatever. The magic, they are additive. Each node can count distinct value, then can still add in final aggregation in query master. Gives roughly correct results. Within 1.5% of correct value - so must be able to tolerate deviation. Perfect for exploring data, but then use more intensive query.
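In Impala the approximate version is the NDV() aggregate (Oracle's analogue is APPROX_COUNT_DISTINCT); a sketch with a hypothetical sales table:

-- Exact, forces rows to be exchanged between nodes:
SELECT state, COUNT(DISTINCT customer_id) FROM sales GROUP BY state;

-- Approximate but additive, so each node's partial result can simply be merged:
SELECT state, NDV(customer_id) FROM sales GROUP BY state;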
Hive metastore
Even used by other sql engines on hadoop. It’s like a data dictionary for Hadoop.
How do we know which files in which directory to go after when running “SELECT *” in Hive? Hive metastore.
Impala, Presto, even Spark, can plug into Hive Metastore.
Different from how SQL Server and Oracle work. You cannot share a data dictionary across those two engines.
It’s like having multiple data engines for the same database.
DEMO:
Hive and Impala both show the same metadata!
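A sketch of that demo flow, with a made-up table name: define the table once in Hive, then read the same definition from Impala (INVALIDATE METADATA tells Impala to re-read the shared metastore).

-- In the Hive shell
CREATE TABLE shared_demo (id INT, msg STRING) STORED AS PARQUET;

-- In impala-shell, against the same Hive Metastore
INVALIDATE METADATA shared_demo;
SHOW TABLES LIKE 'shared_demo';
SELECT * FROM shared_demo;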
Visual of next slide, due to animations hiding content
Hive is Hortonworks
Impala is Cloudera
Drill is now sort of MapR
Presto, created by Facebook but Teradata now contributes
A very neat part of open source: companies like Facebook and Yahoo create these, and even companies like Teradata contribute.
Use cases
A little background. Hive used to be slow for short queries because it always used MapReduce behind the scenes.
Fixed on some platforms. Hive LLAP - low latency, so no slow startup overhead. Plus can use Spark or Tez behind the scenes.
Impala - faster analytics. Hive is more functional, has richer SQL syntax, and has partition-wise joins, which Impala doesn’t have.
But, it’s hadoop. You can use Hive for your ETL and complex transformations with user defined functions in a batch job, then use Impala for your BI users - giving them faster response time. “Use the best tool for the job”
Drill is really good for exploratory analytics with its use of JSON files - and no need for a “relational” type schema.
Presto - more modern version of Hive.
Preference for Hadoop vendor will drive decision of what to use.
Hortonworks - Hive or Spark SQL
Cloudera - Impala or Hive
TODO - remove alternative vendor based products like Drill / Presto?
Today? No! too much to migrate, some things cannot be migrated easily at all.
But as far as capabilities of hadoop in the next 4-5 years, some cases yes.
OLTP
MongoDB and Cassandra are already doing it - for simple OLTP systems.
Complex - not at all. RDBMS is here to stay.
DW
Traditional data warehouses, quite possible, but might take some time to take over Teradata, etc.
But, not quite ready yet. Big table joins are still an issue. But it’s coming. Companies like Netflix are using a cloud-based DW, with parquet files in S3.
Big Data - yes. Unless the systems with messages or event feeds started before the advent of big data, these IoT projects are going straight into Hadoop.
ETL and DI
Yeah, probably. A homegrown system, tough to convert to Hadoop. But ETL tools like Informatica or ODI, just point to Hadoop storage and processing apps.
It won’t happen overnight, but it’s moving that way.
Cloud does introduce an interesting shift.
Services such as Google BigQuery might not even be known for traditional enterprises, using ERP systems.
But Marketing firms, they just dump data to the cloud and then run analytics against them.
Onsite - that’s more of where hadoop will rule.
Already discussed some of these.
Hadoop can improve as well. Just to be realistic - we’re not selling Hadoop licenses or anything!
Subquery under OR condition - not supported by Hive or Impala even.
Could rewrite this query, but still an issue. You might not be able to just migrate all data and switch from Oracle PL/SQL to HiveQL.
DEMO - #3EU 3:00
- Same query in Impala
Case study using Hortonworks on open source software?
FYI - cost savings about 5 mill (existing) vs 2 mill (gluent)
TODO
https://www.slideshare.net/AmazonWebServices/how-to-build-a-data-lake-with-aws-glue-data-catalog-abd213r
Introducing…Gluent Data Platform
Support big 3 Hadoop vendors, on-premises and in the cloud
- Many customers are cloud-first - recent customer example
- Or shifting to the cloud. They look for help from us to do so