This document discusses using SQOOP to connect Hadoop and relational databases. It describes four common interoperability scenarios and provides an overview of SQOOP's features. It then focuses on optimizing SQOOP for Oracle databases by discussing how the Quest/Cloudera OraOop extension improves performance by bypassing Oracle parallelism and buffering. The document concludes by recommending best practices for using SQOOP and its extensions.
Apache Sqoop: Unlocking Hadoop for Your Relational Database huguk
Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer
Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
Apache Sqoop: Unlocking Hadoop for Your Relational Database huguk
Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer
Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.
In this Introduction to Apache Sqoop the following topics are covered:
1. Why Sqoop
2. What is Sqoop
3. How Sqoop Works
4. Importing and Exporting Data using Sqoop
5. Data Import in Hive and HBase with Sqoop
6. Sqoop and NoSql data store i.e. MongoDB
7. Resources
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
John Leach Co-Founder and CTO of Splice Machine with 15+ years software development and machine learning experience will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that was traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. This slide deck aims at familiarizing the user with Sqoop and how to effectively use it in real deployments.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.
In this Introduction to Apache Sqoop the following topics are covered:
1. Why Sqoop
2. What is Sqoop
3. How Sqoop Works
4. Importing and Exporting Data using Sqoop
5. Data Import in Hive and HBase with Sqoop
6. Sqoop and NoSql data store i.e. MongoDB
7. Resources
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
John Leach Co-Founder and CTO of Splice Machine with 15+ years software development and machine learning experience will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that was traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. This slide deck aims at familiarizing the user with Sqoop and how to effectively use it in real deployments.
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We’ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we’ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing an framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc.
http://hortonworks.com/hadoop/spark/
Recording:
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=03debab5ba04b34a033dc5c2f03c7967
As the ratio of memory to processing power rapidly evolves, many within the Hadoop community are gravitating towards Apache Spark for fast, in-memory data processing. And with YARN, they use Spark for machine learning and data science use cases along side other workloads simultaneously. This is a continuation of our YARN Ready Series, aimed at helping developers learn the different ways to integrate to YARN and Hadoop. Tools and applications that are YARN Ready have been verified to work within YARN.
Hortonworks Oracle Big Data Integration Hortonworks
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
Presentation for the sudoers Barcelona group 0ct 06 2015, on benchmarking Hadoop with ALOJA open source benchmarking platform. The presentation was mostly a live DEMO, posting some slides for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
How would you build a database to support sustained ingestion of several hundreds of thousands rows per second while running near real-time queries on top?
In this session I will go over some of the technical decisions and trade-offs we applied when building QuestDB, an open source time-series database developed mainly in JAVA, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course.
We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI-SQL. Like lambda architectures, it employs separate compute engines for different workloads - some call this an HTAP database (Hybrid Transactional and Analytical Platform). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store for fast short reads and writes, and short range scans (Apache HBase) and an in-memory, cluster data flow engine for analytics (Apache Spark). It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with a distributed Multi-Value Concurrency Method that provides fine-grained concurrency which is required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, cost-based optimizer, and present the detailed execution of operational queries on HBase, and the detailed execution of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodian. We will end with some roadmap items under development involving new row-based and column-based storage encodings.
Speakers:
Monte Zweben, is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, he was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.
This webinar will show you examples of how to use Amazon EMR to with the MapR Distribution for Hadoop. You will learn how you can free yourself from the heavy lifting required to run Hadoop on-premises, and gain the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.
What we'll learn:
• See a live demonstration of how you can quickly and easily launch your first Hadoop cluster in a few steps.
• Examples of real world applications and customer successes in production
• Best practices for maximizing the benefits of using MapR with AWS.
Optimizing Oracle databases with SSD - April 2014Guy Harrison
Presentation on using Solid State Disk (SSD) with Oracle databases, including the 11GR2 db flash cache and using flash in Exadata. Last given at Collaborate 2014 #clv14.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Key Trends Shaping the Future of Infrastructure.pdf
Hadoop and rdbms with sqoop
1. Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP Guy Harrison Director, R&D Melbourne www.guyharrison.net Guy.harrison@quest.com @guyharrison
9. Options for RDBMS inter-op DBInputFormat: Allows database records to be used as mapper inputs. BUT: Not inherently scalable or efficient For repeated analysis, better to stage in Hadoop Tedious coding of DBWritable classes for each table SQOOP Open source utility provided by Cloudera Configurable command line interface to copy RDBMS->HDFS Support for Hive, Hbase Generates java classes for future M-R tasks Extensible to provide optimized adaptors for specific targets Bi-Directional
10. SQOOP Details SQOOP import Divide table into ranges using primary key max/min Create mappers for each range Mappers write to multiple HDFS nodes Creates text or sequence files Generates Java class for resulting HDFS file Generates Hive definition and auto-loads into HIVE SQOOP export Read files in HDFS directory via MapReduce Bulk parallel insert into database table
11. SQOOP details SQOOP features: Compatible with almost any JDBC enabled database Auto load into HIVE Hbase support Special handling for database LOBs Job management Cluster configuration (jar file distribution) WHERE clause support Open source, and included in Cloudera distributions SQOOP fast paths & plug ins Invokemysqldump, mysqlimport for MySQL jobs Similar fast paths for PostgreSQL Extensibility architecture for 3rd parties (like Quest ) Teradata, Netezza, etc.
12. Working with Oracle SQOOP approach is generic and applicable to all RDBMS However for Oracle, sub-optimal in some respects: Oracle may parallelize and serialize individual mappers Oracle optimizer may decline to use index range scans Oracle physical storage often deliberately not in primary key order (reverse key indexes, hash partitioning, etc) Primary keys often not be evenly distributed Index range scans use single block random reads vs.faster multi-block table scans Index range scans load into Oracle buffer cache Pollutes cache increasing IO for other users Limited help to SQOOP since rows are only read once Luckily, SQOOP extensibility allows us to add optimizations for specific targets
16. Index range scans Hadoop Mapper ID > 0 and ID < MAX/2 Hadoop Mapper ID > MAX/2 Oracle Session Oracle Session Index range scan Buffer cache Index range scan Index block Index block Index block Index block Index block Index block Oracle table
18. Quest/Cloudera OraOop for SQOOP Design goals Partition data based on physical storage By-pass Oracle buffering By-pass Oracle parallelism Do not require or use indexes Never read the same data block more than once Support esoteric datatypes (eventually) Support RAC clusters Availability: Freely available from www.quest.com/ora-oop Packaged with Cloudera Enterprise Commercial support from Quest/Cloudera within Enterprise distribution
21. Extending SQOOP SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing: Extend ManagerFactory (what to handle) Extend ConnManager (DB connection and metadata) For imports: Extend DataDrivenDBInputFormat (gets the data) Data allocation (getSplits()) Split serialization (“io.serializations” property) Data access logic (createDBRecordReader(), getSelectQuery()) Implement progress (nextKeyValue(), getProgress()) Similar procedure for extending exports
22. SQOOP/OraOop best practices Use sequence files for LOBs OR Set inline-lob-limit Directly control datanodes for widest destination bandwidth Can’t rely on mapred.max.maps.per.node Set number of mappers realistically Disable speculative execution (our default) Leads to duplicate DB reads Set Oracle row fetch size extra high Keeps the mappers streaming to HDFS
23. Conclusion RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop SQOOP extensions can provide optimizations for specific targets Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value Try out OraOop for SQOOP!