External tables in Oracle Database can be used to access data stored in Hadoop HDFS. The FUSE project allows HDFS files to be accessed like a normal file system. By mounting HDFS using a FUSE driver, HDFS files can then be accessed via external tables.
Alternatively, Oracle table functions provide a way to fetch data from Hadoop without using external tables. A table function is created that launches a Hadoop MapReduce job using asynchronous shell scripts. The mapper processes write results to a queue, which the table function reads from to return results.
An example implementation is provided using a table function that coordinates a Hadoop job, with mappers writing to a queue and the table function reading from that queue to return the results.
The simplest way to access external files or external data on a file system from within an Oracle
database is through an external table (see the Oracle Database documentation for an introduction
to external tables). External tables present data stored in a file system in a table format and can
be used in SQL queries transparently. External tables could thus potentially be used to access data
stored in HDFS (the Hadoop Distributed File System) from inside the Oracle database. Unfortunately,
HDFS files are not directly accessible through the normal operating system calls that the external
table driver relies on. The FUSE (File system in Userspace) project provides a workaround in this
case. There are a number of FUSE drivers that allow users to mount an HDFS store and treat it like
a normal file system. By using one of these drivers and mounting HDFS on the database instance (on
every instance if this is a RAC database), HDFS files can be easily accessed using the external
table infrastructure.
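As a minimal sketch, assuming HDFS has been mounted through a FUSE driver at a hypothetical mount
point such as /mnt/hdfs and that the HDFS files are plain comma-delimited text, the external table
definition could look like the following (the directory, table, column and part file names are all
illustrative):

CREATE DIRECTORY hdfs_dir AS '/mnt/hdfs/user/oracle/sales';

CREATE TABLE sales_hdfs_ext (
  cust_id NUMBER,
  amount  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('part-00000', 'part-00001')
)
REJECT LIMIT UNLIMITED
PARALLEL;

Once defined, such a table can be queried like any other table, with the external table driver
reading the part files through the FUSE mount.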
Figure 1. Accessing via External Tables with in-database MapReduce
In Figure 1 we are utilizing Oracle Database 11g to implement in-database MapReduce as
described in this article. In general, the parallel execution framework in Oracle Database 11g is
sufficient to run most of the desired operations in parallel directly from the external table.
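For instance, an aggregation over the external table sketched above can be expressed as ordinary
SQL and executed in parallel by the database; the hint and degree of parallelism are illustrative:

SELECT /*+ PARALLEL(s, 8) */ cust_id, SUM(amount)
FROM   sales_hdfs_ext s
GROUP  BY cust_id;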
The external table approach may not be suitable in some cases (say, if FUSE is unavailable).
Oracle table functions provide an alternate way to fetch data from Hadoop. Our attached
example outlines one way of doing this. At a high level, we implement a table function that uses
the DBMS_SCHEDULER framework to asynchronously launch an external shell script that
submits a Hadoop MapReduce job. The table function and the mapper communicate using
Oracle's Advanced Queuing feature. The Hadoop mapper enqueues data into a common queue
while the table function dequeues data from it. Since this table function can be run in parallel,
additional logic is used to ensure that only one of the slaves submits the external job.
Figure 2. Leveraging Table Functions for parallel processing
The queue gives us load balancing, since the table function can run in parallel while the Hadoop
streaming job also runs in parallel, with a different degree of parallelism and outside the
control of Oracle's query coordinator.
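To make the two independent degrees of parallelism concrete: the database-side degree comes from
the cursor passed into the table function, while the Hadoop-side parallelism is controlled in the
job submission. Using the table function and launcher script shown later in this paper, an
illustrative combination would be (the hint and task count are examples only):

-- Database side: the input cursor drives the DOP of the table function
select *
from table(hdfs_reader.read_from_hdfs_file(
       cursor(select /*+ parallel(o, 16) */ * from orders o),
       '/home/hadoop/eq_test4.sh'));

-- Hadoop side: set in the launcher script's streaming job submission, e.g.
-- bin/hadoop jar ./contrib/streaming/hadoop-0.20.0-streaming.jar ... -jobconf mapred.map.tasks=32

Neither setting constrains the other, which is exactly why the queue is needed to balance the two sides.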
We translated the architecture shown in Figure 2 into a real example. Note that our
example only shows a template implementation of using a table function to access data stored
in Hadoop; other, possibly better, implementations are clearly possible.
The following diagrams are a technically more accurate and more detailed representation of the
original schematic in Figure 2, explaining where and how we use the pieces of actual code that
follow:
Figure 3. Starting the Mapper jobs and retrieving data
In step 1 we figure out who gets to be the query coordinator. For this we use a simple
mechanism that writes records with the same key value into a table: the first insert wins and will
function as the query coordinator (QC) for the process. Note that the QC table function
invocation plays a processing role as well.
In step 2 this table function invocation (QC) starts an asynchronous job using DBMS_SCHEDULER
(the job controller in Figure 3), which then runs the synchronous bash script on the Hadoop
cluster. This bash script, called the launcher in Figure 3, starts the mapper processes (step 3) on
the Hadoop cluster.
The mapper processes process data and write to a queue in step 5. In the current example we
chose a queue because it is available cluster-wide. For now we simply chose to write each output
record directly into the queue; better performance can be achieved by batching up the outputs and
then moving them into the queue in larger chunks. Obviously, you can choose various other delivery
mechanisms, including pipes and relational tables.
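For reference, here is a minimal PL/SQL sketch of the enqueue that each mapper output corresponds
to, assuming the HADOOP_MR_QUEUE queue and hadoop_row_obj type created in the scripts below (the
Java mapper performs the equivalent operation over JDBC):

DECLARE
  eopt  dbms_aq.enqueue_options_t;
  mprop dbms_aq.message_properties_t;
  msgid RAW(16);
BEGIN
  -- Buffered messages require IMMEDIATE visibility; this matches the
  -- buffered dequeue performed by the table function.
  eopt.visibility    := DBMS_AQ.IMMEDIATE;
  eopt.delivery_mode := DBMS_AQ.BUFFERED;
  DBMS_AQ.ENQUEUE('HADOOP_MR_QUEUE', eopt, mprop,
                  hadoop_row_obj(1, 2), msgid);
  COMMIT;
END;
/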
Step 6 is then the dequeuing process, which is done by parallel invocations of the table function
running in the database. As these parallel invocations process data, it gets served up to the query
that is requesting the data. The table function leverages both the Oracle data and the data from
the queue, and thereby integrates data from both sources in a single result set for the end user(s).
Figure 4. Monitoring the process
After the Hadoop-side processes (the mappers) are kicked off, the job monitor process keeps an
eye on the launcher script. Once the mappers have finished processing data on the Hadoop
cluster, the bash script finishes, as shown in Figure 4.
The job monitor watches the database scheduler queue and notices that the shell script has
finished (step 7). The job monitor then checks the data queue for remaining data elements (step 8).
As long as data is present in the queue, the table function invocations keep on processing that
data (step 6).
Figure 5. Closing down the processing
When the queue is fully dequeued by the parallel invocations of the table function, the job
monitor terminates the queue (step 9 in Figure 5), ensuring that the table function invocations in
Oracle stop. At that point, all the data has been delivered to the requesting query.
The solution illustrated in Figure 3 through Figure 5 uses the following code. All code is
tested on Oracle Database 11g and a five-node Hadoop cluster. As with most whitepaper text, copy
these scripts into a text editor and ensure the formatting is correct.
This script contains certain setup components. For example, the initial part of the script shows
the creation of the arbitrator table used in step 1 of Figure 3. In the example we use the always
popular OE schema.
connect oe/oe
-- Table to use as locking mechanism for the hdfs reader as
-- leveraged in Figure 3 step 1
DROP TABLE run_hdfs_read;
CREATE TABLE run_hdfs_read(pk_id number, status varchar2(100),
primary key (pk_id));
-- Object type used for AQ that receives the data
CREATE OR REPLACE TYPE hadoop_row_obj AS OBJECT (
a number,
b number);
/
connect / as sysdba
-- system job to launch external script
-- this job is used to eventually run the bash script
-- described in Figure 3 step 3
CREATE OR REPLACE PROCEDURE launch_hadoop_job_async(in_directory
IN VARCHAR2, id number) IS
cnt number;
BEGIN
begin DBMS_SCHEDULER.DROP_JOB ('ExtScript'||id, TRUE);
exception when others then null; end;
-- Run a script
DBMS_SCHEDULER.CREATE_JOB (
job_name => 'ExtScript' || id,
job_type => 'EXECUTABLE',
job_action => '/bin/bash',
number_of_arguments => 1
);
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE ('ExtScript' || id, 1,
in_directory);
DBMS_SCHEDULER.ENABLE('ExtScript' || id);
-- Wait till the job is done. This ensures the hadoop job is completed
loop
select count(*) into cnt
from DBA_SCHEDULER_JOBS where job_name = 'EXTSCRIPT'||id;
dbms_output.put_line('Scheduler Count is '||cnt);
if (cnt = 0) then
exit;
else
dbms_lock.sleep(5);
end if;
end loop;
-- Wait till the queue is empty and then drop it
-- as shown in Figure 5
-- The TF will get an exception and it will finish quietly
loop
select sum(c) into cnt
from
(
select enqueued_msgs - dequeued_msgs c
from gv$persistent_queues where queue_name =
'HADOOP_MR_QUEUE'
union all
select num_msgs+spill_msgs c
from gv$buffered_queues where queue_name =
'HADOOP_MR_QUEUE'
union all
select 0 c from dual
);
if (cnt = 0) then
-- Queue is done. stop it.
DBMS_AQADM.STOP_QUEUE ('HADOOP_MR_QUEUE');
DBMS_AQADM.DROP_QUEUE ('HADOOP_MR_QUEUE');
return;
else
-- Wait for a while
dbms_lock.sleep(5);
end if;
end loop;
END;
/
-- Grants needed to make hadoop reader package work
grant execute on launch_hadoop_job_async to oe;
grant select on v_$session to oe;
grant select on v_$instance to oe;
grant select on v_$px_process to oe;
grant execute on dbms_aqadm to oe;
grant execute on dbms_aq to oe;
connect oe/oe
-- Simple reader package to read a file containing two numbers
CREATE OR REPLACE PACKAGE hdfs_reader IS
-- Return type of pl/sql table function
TYPE return_rows_t IS TABLE OF hadoop_row_obj;
-- Checks if current invocation is serial
FUNCTION is_serial RETURN BOOLEAN;
-- Function to actually launch a Hadoop job
FUNCTION launch_hadoop_job(in_directory IN VARCHAR2, id in out
number) RETURN BOOLEAN;
-- Tf to read from Hadoop
-- This is the main processing code reading from the queue in
-- Figure 3 step 6. It also contains the code to insert into
-- the table in Figure 3 step 1
FUNCTION read_from_hdfs_file(pcur IN SYS_REFCURSOR, in_directory
IN VARCHAR2) RETURN return_rows_t
PIPELINED PARALLEL_ENABLE(PARTITION pcur BY ANY);
END;
/
CREATE OR REPLACE PACKAGE BODY hdfs_reader IS
-- Checks if current process is a px_process
FUNCTION is_serial RETURN BOOLEAN IS
c NUMBER;
BEGIN
SELECT COUNT (*) into c FROM v$px_process WHERE sid =
SYS_CONTEXT('USERENV','SESSIONID');
IF c <> 0 THEN
RETURN false;
ELSE
RETURN true;
END IF;
exception when others then
RAISE;
END;
FUNCTION launch_hadoop_job(in_directory IN VARCHAR2, id IN OUT
NUMBER) RETURN BOOLEAN IS
PRAGMA AUTONOMOUS_TRANSACTION;
instance_id NUMBER;
jname varchar2(4000);
BEGIN
if is_serial then
-- Get id by mixing instance # and session id
id := SYS_CONTEXT('USERENV', 'SESSIONID');
SELECT instance_number INTO instance_id FROM v$instance;
id := instance_id * 100000 + id;
else
-- Get id of the QC
SELECT ownerid into id from v$session where sid =
SYS_CONTEXT('USERENV', 'SESSIONID');
end if;
-- Create a row to 'lock' it so only one person does the job
-- schedule. Everyone else will get an exception
-- This is in Figure 3 step 1
INSERT INTO run_hdfs_read VALUES(id, 'RUNNING');
jname := 'Launch_hadoop_job_async';
-- Launch a job to start the hadoop job
DBMS_SCHEDULER.CREATE_JOB (
job_name => jname,
job_type => 'STORED_PROCEDURE',
job_action => 'sys.launch_hadoop_job_async',
number_of_arguments => 2
);
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE (jname, 1, in_directory);
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE (jname, 2, CAST (id AS
VARCHAR2));
DBMS_SCHEDULER.ENABLE('Launch_hadoop_job_async');
COMMIT;
RETURN true;
EXCEPTION
-- one of my siblings launched the job. Get out quietly
WHEN dup_val_on_index THEN
dbms_output.put_line('dup value exception');
RETURN false;
WHEN OTHERs THEN
RAISE;
END;
FUNCTION read_from_hdfs_file(pcur IN SYS_REFCURSOR, in_directory
IN VARCHAR2) RETURN return_rows_t
PIPELINED PARALLEL_ENABLE(PARTITION pcur BY ANY)
IS
PRAGMA AUTONOMOUS_TRANSACTION;
cleanup BOOLEAN;
payload hadoop_row_obj;
id NUMBER;
dopt dbms_aq.dequeue_options_t;
mprop dbms_aq.message_properties_t;
msgid raw(100);
BEGIN
-- Launch a job to kick off the hadoop job
cleanup := launch_hadoop_job(in_directory, id);
dopt.visibility := DBMS_AQ.IMMEDIATE;
dopt.delivery_mode := DBMS_AQ.BUFFERED;
loop
payload := NULL;
-- Get next row
DBMS_AQ.DEQUEUE('HADOOP_MR_QUEUE', dopt, mprop, payload,
msgid);
commit;
pipe row(payload);
end loop;
exception when others then
if cleanup then
delete run_hdfs_read where pk_id = id;
commit;
end if;
END;
END;
/
This short script is the outside-of-the-database controller shown in Figure 3, steps 3 and 4.
It is a synchronous step that remains on the system for as long as the Hadoop mappers are
running.
#!/bin/bash
cd -HADOOP_HOME-
A="/net/scratch/java/jdk1.6.0_16/bin/java -classpath /home/hadoop:/home/hadoop/ojdbc6.jar StreamingEq"
bin/hadoop fs -rmr output
bin/hadoop jar ./contrib/streaming/hadoop-0.20.0-streaming.jar \
  -input input/nolist.txt -output output -mapper "$A" \
  -jobconf mapred.reduce.tasks=0
For this example, we wrote a small and simple mapper process to be executed on our Hadoop
cluster; undoubtedly more comprehensive mappers exist. This one converts each input string into
two numbers and provides those to the queue in a row-by-row fashion.
// Simplified mapper example for Hadoop cluster
import java.sql.*;
//import oracle.jdbc.*;
//import oracle.sql.*;
import oracle.jdbc.pool.*;
//import java.util.Arrays;
//import oracle.sql.ARRAY;
//import oracle.sql.ArrayDescriptor;
public class StreamingEq {
public static void main (String args[]) throws Exception
{
// connect through driver
OracleDataSource ods = new OracleDataSource();
Connection conn = null;
Statement stmt = null;
ResultSet rset = null;
java.io.BufferedReader stdin;
String line;
int countArr = 0, i;
oracle.sql.ARRAY sqlArray;
oracle.sql.ArrayDescriptor arrDesc;
try {
ods.setURL("jdbc:oracle:thin:oe/oe@$ORACLE_INSTANCE");
conn = ods.getConnection();
}
catch (Exception e){
System.out.println("Got exception " + e);
if (rset != null) rset.close();
if (conn != null) conn.close();
System.exit(0);
return ;
}
System.out.println("connection works conn " + conn);
stdin = new java.io.BufferedReader(new
java.io.InputStreamReader(System.in));
    // The remainder of the mapper (not reproduced here) reads each line from
    // standard input, parses it into two numbers, and enqueues a
    // hadoop_row_obj message into the HADOOP_MR_QUEUE queue.
  }
}
An example of running a select on the system as described above – leveraging the table function
processing – would be:
-- Set up phase for the data queue
execute DBMS_AQADM.CREATE_QUEUE_TABLE('HADOOP_MR_QUEUE', 'HADOOP_ROW_OBJ');
execute DBMS_AQADM.CREATE_QUEUE('HADOOP_MR_QUEUE', 'HADOOP_MR_QUEUE');
execute DBMS_AQADM.START_QUEUE ('HADOOP_MR_QUEUE');
-- Query being executed by an end user or BI tool
-- Note the hint is not a real hint, but a comment
-- to clarify why we use the cursor
select max(a), avg(b)
from table(hdfs_reader.read_from_hdfs_file
(cursor
(select /*+ FAKE cursor to kick start parallelism */ *
from orders), '/home/hadoop/eq_test4.sh'));
As the examples in this paper show, integrating a Hadoop system with Oracle Database 11g is
easy to do.
The approaches discussed in this paper allow customers to stream data directly from Hadoop
into an Oracle query. This avoids fetching the data into a local file system, as well as
materializing it into an Oracle table, before accessing it in a SQL query.