The document describes creating and loading sample data into HAWQ internal tables. First, the retail_demo schema and tables are dropped and recreated. Then, gzipped sample data files are streamed into the corresponding HAWQ tables with the COPY command in psql, and the load is verified by a script that checks the row count of each table. Later sections cover PXF external tables over data in HDFS, collecting statistics on them, and reading Hive tables from HAWQ.
pivotal.io
875 Howard Street, Fifth Floor, San Francisco, CA 94103
DOCUMENT CONTROL

For any questions regarding this document contact:
Name: Seungdon Choi
E-mail: schoi@pivotal.io

Document Revision History

Date       | Version | Description               | Author | Reviewer
-----------+---------+---------------------------+--------+---------
01/02/2015 | 0.1     | Draft for internal review |        |
03/02/2015 | 0.9     | For Distribution          |        |
Big Data and Hadoop remain hot topics in the market, and Hadoop is taking root as a new data source and infrastructure not only at internet companies but in enterprise environments as well. In practice, however, the learning curve of new technologies such as MapReduce for operations and analysis still makes many enterprises hesitant to adopt it.

Pivotal's HAWQ is a SQL-on-Hadoop product that exposes the SQL interface users and developers already know, so Hadoop workloads can be handled as easily and quickly as a traditional data warehouse, accelerating big data projects.

This document walks through hands-on practice with the basics of HAWQ using the Pivotal HD Single Node VM.
Now let's create the HAWQ internal tables and load the sample data.

First, drop and recreate the existing retail_demo schema.

[pivhdsne:hawq_tables]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# drop schema retail_demo;
gpadmin=# \i /pivotal-samples/hawq/hawq_tables/create_hawq_tables.sql
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
Load the sample data into each HAWQ table using the COPY command.
[pivhdsne:hawq_tables]$ cd /home/gpadmin/retail_demo/
[pivhdsne:retail_demo]$ ls -lrt
total 293632
-rw-r--r-- 1 gpadmin gpadmin 590 Jan 30 14:42 categories_dim.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 7760971 Jan 30 14:42 email_addresses_dim.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 17772 Jan 30 14:42 date_dim.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 4646775 Jan 30 14:42 customers_dim.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 53995977 Jan 30 14:42 customer_addresses_dim.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 137780165 Jan 30 14:42 order_lineitems.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 23333203 Jan 30 14:42 products_dim.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 99 Jan 30 14:42 payment_methods.tsv.gz
-rw-r--r-- 1 gpadmin gpadmin 72797064 Jan 30 14:42 orders.tsv.gz
zcat customers_dim.tsv.gz | psql -c "COPY retail_demo.customers_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat categories_dim.tsv.gz | psql -c "COPY retail_demo.categories_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat order_lineitems.tsv.gz | psql -c "COPY retail_demo.order_lineitems_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat orders.tsv.gz | psql -c "COPY retail_demo.orders_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat customer_addresses_dim.tsv.gz | psql -c "COPY retail_demo.customer_addresses_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat email_addresses_dim.tsv.gz | psql -c "COPY retail_demo.email_addresses_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat products_dim.tsv.gz | psql -c "COPY retail_demo.products_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat payment_methods.tsv.gz | psql -c "COPY retail_demo.payment_methods_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat date_dim.tsv.gz | psql -c "COPY retail_demo.date_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
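The load pipeline is just a Unix stream: zcat decompresses to stdout, and COPY ... FROM STDIN consumes the tab-delimited rows. A minimal sketch of the same pattern with a tiny, hypothetical two-row file; the psql step is left commented out because it needs a running HAWQ instance:

```shell
# Build a tiny tab-delimited sample file and compress it (hypothetical data).
printf '1\tCredit Card\n2\tGift Card\n' > /tmp/pm_demo.tsv
gzip -f /tmp/pm_demo.tsv

# zcat streams the decompressed rows to stdout; wc -l confirms both rows survive.
zcat /tmp/pm_demo.tsv.gz | wc -l

# Against a live cluster, the same stream would feed COPY, e.g.:
# zcat /tmp/pm_demo.tsv.gz | psql -c "COPY retail_demo.payment_methods_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
```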
Verify that the data loaded correctly.
[pivhdsne:hawq_tables]$ pwd
/pivotal-samples/hawq/hawq_tables
[pivhdsne:hawq_tables]$ sh ./verify_load_hawq_tables.sh
Table Name | Count
-----------------------------+------------------------
customers_dim_hawq | 401430
categories_dim_hawq | 56
customer_addresses_dim_hawq | 1130639
email_addresses_dim_hawq | 401430
order_lineitems_hawq | 1024158
orders_hawq | 512071
payment_methods_hawq | 5
products_dim_hawq | 698911
-----------------------------+------------------------
HAWQ is essentially the Greenplum engine (an MPP SQL engine built on PostgreSQL 8.2) reimplemented over Hadoop HDFS as a SQL-on-Hadoop engine, so the SQL syntax you used with Greenplum/PostgreSQL works unchanged.
Now let's run a query against HAWQ. The query below computes, from the orders table, the total paid amount and total tax for each billing postal code.

[pivhdsne:hawq_tables]$ psql
psql (8.2.15)
Type "help" for help.
gpadmin=# select billing_address_postal_code, sum(total_paid_amount::float8) as total,
sum(total_tax_amount::float8) as tax
from retail_demo.orders_hawq
group by billing_address_postal_code
order by total desc limit 10;
billing_address_postal_code | total | tax
-----------------------------+-----------+-----------
48001 | 111868.32 | 6712.0992
15329 | 107958.24 | 6477.4944
42714 | 103244.58 | 6194.6748
41030 | 101365.5 | 6081.93
50223 | 100511.64 | 6030.6984
03106 | 83566.41 | 0
57104 | 77383.63 | 3095.3452
23002 | 73673.66 | 3683.683
25703 | 68282.12 | 4096.9272
26178 | 66836.4 | 4010.184
(10 rows)
gpadmin=#
[Example] PXF
This section shows how to create HAWQ PXF external tables. A HAWQ PXF external table lets you read and write datasets stored on Pivotal HD in various native formats (comma-separated, tab-delimited, plain text files, and so on).

First, check the dataset to load.
[pivhdsne:retail_demo]$ hadoop fs -ls /retail_demo
Found 9 items
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/categories_dim
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/customer_addresses_dim
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/customers_dim
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/date_dim
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/email_addresses_dim
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/order_lineitems
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/orders
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/payment_methods
drwxr-xr-x - gpadmin hadoop 0 2015-02-02 13:45 /retail_demo/products_dim
[pivhdsne:retail_demo]$ hadoop fs -ls /retail_demo/categories_dim
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 590 2015-02-02 13:45
/retail_demo/categories_dim/categories_dim.tsv.gz
Check the data:

hadoop fs -cat /retail_demo/categories_dim/categories_dim.tsv.gz | zcat
Now create the external tables. Note that running this query produces a warning because this Fragmenter syntax is deprecated; use the External Table Creation command from http://pivotalhd.docs.pivotal.io/tutorial/getting-started/hawq/pxf-external-tables.html instead.
[pivhdsne:pxf_tables]$ pwd
/pivotal-samples/hawq/pxf_tables
[pivhdsne:pxf_tables]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# \i create_pxf_tables.sql
Let's look at the external-table syntax. The table is defined with the predefined HdfsTextSimple profile. The full list of predefined profiles is available at http://pivotalhd.docs.pivotal.io/doc/2100/webhelp/index.html#topics/PXFInstallationandAdministration.html.
CREATE EXTERNAL TABLE retail_demo.payment_methods_pxf
(
payment_method_id smallint,
payment_method_code character varying(20)
)
LOCATION
('pxf://pivhdsne:50070/retail_demo/payment_methods/payment_methods.tsv.gz?profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER = E'\t');
Check the dictionary to verify that the tables were created properly.

gpadmin=# \dx retail_demo.*_pxf
List of relations
Schema | Name | Type | Owner | Storage
-------------+----------------------------+-------+---------+----------
retail_demo | categories_dim_pxf | table | gpadmin | external
retail_demo | customer_addresses_dim_pxf | table | gpadmin | external
retail_demo | customers_dim_pxf | table | gpadmin | external
retail_demo | date_dim_pxf | table | gpadmin | external
retail_demo | email_addresses_dim_pxf | table | gpadmin | external
retail_demo | order_lineitems_pxf | table | gpadmin | external
retail_demo | orders_pxf | table | gpadmin | external
retail_demo | payment_methods_pxf | table | gpadmin | external
retail_demo | products_dim_pxf | table | gpadmin | external
(9 rows)
Now let's use the external tables to count the rows of the actual files in HDFS.
[pivhdsne:pxf_tables]$ pwd
/pivotal-samples/hawq/pxf_tables
[pivhdsne:pxf_tables]$ sh verify_load_pxf_tables.sh
Table Name | Count
-----------------------------+------------------------
customers_dim_pxf | 401430
categories_dim_pxf | 56
customer_addresses_dim_pxf | 1130639
email_addresses_dim_pxf | 401430
order_lineitems_pxf | 1024158
orders_pxf | 512071
payment_methods_pxf | 5
products_dim_pxf | 698911
-----------------------------+------------------------
Let's rerun the earlier HAWQ query, this time through the external table.

gpadmin=# select billing_address_postal_code,
sum(total_paid_amount::float8) as total,
sum(total_tax_amount::float8) as tax
from retail_demo.orders_pxf
group by billing_address_postal_code
order by total desc limit 10;
billing_address_postal_code | total | tax
-----------------------------+-----------+-----------
48001 | 111868.32 | 6712.0992
15329 | 107958.24 | 6477.4944
42714 | 103244.58 | 6194.6748
41030 | 101365.5 | 6081.93
50223 | 100511.64 | 6030.6984
03106 | 83566.41 | 0
57104 | 77383.63 | 3095.3452
23002 | 73673.66 | 3683.683
25703 | 68282.12 | 4096.9272
26178 | 66836.4 | 4010.184
(10 rows)
Generating statistics

As the example below shows, collecting statistics on a PXF external table can also help the optimizer build a better query plan at SQL execution time.

[pivhdsne:~]$ seq 1 10000000 > /tmp/demo.txt    # generate sample data
[pivhdsne:~]$ hadoop fs -put /tmp/demo.txt /    # load it into Hadoop
[pivhdsne:~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# \timing
Timing is on.
gpadmin=# CREATE EXTERNAL TABLE demo (val INT)    -- create the external table
gpadmin-# LOCATION
('pxf://pivhdsne:50070/demo.txt?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Analyzer=com.pivotal.pxf.plugins.hdfs.HdfsAnalyzer&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
gpadmin-# FORMAT 'TEXT' (DELIMITER = '|');
CREATE EXTERNAL TABLE
Time: 52.876 ms
gpadmin=# select relpages,reltuples from pg_class where relname='demo';    -- check the statistics
relpages | reltuples
----------+-----------
1000 | 1e+06
(1 row)
Time: 137.892 ms
gpadmin=# select val from demo where val=59999;    -- run a query
val
-------
59999
(1 row)
Time: 3840.291 ms
gpadmin=# analyze demo;    -- collect statistics
ANALYZE
Time: 258.789 ms
gpadmin=# select relpages,reltuples from pg_class where relname='demo';
relpages | reltuples
----------+-----------
4096 | 161858
(1 row)
Time: 101.710 ms
gpadmin=# select val from demo where val=59999;
val
-------
59999
(1 row)
Time: 2734.797 ms
gpadmin=#
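The LOCATION URL in the CREATE EXTERNAL TABLE above strings four PXF plugin classes together as query parameters joined by `&`, which is easy to mangle when copying from slides. As a sanity check, here is a small shell sketch that rebuilds that URL from its parts; the host, path, and class names are exactly the ones used in the example above:

```shell
# Rebuild the PXF location URL from the statistics example piece by piece.
PXF_HOST="pivhdsne:50070"
PXF_PATH="demo.txt"
PKG="com.pivotal.pxf.plugins.hdfs"
PARAMS="Fragmenter=$PKG.HdfsDataFragmenter"
PARAMS="$PARAMS&Analyzer=$PKG.HdfsAnalyzer"
PARAMS="$PARAMS&Accessor=$PKG.TextFileAccessor"
PARAMS="$PARAMS&Resolver=$PKG.TextResolver"
URL="pxf://$PXF_HOST/$PXF_PATH?$PARAMS"
echo "$URL"
```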
[Example] PXF
)
-- PARTITIONED BY (Order_Datetime timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/retail_demo/order_lineitems/';
hive> select count(*) from retail_demo.customers_dim_hive;
Total MapReduce jobs = 1
..
2 seconds 940 msec
Ended Job = job_1370914856264_0009
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.94 sec HDFS Read: 4646997 HDFS Write: 7
SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 940 msec
OK
401430
Time taken: 20.03 seconds
(2) Now create an external table in HAWQ and read this Hive table.
[pivhdsne:~]$ psql
psql (8.2.15)
Type "help" for help.
CREATE EXTERNAL TABLE retail_demo.order_lineitems_hive
(
Order_ID text
, Order_Item_ID bigint
, Product_ID int
, Product_Name text
, Customer_ID int
, Store_ID int
, Item_Shipment_Status_Code text
, Order_Datetime timestamp
, Ship_Datetime timestamp
, Item_Return_Datetime timestamp
, Item_Refund_Datetime timestamp
, Product_Category_ID int
, Product_Category_Name text
, Payment_Method_Code text
, Tax_Amount float8
, Item_Quantity int
, Item_Price float8
, Discount_Amount float8
, Coupon_Code text
, Coupon_Amount float8
, Ship_Address_Line1 text
, Ship_Address_Line2 text
, Ship_Address_Line3 text
, Ship_Address_City text
, Ship_Address_State text
, Ship_Address_Postal_Code text
, Ship_Address_Country text
, Ship_Phone_Number text
, Ship_Customer_Name text
, Ship_Customer_Email_Address text
, Ordering_Session_ID text
, Website_URL text
)
LOCATION ('pxf://pivhdsne:50070/retail_demo.order_lineitems_hive?PROFILE=hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
gpadmin=# select count(*) from retail_demo.order_lineitems_hive;
count
---------
1024158
(1 row)