
Big Data: Getting started with Big SQL self-study guide

Self-study exercises for IBM's Big SQL technology for querying Hadoop data in stored HDFS files, the Hive warehouse, or HBase.



  1. 1. IBM
     Getting started with Big SQL 4.1
     Cynthia M. Saracco, IBM Solution Architect
     September 23, 2015
  2. 2. Page 2 of 91 Contents
     LAB 1  OVERVIEW (page 4)
        1.1  What you'll learn (4)
        1.2  About your environment (5)
        1.3  Getting started (6)
     LAB 2  EXPLORING YOUR BIG SQL SERVICE THROUGH AMBARI (7)
        2.1  Inspecting your cluster status (7)
        2.2  Exploring your Big SQL service (11)
        2.3  Optional: Exploring your distributed file system (13)
     LAB 3  USING THE BIG SQL COMMAND LINE INTERFACE (JSQSH) (16)
        3.1  Understanding JSqsh connections (16)
        3.2  Getting help for JSqsh (20)
        3.3  Executing basic Big SQL statements (20)
        3.4  Optional: Exploring additional JSqsh commands (23)
     LAB 4  QUERYING STRUCTURED DATA (27)
        4.1  Creating sample tables and loading sample data (27)
        4.2  Querying the data with Big SQL (33)
        4.3  Creating and working with views (36)
        4.4  Populating a table with 'INSERT INTO … SELECT' (37)
        4.5  Optional: Storing data in an alternate file format (Parquet) (38)
        4.6  Optional: Working with external tables (39)
        4.7  Optional: Creating and querying the full sample database (42)
     LAB 5  WORKING WITH COMPLEX DATA TYPES (43)
        5.1  Working with the ARRAY type (43)
        5.2  Working with the MAP type (45)
        5.3  Working with the STRUCT type (48)
        5.4  Optional: Exploring additional scenarios (49)
     LAB 6  WORKING WITH NON-TRADITIONAL DATA USING SERDES (53)
        6.1  Registering a SerDe (53)
        6.2  Creating, populating, and querying a table that uses a SerDe (54)
     LAB 7  UNDERSTANDING AND INFLUENCING DATA ACCESS PLANS (57)
        7.1  Collecting statistics with the ANALYZE TABLE command (57)
        7.2  Understanding your data access plan (EXPLAIN) (58)
     LAB 8  DEVELOPING AND EXECUTING SQL USER-DEFINED FUNCTIONS (63)
        8.1  Understanding UDFs (63)
        8.2  Preparing JSqsh to create and execute UDFs (64)
        8.3  Creating and executing a scalar UDF (64)
        8.4  Optional: Invoking UDFs without providing a fully-qualified name (66)
        8.5  Incorporating IF/ELSE statements (68)
        8.6  Incorporating WHILE loops (70)
        8.7  Incorporating FOR loops (71)
        8.8  Creating a table UDF (72)
        8.9  Optional: Overloading UDFs and dropping UDFs (73)
     LAB 9  OPTIONAL: USING SQUIRREL SQL CLIENT (76)
        9.1  Connecting to Big SQL (76)
        9.2  Exploring your Big SQL database (81)
        9.3  Issuing queries and inspecting results (84)
        9.4  Charting query results (87)
     LAB 10 SUMMARY (90)
  3. 3. Page 4 of 91 Lab 1 Overview
     In this hands-on lab, you'll learn how to work with Big SQL, an industry-standard SQL interface for big data managed in supported Apache Hadoop platforms. Big SQL enables IT professionals to create tables and query data in Hadoop using familiar SQL statements. To do so, programmers use standard SQL syntax and, in some cases, SQL extensions created by IBM to make it easy to exploit certain Hadoop-based technologies. Big SQL shares query compiler technology with DB2 (a relational DBMS) and offers a wide breadth of SQL capabilities.
     Organizations interested in Big SQL often have considerable SQL skills in-house as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools — and perhaps reuse portions of existing applications — can be quite appealing to organizations new to Hadoop. Indeed, some companies with large data warehouses built on relational DBMSs are looking to Hadoop-based platforms as a potential target for offloading "cold" or infrequently used data in a manner that still allows for query access. In other cases, organizations turn to Hadoop to analyze and filter non-traditional data (such as logs, sensor data, and social media posts), ultimately feeding subsets or aggregations of this information to their relational warehouses to extend their view of products, customers, or services.
     1.1. What you'll learn
     After completing all exercises in this lab guide, you'll know how to:
     • Inspect the status of your Big SQL service through Apache Ambari, a Web-based management tool. (Ambari is included with the IBM Open Platform for Apache Hadoop.)
     • Create a connection to your Big SQL server from a command line environment (JSqsh).
     • Execute Big SQL statements and commands.
     • Create Big SQL tables stored in the Hive warehouse and in user-specified directories of your Hadoop Distributed File System (HDFS).
     • Load data into Big SQL tables.
     • Query big data using Big SQL projections, restrictions, joins, and other operations.
     • Store complex data types in Big SQL tables.
     • Use SerDes (serializers/deserializers) to manage non-traditional data in Big SQL tables.
     • Gather statistics about your tables and explore data access plans for your queries.
     • Create and execute SQL-based scalar and table user-defined functions.
     • Use Squirrel SQL Client, an open source SQL tool, to explore your Big SQL database, execute queries, and chart query results.
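As a preview of the kind of statements you'll practice throughout these labs, here is a small sketch. The table and column names below are illustrative only and are not part of the lab scripts:

```sql
-- Create a simple Big SQL table over delimited text data in the Hive warehouse
CREATE HADOOP TABLE sales_preview
   (id INT, product VARCHAR(30), amount DECIMAL(8,2))
   ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Query it with familiar SQL
SELECT product, SUM(amount) AS total
FROM sales_preview
GROUP BY product;
```

Each lab that follows builds on this basic pattern, adding data loading, complex types, SerDes, access-plan tuning, and user-defined functions.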
  4. 4. Page 5 of 91
     Allow 6 – 7 hours to complete all sections of this lab.
     Separate labs are available on using Big SQL with HBase (http://www.slideshare.net/CynthiaSaracco/h-base-introlabv4) and using Spark to access Big SQL data (http://www.slideshare.net/CynthiaSaracco/big-data-49411435).
     Special thanks to Uttam Jain, Carlos Renteria, and Raanon Reutlinger for their contributions to earlier versions of this lab. Thanks also to Nailah Bissoon and Daniel Kikuchi for their reviews.
     1.2. About your environment
     This lab requires an appropriate Hadoop environment supported by Big SQL 4.1, such as BigInsights Quick Start Edition, BigInsights Data Analyst, or BigInsights Data Scientist. Big SQL and all its pre-requisite services must be installed and running to complete this lab.
     Examples in this lab are based on a sample multi-node cluster with the configuration shown in the tables below. If your environment is different, modify the sample code and instructions as needed to match your configuration.

     Account                  User      Password
     Root account             root      password
     Big SQL Administrator    bigsql    bigsql
     Ambari Administrator     admin     admin
     Knox Gateway account     guest     guest-password

     Property                         Value
     Host name                        myhost.ibm.com
     Ambari port number               8080
     Big SQL database name            bigsql
     Big SQL port number              32051
     Big SQL installation directory   /usr/ibmpacks/bigsql
     JSqsh installation directory     /usr/ibmpacks/common-utils/current/jsqsh
     Big SQL samples directory        /usr/ibmpacks/bigsql/4.1/bigsql/samples/data
     BigInsights Home                 https://myhost.ibm.com:8443/gateway/default/BigInsightsWeb/index.html#/welcome
  5. 5. Page 6 of 91
     About the screen captures, sample code, and environment configuration
     Screen captures in this lab depict examples and results that may vary from what you see when you complete the exercises. In addition, some code examples may need to be customized to match your environment.
     1.3. Getting started
     To get started with the lab exercises, you need access to a working Big SQL environment and a secure shell (command window). A free BigInsights Quick Start Edition is available for download from IBM at http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html.
     This lab was tested against a native BigInsights installation on a multi-node cluster. Although the IBM Analytics for Hadoop cloud service on Bluemix includes Big SQL, it does not support SSH access. As a result, certain examples in this lab -- such as those involving JSqsh (the Big SQL command line interface) and DFS commands -- cannot be executed using that Bluemix service. Therefore, if you want to follow all examples in this lab, you should install and configure BigInsights on your own cluster. Consult the instructions in the product's Knowledge Center (http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.1.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html).
     Before continuing with this lab, verify that Big SQL and all its pre-requisite services are running. If you have any questions or need help getting your environment up and running, visit Hadoop Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to the forum. You cannot proceed with subsequent lab exercises without access to a working environment.
  6. 6. Page 7 of 91 Lab 2 Exploring your Big SQL service through Ambari Administrators can monitor, launch, and inspect aspects of their Big SQL service through Ambari, an open source Web-based tool for managing Hadoop clusters. In this section, you will learn how to  Launch Ambari.  Inspect the overall status of your cluster.  Inspect the configuration and status of your Big SQL service. Allow 30 minutes to complete this section. 2.1. Inspecting your cluster status In this exercise, you will launch Apache Ambari and verify that a minimal set of services are running so that you can begin working with Big SQL. You will also learn how to stop and start a service. __1. Launch a Web browser. __2. Enter the URL for your Ambari service, which was configured at installation. For example, if the host running your Ambari server is myhost.ibm.com and the server was installed at its default port of 8080, you would enter http://myhost.ibm.com:8080 __3. When prompted, enter the Ambari administrator ID and password. (By default, this is admin/admin). If your Web browser returns an error instead of a sign in screen, verify that the Ambari server has been started and that you have the correct URL for it. If needed, launch Ambari manually; log into the node containing the Ambari server as root and issue this command: ambari-server start
  7. 7. Page 8 of 91 __4. Verify that the Ambari console appears similar to this: __5. If necessary, click the Dashboard tab at top and inspect the status of installed services. The previous screen capture was taken from a system in which several open source components provided in the IBM Open Platform for Apache Hadoop had been installed and launched. The Big SQL service was also launched. __6. Click on the Hive service in the panel at left. __7. Note that detailed information about the selected service appears.
  8. 8. Page 9 of 91 Hive is one of several underlying Hadoop services used in this lab. __8. Click on the Oozie service. __9. Since Oozie isn't required for this lab, stop the service. To do so, expand the Services Actions drop-down menu in the upper right corner and select Stop.
  9. 9. Page 10 of 91
     __10. When prompted, click Confirm Stop.
     __11. Monitor the operation's status in the pop-up window that appears.
     __12. When the operation completes, click the OK button at the bottom of the pop-up window to close it.
     __13. Return to the Ambari display and verify that the Oozie service has stopped, as indicated by the red triangle next to the service.
  10. 10. Page 11 of 91
     __14. Optionally, start the Oozie service. From the Services Actions menu, select Start. Click OK when prompted and allow the process to complete. Click OK again.
     __15. Confirm that the following services (at a minimum) are operational on your cluster: HDFS, MapReduce2, YARN, Hive, Knox, Big SQL.
     __16. Return to the main Ambari dashboard.
     2.2. Exploring your Big SQL service
     Let's explore the configuration of your Big SQL service, including the locations of your Big SQL Head Node and JSqsh.
     __1. In Ambari, click on the BigInsights - Big SQL service to display details about it.
  11. 11. Page 12 of 91 __2. Examine the overview information presented in the Summary tab. In the previous screen capture, you'll note that the following Big SQL services are running: 1 Big SQL Head, 1 Big SQL Secondary Head, and 1 Big SQL Worker. (A production cluster would typically have many Big SQL Workers running.) __3. Click on the Hosts tab towards the upper right of your screen. A summary of the nodes in your cluster is displayed. The image below was taken from a 2-node cluster. __4. One at a time, click on each host name at left. __5. Inspect the summary display, verifying that JSqsh is present on at least 1 node in your cluster. (In a subsequent lab, you will open a terminal window on this node and launch JSqsh.) __6. If desired, expand the Components information for each node to inspect the installed services. Partial information from a node containing the Big SQL Head and Big SQL Secondary Head is shown below. Click OK when done.
  12. 12. Page 13 of 91
     __7. Optionally, explore the details about the software installed on your cluster. In the upper right corner, click Admin > Stack and Versions. Note that the BigInsights - Big SQL service shown here is version 4.1.
     2.3. Optional: Exploring your distributed file system
     Occasionally, you may want to explore the contents of your distributed file system (DFS). Many programmers issue shell commands to inspect and manipulate their DFS. In subsequent lab modules, you'll issue a few of these commands yourself. However, this optional exercise introduces you to a Web-based interface for navigating your DFS using the NameNode UI, which is accessible from Ambari.
  13. 13. Page 14 of 91 __1. From the Ambari dashboard, click on the HDFS service. __2. From the Quick Links drop-down menu, select NameNode UI. Depending on your security configuration, you may be prompted to confirm access to the site. If so, agree to proceed. Accessing the NameNode UI outside of Ambari You can also access the NameNode UI directly from any browser by entering the host name and port number specified for this service at installation. The default port is generally 50070, so if your name node was installed on myhost.ibm.com, you would enter http://myhost.ibm.com:50070. __3. Verify that your display is similar to this: __4. In the menu at top, click the arrow key next to the Utilities tab to expose a drop-down menu. Select Browse the file system.
  14. 14. Page 15 of 91 __5. Navigate through the DFS directory tree, clicking on the directory of your choice to investigate its contents. Note that you may encounter access restrictions if you attempt to browse directories for which you lack authority to read. Here's an example of the contents of the /tmp directory on a test cluster.
  15. 15. Page 16 of 91 Lab 3 Using the Big SQL command line interface (JSqsh) BigInsights supports a command-line interface for Big SQL through the Java SQL Shell (JSqsh, pronounced "jay-skwish"). JSqsh is an open source project for querying JDBC databases. In this section, you will learn how to  Launch JSqsh.  Issue Big SQL queries.  Issue popular JSqsh commands to get help, retrieve your query history, and perform other functions. Allow 30 minutes to complete this section. 3.1. Understanding JSqsh connections To issue Big SQL commands from JSqsh, you need to define a connection to a Big SQL server. __1. On a node in your cluster in which JSqsh has been installed, open a terminal window. __2. Launch the JSqsh shell. If JSqsh was installed at its default location of /usr/ibmpacks/common-utils/current/jsqsh, you would invoke the shell with this command: /usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh __3. If this is your first time invoking the shell, a welcome screen will display. A subset of that screen is shown below. __4. If prompted to press Enter to continue, do so.
  16. 16. Page 17 of 91 __5. When prompted, enter c to launch the connection wizard. In the future, you can enter setup connections in the JSqsh shell to invoke this wizard. __6. Inspect the list of drivers displayed by the wizard, and note the number for the db2 driver (not the db2zos driver). Depending on the size of your command window, you may need to scroll up to see the full list of drivers. In the screen capture below, which includes a partial list of available drivers, the correct DB2 driver is 1. The order of your drivers may differ, as pre-installed drivers are listed first.
  17. 17. Page 18 of 91 About the driver selection You may be wondering why this lab uses the DB2 driver rather than the Big SQL driver. In 2014, IBM released a common SQL query engine as part of its DB2 and BigInsights offerings. Doing so provided for greater SQL commonality across its relational DBMS and Hadoop-based offerings. It also brought a greater breadth of SQL function to Hadoop (BigInsights) users. This common query engine is accessible through the "DB2" driver listed. The Big SQL driver remains operational and offers connectivity to an earlier, BigInsights-specific SQL query engine. This lab focuses on using the common SQL query engine. __7. At the prompt line, enter the number of the DB2 driver. __8. The connection wizard displays some default values for the connection properties and prompts you to change them. (Your default values may differ from those shown below.) __9. Change each variable as needed, one at a time. To do so, enter the variable number and specify a new value when prompted. For example, to change the value of the password variable (which is null by default) (1) Specify variable number 5 and hit Enter. (2) Enter the password value and hit Enter. (3) Inspect the variable settings that are displayed again to verify your change.
  18. 18. Page 19 of 91
     Repeat this process as needed for each variable that needs to be changed. In particular, you may need to change the values for the db (database), port, and server variables. After making all necessary changes, the variables should reflect values that are accurate for your environment. In particular, the server property must correspond to the location of the Big SQL Head Node in your cluster.
     Here is an example of a connection created for the bigsql user account (password bigsql) that will connect to the database named bigsql at port 32051 on the bdvs1052.svl.ibm.com server.
     The Big SQL database name is defined during the installation of BigInsights. The default is bigsql. In addition, a Big SQL database administrator account is also defined at installation. This account has SECADM (security administration) authority for Big SQL. By default, that user account is bigsql.
     __10. When prompted, enter t to test your configuration.
     __11. Verify that the test succeeded, and hit Enter.
     __12. Save your connection. Enter s, name your connection bigsql, and hit Enter.
     __13. Finally, quit the connection wizard when prompted. (Enter q.)
  19. 19. Page 20 of 91 You must resolve any connection errors before continuing with this lab. If you have any questions, visit Hadoop Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to the forum. 3.2. Getting help for JSqsh Now that you’re familiar with JSqsh connections, you’re ready to work further with the shell. __1. From JSqsh, type help to display a list of available help categories. __2. Optionally, type help commands to display help for supported commands. A partial list of supported commands is displayed on the initial screen. Press the space bar to display the next page or q to quit the display of help information. 3.3. Executing basic Big SQL statements In this section, you will execute simple JSqsh commands and Big SQL queries so that you can become familiar with the JSqsh shell.
  20. 20. Page 21 of 91
     __1. From the JSqsh shell, connect to your Big SQL server using the connection you created in a previous lesson. Assuming you named your connection bigsql, enter this command: connect bigsql
     __2. Type show tables -e | more to display essential information about all available tables one page at a time. If you're working with a newly installed Big SQL server, your results will appear similar to those below.
     __3. Next, cut and paste the following command into JSqsh to create a simple Hadoop table: create hadoop table test1 (col1 int, col2 varchar(5)); Because you didn't specify a schema name for the table, it was created in your default schema, which is the user name specified in your JDBC connection. This is equivalent to create hadoop table yourID.test1 (col1 int, col2 varchar(5)); where yourID is the user name for your connection. In an earlier lab exercise, you created a connection using the bigsql user ID, so your table is BIGSQL.TEST1.
     We've intentionally created a very simple Hadoop table for this exercise so that you can concentrate on working with JSqsh. Later, you'll learn more about CREATE TABLE options supported by Big SQL, including the LOCATION clause of CREATE TABLE. In these examples, where LOCATION is omitted, the default Hadoop directory path for these tables is /…/hive/warehouse/<schema>.db/<table>.
     Big SQL enables users with appropriate authority to create their own schemas by issuing a command such as create schema if not exists testschema; Authorized users can then create tables in that schema as desired. Furthermore, users can also create a table in a different schema; if that schema doesn't already exist, it will be created implicitly.
     __4. Display all tables in the current schema with the tables command. tables
  21. 21. Page 22 of 91
     This screen capture was taken from an environment in which only 1 Big SQL table was created (BIGSQL.TEST1).
     __5. Optionally, display information about tables and views created in other schemas, such as the SYSCAT schema used for the Big SQL catalog. Specify the schema name in upper case since it will be used directly to filter the list of tables. tables -s SYSCAT Partial results are shown below.
     __6. Insert a row into your table. insert into test1 values (1, 'one'); This form of the INSERT statement (INSERT INTO … VALUES …) should be used for test purposes only because the operation will not be parallelized on your cluster. To populate a table with data in a manner that exploits parallel processing, use the Big SQL LOAD command, INSERT INTO … SELECT FROM statement, or CREATE TABLE AS … SELECT statement. You’ll learn more about these commands later.
     __7. To view the metadata about a table, use the describe command with the fully qualified table name in upper case. describe BIGSQL.TEST1
     __8. Optionally, query the system for metadata about this table: select tabschema, colname, colno, typename, length from syscat.columns where tabschema = USER and tabname = 'TEST1';
  22. 22. Page 23 of 91 Once again, notice that we used the table name in upper case in these queries and describe command. This is because table and column names are folded to upper case in the system catalog tables. You can split the query across multiple lines in the JSqsh shell if you'd like. Whenever you press Enter, the shell will provide another line for you to continue your command or SQL statement. A semi-colon or go command causes your SQL statement to execute. In case you're wondering, SYSCAT.COLUMNS is one of a number of views supplied over system catalog tables automatically maintained for you by the Big SQL service. __9. Issue a query that restricts the number of rows returned to 5. For example, select the first 5 rows from SYSCAT.TABLES: select tabschema, tabname from syscat.tables fetch first 5 rows only; Restricting the number of rows returned by a query is a useful development technique when working with large volumes of data. __10. Leave JSqsh open so you can explore additional features in the next section. 3.4. Optional: Exploring additional JSqsh commands If you plan to use JSqsh frequently, it's worth exploring some additional features. This optional lab shows you how to recall previous commands, redirect output to local files, and execute scripts. __1. Review the history of commands you recently executed in the JSqsh shell. Type history and Enter. Note that previously run statements are prefixed with a number in parentheses. You can reference this number in the JSqsh shell to recall that query. __2. Enter !! (two exclamation points, without spaces) to recall the previously run statement. In the example below, the previous statement selects the first 5 rows from SYSCAT.TABLES. To run the statement, type a semi-colon on the following line.
  23. 23. Page 24 of 91
     __3. Recall a previous SQL statement by referencing the number reported via the history command. For example, if you wanted to recall the 4th statement, you would enter !4. After the statement is recalled, add a semi-colon to the final line to run the statement.
     __4. Experiment with JSqsh’s ability to support piping of output to an external program. Enter the following two lines on the command shell: select tabschema, tabname from syscat.tables go | more The go statement in the second line causes the query on the first line to be executed. (Note that there is no semi-colon at the end of the SQL query on the first line. The semi-colon is a Big SQL shortcut for the JSqsh go command.) The | more clause causes the output that results from running the query to be piped through the Unix/Linux more command to display one screen of content at a time. Your results should look similar to this:
  24. 24. Page 25 of 91 Since there are more than 400 rows to display in this example, enter q to quit displaying further results and return to the JSqsh shell. __5. Experiment with JSqsh’s ability to redirect output to a local file rather than the console display. Enter the following two lines on the command shell, adjusting the path information on the final line as needed for your environment: select tabschema, colname, colno, typename, length from syscat.columns where tabschema = USER and tabname= 'TEST1' go > $HOME/test1.out This example directs the output of the query shown on the first line to the output file test1.out in your user's home directory. __6. Exit the shell: quit __7. From a terminal window, view the output file: cat $HOME/test1.out __8. Invoke JSqsh using an input file containing Big SQL commands to be executed. Maintaining SQL script files can be quite handy for repeatedly executing various queries. __a. From the Unix/Linux command line, use any available editor to create a new file in your local directory named test.sql. For example, type vi test.sql __b. Add the following 2 queries into your file select tabschema, tabname from syscat.tables fetch first 5 rows only; select tabschema, colname, colno, typename, length from syscat.columns fetch first 10 rows only; __c. Save your file (hit ‘esc’ to exit INSERT mode then type :wq) and return to the command line. __d. Invoke JSQSH, instructing it to connect to your Big SQL database and execute the contents of the script you just created. (You may need to adjust the path or user information shown below to match your environment.)
Page 26 of 91

/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh bigsql < test.sql

In this example, bigsql is the name of the database connection you created in an earlier lab.

__e. Inspect the output. As you will see, JSqsh executes each instruction and displays its output.

__9. Finally, clean up your database. Launch the JSqsh shell again and issue this command:

drop table test1;

There's more to JSqsh than this short lab can cover. Visit the JSqsh wiki (https://github.com/scgray/jsqsh/wiki) to learn more.
Page 27 of 91

Lab 4 Querying structured data

In this lab, you will execute Big SQL queries to investigate data stored in Hadoop. Big SQL provides broad SQL support based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data that is stored in Hadoop in the same way that you access relational databases from your enterprise applications. Multiple queries can be executed concurrently. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions.

This tutorial uses sales data from a fictional company that sells outdoor products to retailers and directly to consumers through a web site. The firm maintains its data in a series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In this lab, you will explore how to create, populate, and query a subset of the star schema database to investigate the company's performance and offerings. Note that Big SQL provides scripts to create and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use fewer than 10 of these tables in this lab.

Prior to starting this lab, you must know how to connect to your Big SQL server and execute SQL from a supported tool. If necessary, complete the prior lab on JSqsh before proceeding.

After you complete the lessons in this module, you will understand how to:

 Create Big SQL tables that use Hadoop text file and Parquet file formats.
 Populate Big SQL tables from local files and from the results of queries.
 Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular expressions.
 Create and query a view based on multiple Big SQL tables.

Allow 1.5 hours to complete this lab.

4.1. Creating sample tables and loading sample data

In this lesson, you will create several sample tables and load data into these tables from local files.

__1.
Determine the location of the sample data in your local file system and make a note of it. You will need to use this path specification when issuing LOAD commands later in this lab. Subsequent examples in this section presume your sample data is in the /usr/ibmpacks/bigsql/4.1/bigsql/samples/data directory. This is the location of the data in typical Big SQL installations.

__2. If necessary, launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

__3. Create several tables in your default schema. Issue each of the following CREATE TABLE statements one at a time, and verify that each completed successfully:

-- dimension table for region info
CREATE HADOOP TABLE IF NOT EXISTS go_region_dim

Page 28 of 91

( country_key INT NOT NULL
, country_code INT NOT NULL
, flag_image VARCHAR(45)
, iso_three_letter_code VARCHAR(9) NOT NULL
, iso_two_letter_code VARCHAR(6) NOT NULL
, iso_three_digit_code VARCHAR(9) NOT NULL
, region_key INT NOT NULL
, region_code INT NOT NULL
, region_en VARCHAR(90) NOT NULL
, country_en VARCHAR(90) NOT NULL
, region_de VARCHAR(90), country_de VARCHAR(90)
, region_fr VARCHAR(90), country_fr VARCHAR(90)
, region_ja VARCHAR(90), country_ja VARCHAR(90)
, region_cs VARCHAR(90), country_cs VARCHAR(90)
, region_da VARCHAR(90), country_da VARCHAR(90)
, region_el VARCHAR(90), country_el VARCHAR(90)
, region_es VARCHAR(90), country_es VARCHAR(90)
, region_fi VARCHAR(90), country_fi VARCHAR(90)
, region_hu VARCHAR(90), country_hu VARCHAR(90)
, region_id VARCHAR(90), country_id VARCHAR(90)
, region_it VARCHAR(90), country_it VARCHAR(90)
, region_ko VARCHAR(90), country_ko VARCHAR(90)
, region_ms VARCHAR(90), country_ms VARCHAR(90)
, region_nl VARCHAR(90), country_nl VARCHAR(90)
, region_no VARCHAR(90), country_no VARCHAR(90)
, region_pl VARCHAR(90), country_pl VARCHAR(90)
, region_pt VARCHAR(90), country_pt VARCHAR(90)
, region_ru VARCHAR(90), country_ru VARCHAR(90)
, region_sc VARCHAR(90), country_sc VARCHAR(90)
, region_sv VARCHAR(90), country_sv VARCHAR(90)
, region_tc VARCHAR(90), country_tc VARCHAR(90)
, region_th VARCHAR(90), country_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- dimension table tracking method of order for the sale (e.g., Web, fax)
CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key INT NOT NULL
, order_method_code INT NOT NULL
, order_method_en VARCHAR(90) NOT NULL
, order_method_de VARCHAR(90), order_method_fr VARCHAR(90)
, order_method_ja VARCHAR(90), order_method_cs VARCHAR(90)
, order_method_da VARCHAR(90), order_method_el VARCHAR(90)
, order_method_es VARCHAR(90), order_method_fi VARCHAR(90)
, order_method_hu VARCHAR(90), order_method_id VARCHAR(90)
, order_method_it VARCHAR(90), order_method_ko VARCHAR(90)
, order_method_ms VARCHAR(90), order_method_nl VARCHAR(90)
, order_method_no VARCHAR(90), order_method_pl VARCHAR(90)
, order_method_pt VARCHAR(90), order_method_ru VARCHAR(90)
, order_method_sc VARCHAR(90), order_method_sv VARCHAR(90)
, order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Page 29 of 91

-- lookup table with product brand info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en VARCHAR(90) NOT NULL
, product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90)
, product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90)
, product_brand_da VARCHAR(90), product_brand_el VARCHAR(90)
, product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90)
, product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90)
, product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90)
, product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90)
, product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90)
, product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90)
, product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90)
, product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- product dimension table
CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- lookup table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)

Page 30 of 91

, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- lookup table for products
CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number INT NOT NULL
, product_language VARCHAR(30) NOT NULL
, product_name VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- fact table for marketing promotions
CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
Page 31 of 91

( organization_key INT NOT NULL
, order_day_key INT NOT NULL
, rtl_country_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, sales_order_key INT NOT NULL
, quantity SMALLINT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Let's briefly explore some aspects of the CREATE TABLE statements shown here. If you have a SQL background, the majority of these statements should be familiar to you. However, after the column specification, there are some additional clauses unique to Big SQL – clauses that enable it to exploit Hadoop storage mechanisms (in this case, Hive). The ROW FORMAT clause specifies that fields are terminated by tabs ('\t') and lines are terminated by new line characters ('\n'). The table will be stored in TEXTFILE format, making it easy for a wide range of applications to work with. For details on these clauses, refer to the Apache Hive documentation.

__4. Load data into each of these tables using sample data provided in files. Change the SFTP and file path specifications in each of the following examples to match your environment. Then, one at a time, issue each LOAD statement and verify that the operation completed successfully. LOAD returns a warning message providing details on the number of rows loaded, etc.
load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE GO_REGION_DIM overwrite;

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_ORDER_METHOD_DIM overwrite;

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;
Page 32 of 91

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_DIM overwrite;

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LOOKUP overwrite;

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_SALES_FACT overwrite;

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.1/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE MRK_PROMOTION_FACT overwrite;

Let's explore the LOAD syntax shown in these examples briefly. Each example loads data into a table using a file URL specification that relies on SFTP to locate the source file. In particular, the SFTP specification includes a valid user ID and password (yourID/yourPassword), the target host server and port (myhost.ibm.com:22), and the full path of the data file on that system. The WITH SOURCE PROPERTIES clause specifies that fields in the source data are delimited by tabs ('\t'). The INTO TABLE clause identifies the target table for the LOAD operation. The OVERWRITE keyword indicates that any existing data in the table will be replaced by data contained in the source file. (If you wanted to simply add rows to the table's content, you could specify APPEND instead.)
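If you have already copied the sample files into your distributed file system, a LOAD statement can reference a DFS path instead of an SFTP URL. The following is a minimal sketch, not part of the lab script: the directory /tmp/gosalesdw_data is a hypothetical location that you would need to create and populate yourself, and the exact file URL form accepted in your environment should be confirmed against the BigInsights Knowledge Center.

```sql
-- hedged sketch: load from a (hypothetical) DFS directory instead of SFTP;
-- assumes GOSALESDW.GO_REGION_DIM.txt was already copied to /tmp/gosalesdw_data
load hadoop using file url
'/tmp/gosalesdw_data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE GO_REGION_DIM overwrite;
```

Loading from DFS avoids the SFTP transfer entirely, which is why the text notes it can offer better runtime performance.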
Using SFTP (or FTP) is one way in which you can invoke the LOAD command. If your target data already resides in your distributed file system, you can provide the DFS directory information in your file URL specification. Indeed, for optimal runtime performance, you may prefer to take that approach. See the BigInsights Knowledge Center (product documentation) for details. In addition, you can load data directly from a remote relational DBMS via a JDBC connection. The Knowledge Center includes examples of that.

__5. Query the tables to verify that the expected number of rows was loaded into each table. Execute each query that follows individually and compare the results with the number of rows specified in the comment line preceding each query.

-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;
Page 33 of 91

-- total rows in sls_order_method_dim = 7
select count(*) from sls_order_method_dim;

-- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
select count(*) from SLS_PRODUCT_BRAND_LOOKUP;

-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from SLS_PRODUCT_DIM;

-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from SLS_PRODUCT_LINE_LOOKUP;

-- total rows in SLS_PRODUCT_LOOKUP = 6302
select count(*) from SLS_PRODUCT_LOOKUP;

-- total rows in SLS_SALES_FACT = 446023
select count(*) from SLS_SALES_FACT;

-- total rows in MRK_PROMOTION_FACT = 11034
select count(*) from MRK_PROMOTION_FACT;

4.2. Querying the data with Big SQL

Now you're ready to query your tables. Based on earlier exercises, you've already seen that you can perform basic SQL operations, including projections (to extract specific columns from your tables) and restrictions (to extract specific rows meeting certain conditions you specified). Let's explore a few examples that are a bit more sophisticated.

In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations. Note that the queries included in this section are based on queries shipped as samples.

__1. Join data from multiple tables to return the product name, quantity, and order method of goods that have been sold. For simplicity, limit the number of returned rows to 20. To achieve this, execute the following query:

-- Fetch the product name, quantity, and order method of products sold.
-- Query 1
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM
  sls_sales_fact sales,
  sls_product_dim prod,
  sls_product_lookup pnumb,
  sls_order_method_dim meth
WHERE
  pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key
Page 34 of 91

fetch first 20 rows only;

Let's review a few aspects of this query briefly:

 Data from four tables will be used to drive the results of this query (see the tables referenced in the FROM clause). Relationships between these tables are resolved through 3 join predicates specified as part of the WHERE clause. The query relies on 3 equi-joins to filter data from the referenced tables. (Predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in two tables.)

 For improved readability, this query uses aliases in the SELECT and FROM clauses when referencing tables. For example, pnumb.product_name refers to "pnumb," which is the alias for the sls_product_lookup table. Once defined in the FROM clause, an alias can be used in the WHERE clause so that you do not need to repeat the complete table name.

 The predicate pnumb.product_language='EN' helps to further narrow the result to only English output. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.

__2. Modify the query to restrict the order method to one type – those involving a Sales visit. To do so, add the following query predicate just before the FETCH FIRST 20 ROWS clause: AND order_method_en='Sales visit'

-- Fetch the product name, quantity, and order method
-- of products sold through sales visits.
-- Query 2
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM
  sls_sales_fact sales,
  sls_product_dim prod,
  sls_product_lookup pnumb,
  sls_order_method_dim meth
WHERE
Page 35 of 91

  pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key
  AND order_method_en='Sales visit'
FETCH FIRST 20 ROWS ONLY;

__3. Inspect the results:

__4. To find out which sales method of all the methods has the greatest quantity of orders, include a GROUP BY clause (group by pll.product_line_en, md.order_method_en). In addition, invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. Finally, this query cleans up the output a bit by using aliases (e.g., as Product) to substitute a more readable column header.

-- Query 3
SELECT pll.product_line_en AS Product,
  md.order_method_en AS Order_method,
  sum(sf.QUANTITY) AS total
FROM
  sls_order_method_dim AS md,
  sls_product_dim AS pd,
  sls_product_line_lookup AS pll,
  sls_product_brand_lookup AS pbl,
  sls_sales_fact AS sf
WHERE
  pd.product_key = sf.product_key
  AND md.order_method_key = sf.order_method_key
  AND pll.product_line_code = pd.product_line_code
  AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;

__5. Inspect the results, which should contain 35 rows. A portion is shown below.
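The grouped totals come back in no particular order, so answering "which method has the greatest quantity?" still requires scanning all 35 rows. If you'd like the top combinations to appear first, you can sort the aggregated results. The variant below is a sketch based on Query 3, not part of the original lab script; it keeps the same joins and simply adds ORDER BY and a row limit.

```sql
-- sketch: sort Query 3's totals so the top-selling
-- product line / order method combinations appear first
SELECT pll.product_line_en AS Product,
  md.order_method_en AS Order_method,
  sum(sf.QUANTITY) AS total
FROM
  sls_order_method_dim AS md,
  sls_product_dim AS pd,
  sls_product_line_lookup AS pll,
  sls_product_brand_lookup AS pbl,
  sls_sales_fact AS sf
WHERE
  pd.product_key = sf.product_key
  AND md.order_method_key = sf.order_method_key
  AND pll.product_line_code = pd.product_line_code
  AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en
ORDER BY total DESC
FETCH FIRST 5 ROWS ONLY;
```

Because ORDER BY here operates on the aggregated result (35 rows), the extra sort cost is negligible.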
Page 36 of 91

4.3. Creating and working with views

Big SQL supports views (virtual tables) based on one or more physical tables. In this section, you will create a view that spans multiple tables. Then you'll query this view using a simple SELECT statement. In doing so, you'll see that you can work with views in Big SQL much as you can work with views in a relational DBMS.

__1. Create a view named MYVIEW that extracts information about product sales featured in marketing promotions. By the way, since the schema name is omitted in both the CREATE and FROM object names, the current schema (your user name) is assumed.

create view myview as
select product_name, sales.product_key, mkt.quantity,
  sales.order_day_key, sales.sales_order_key, order_method_en
from
  mrk_promotion_fact mkt,
  sls_sales_fact sales,
  sls_product_dim prod,
  sls_product_lookup pnumb,
  sls_order_method_dim meth
where
  mkt.order_day_key=sales.order_day_key
  and sales.product_key=prod.product_key
  and prod.product_number=pnumb.product_number
  and pnumb.product_language='EN'
  and meth.order_method_key=sales.order_method_key;

__2. Now query the view:

select * from myview
order by product_key asc, order_day_key asc
fetch first 20 rows only;

__3. Inspect the results:
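Because Big SQL records views in the same catalog views you queried in the JSqsh lab, you can optionally confirm that MYVIEW was registered. This is a small sketch, assuming the SYSCAT catalog behaves as in the earlier syscat.tables exercises (where a TYPE value of 'V' denotes a view).

```sql
-- sketch: confirm the view appears in the catalog
select tabschema, tabname, type
from syscat.tables
where tabname = 'MYVIEW';
```

If the row comes back with TYPE 'V', the view was created in your current schema as expected.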
Page 37 of 91

4.4. Populating a table with 'INSERT INTO … SELECT'

Big SQL enables you to populate a table with data based on the results of a query. In this exercise, you will use an INSERT INTO . . . SELECT statement to retrieve data from multiple tables and insert that data into another table. Executing an INSERT INTO . . . SELECT exploits the machine resources of your cluster because Big SQL can parallelize both read (SELECT) and write (INSERT) operations.

__1. Execute the following statement to create a sample table named sales_report:

-- create a sample sales_report table
CREATE HADOOP TABLE sales_report
( product_key INT NOT NULL
, product_name VARCHAR(150)
, quantity INT
, order_method_en VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

__2. Now populate the newly created table with results from a query that joins data from multiple tables.

-- populate the sales_report data with results from a query
INSERT INTO sales_report
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM
  sls_sales_fact sales,
  sls_product_dim prod,
Page 38 of 91

  sls_product_lookup pnumb,
  sls_order_method_dim meth
WHERE
  pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key
  and sales.quantity > 1000;

__3. Verify that the previous query was successful by executing the following query:

-- total number of rows should be 14441
select count(*) from sales_report;

4.5. Optional: Storing data in an alternate file format (Parquet)

Until now, you've instructed Big SQL to use the TEXTFILE format for storing data in the tables you've created. This format is easy to read (both by people and most applications), as data is stored in a delimited form with one record per line and new lines separating individual records. It's also the default format for Big SQL tables. However, if you'd prefer to use a different file format for data in your tables, Big SQL supports several formats popular in the Hadoop environment, including Avro, sequence files, RC (record columnar), and Parquet.

While it's beyond the scope of this lab to explore these file formats, you'll learn how you can easily override the default Big SQL file format to use another format -- in this case, Parquet. Parquet is a columnar storage format for Hadoop that's popular because of its support for efficient compression and encoding schemes. For more information on Parquet, visit http://parquet.io/.

__1. Create a table named big_sales_parquet.

CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL
, product_name VARCHAR(150)
, quantity INT
, order_method_en VARCHAR(90)
)
STORED AS parquetfile;

With the exception of the final line (which specifies the PARQUETFILE format), all aspects of this statement should be familiar to you by now.

__2. Populate this table with data based on the results of a query. Note that this query joins data from 4 tables you previously defined in Big SQL using a TEXTFILE format.
Big SQL will automatically reformat the result set of this query into Parquet format for storage.

insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM
  sls_sales_fact sales,

Page 39 of 91

  sls_product_dim prod,
  sls_product_lookup pnumb,
  sls_order_method_dim meth
WHERE
  pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key
  and sales.quantity > 5500;

__3. Query the table. Note that your SELECT statement does not need to be modified in any way because of the underlying file format.

select * from big_sales_parquet;

__4. Inspect the results, a subset of which are shown below. The query should return 471 rows.

4.6. Optional: Working with external tables

The previous exercises in this lab caused Big SQL to store tables in a default location (in the Hive warehouse). Big SQL also supports the concept of an externally managed table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory contains all the table's data in files. As part of this exercise, you will create a DFS directory, upload data into it, and then create a Big SQL table over this directory. To satisfy queries, Big SQL will look in the user directory specified when you created the table and consider all files in that directory to be the table's contents. Once the table is created, you'll query it.
Page 40 of 91

__1. If necessary, open a terminal window.

__2. Check the directory permissions for your DFS.

hdfs dfs -ls /

If the /user directory cannot be written by the public (as shown in the example above), you will need to change these permissions so that you can create the necessary subdirectories for this lab using your standard lab user account.

From the command line, issue this command to switch to the root user ID temporarily:

su root

When prompted, enter the password for this account. Then switch to the hdfs ID.

su hdfs

While logged in as user hdfs, issue this command:

hdfs dfs -chmod 777 /user

Next, confirm the effect of your change:

hdfs dfs -ls /

Exit the hdfs user account:

exit

Finally, exit the root user account and return to your standard user account:

exit
Page 41 of 91

__3. Create directories in your distributed file system for the source data files and ensure public read/write access to these directories. (If desired, alter the DFS information as appropriate for your environment.)

hdfs dfs -mkdir /user/bigsql_lab
hdfs dfs -mkdir /user/bigsql_lab/sls_product_dim

__4. Upload the source data files into their respective DFS directories. Change the local and DFS directory information below to match your environment.

hdfs dfs -copyFromLocal /your-dir/data/GOSALESDW.SLS_PRODUCT_DIM.txt /user/bigsql_lab/sls_product_dim/SLS_PRODUCT_DIM.txt

__5. List the contents of the DFS directory into which you copied the file to validate your work.

hdfs dfs -ls /user/bigsql_lab/sls_product_dim

__6. Allow public access to this directory.

hdfs dfs -chmod -R 777 /user/bigsql_lab

__7. From your query execution environment (such as JSqsh), create an external Big SQL table for the sales product dimension (sls_product_dim_external). Note that the LOCATION clause references the DFS directory into which you copied the sample data.

-- product dimension table stored in a DFS directory external to Hive
CREATE EXTERNAL HADOOP TABLE IF NOT EXISTS sls_product_dim_external
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
location '/user/bigsql_lab/sls_product_dim';
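One characteristic of externally managed tables worth noting: dropping the table removes only its definition from the catalog; the files in the LOCATION directory remain in your DFS. The sketch below is offered as eventual cleanup, not as a lab step -- run it only after you have finished querying the table.

```sql
-- sketch: cleanup after the lab -- the table definition goes away,
-- but the files under /user/bigsql_lab/sls_product_dim remain in DFS
-- (verify afterwards with: hdfs dfs -ls /user/bigsql_lab/sls_product_dim)
DROP TABLE sls_product_dim_external;
```

This behavior follows from the external-table model described above: Big SQL treats the directory's files as the table's contents but does not own them.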
  41. 41. Page 42 of 91 __8. Query the table. select product_key, introduction_date from sls_product_dim_external where discontinued_date is not null fetch first 20 rows only; __9. Inspect the results. 4.7. Optional: Creating and querying the full sample database Big SQL ships with sample SQL scripts for creating, populating, and querying more than 60 tables. These tables are part of the GOSALESDW schema -- a schema that differs from the one used in this lab. (You created tables in the default schema – i.e., your user ID's schema. Since the JSqsh connection you created earlier used the bigsql ID, your schema was BIGSQL.) If desired, use standard Linux operating system facilities to inspect the SQL scripts and sample data for the GOSALESDW schema in the samples directory. By default, this location is /usr/ibmpacks/bigsql/4.1/bigsql/samples. Within this directory, you'll find subdirectories containing (1) the full sample data for the GOSALESDW tables and (2) a collection of SQL scripts for creating, loading, and querying these tables. Feel free to use the supplied scripts to create all GOSALESDW tables, load data into these tables, and query these tables. Note that you may need to modify portions of these scripts to match your environment.
Page 43 of 91

Lab 5 Working with complex data types

The previous lab on Querying Structured Data demonstrated how you can create and populate Big SQL tables with columns of primitive data types, such as INT, VARCHAR(), and so on. Big SQL also supports columns based on complex data types, specifically:

 ARRAY, an ordered collection of values of the same data type.
 MAP, a collection of key-value pairs. MAPs are also known as associative arrays.
 STRUCT, a collection of fields of various data types. A STRUCT type is also known as a ROW type.

This lab introduces you to Big SQL's support for complex types, employing Hive-compatible syntax in its exercises. (For additional syntax options, see the product Knowledge Center.) To help you readily understand how to work with complex types, you will create tables that are slight variations of some of the tables you created in the previous lab on Querying Structured Data.

Prior to starting this lab, you must know how to connect to your Big SQL server and execute SQL from a supported tool. If necessary, complete the earlier lab on JSqsh before proceeding.

After you complete all the lessons in this module, you will understand how to:

 Create Big SQL tables that contain columns of ARRAY, MAP, and STRUCT types.
 Populate these tables with data using LOAD and INSERT INTO . . . SELECT statements.
 Query these tables using projection and restriction operators.

Allow ½ - 1 hour to complete this lab.

5.1. Working with the ARRAY type

In this lesson, you will explore Big SQL's ARRAY type, which stores values of the same data type as an ordered collection in a single column of a table. First, you will create a variation of the sls_product_line_lookup table that you created earlier. Your new table will include 2 columns: an integer column for the product line code and an array column of varying-length character strings for product line descriptions in 3 different languages.
Next, you'll load data into this table from a file and query the table. __1. If desired, examine the definition of the sls_product_line_lookup table that you created in the Querying Structured Data lab to get some context for the table you will create in this exercise. __2. In your local file system, create a sample data file named product_line_array.csv using your favorite editor. Instructions in this lab are based on the vi editor. vi product_line_array.csv
Page 44 of 91

In vi, enter insert mode. Type

i

Cut and paste the following content into the file:

991|Camping Equipment,Campingausrüstung,Matériel de camping
992|Mountaineering Equipment,Bergsteigerausrüstung,Matériel de montagne
993|Personal Accessories,Accessoires,Accessoires personnels
994|Outdoor Protection,Outdoor-Schutzausrüstung,Articles de protection
995|Golf Equipment,Golfausrüstung,Matériel de golf

For simplicity, this sample data represents a subset of the data contained in the sls_product_line_lookup table. To exit the file, press Esc and then enter :wq

__3. Inspect the contents of your file, ignoring any special language-specific characters that do not display properly from the shell:

cat product_line_array.csv

__4. Launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

__5. Create the sls_product_line_lookup_array table, specifying a single ARRAY column for all of the product descriptions:

create hadoop table if not exists sls_product_line_lookup_array
( product_line_code INT NOT NULL
, description ARRAY<VARCHAR(50)>
);

__6. Load the sample data from your product_line_array.csv file into this table. Adjust the FILE URL specification below as needed to match your environment, including the ID, password, host name, and directory information.

LOAD HADOOP USING FILE URL
'sftp://yourID:yourPassword@myhost.ibm.com:22/yourDir/product_line_array.csv'
WITH SOURCE PROPERTIES ('field.delimiter'='|', 'collection.items.delimiter'=',')
INTO TABLE sls_product_line_lookup_array OVERWRITE;

Most of the LOAD HADOOP statement shown here should be familiar to you by now, so let's concentrate on the SOURCE PROPERTIES clause. The properties specified in this clause must correspond to the characteristics of your source data file.
For example, the field.delimiter property of this LOAD HADOOP statement indicates that the source file contains a pipe character ("|") to separate field (column) values. The collection.items.delimiter property indicates how each element of the complex data type (an array, in this case) is delineated: in this example, by a comma. Consult the product Knowledge Center for additional information about using LOAD HADOOP with complex data types.
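If you're new to these delimiters, it can help to see the parsing logic spelled out. The sketch below is ordinary Python, not part of Big SQL or the LOAD HADOOP implementation; the parse_row helper is a hypothetical name used only to illustrate how the two delimiters carve each record into an INT column plus an ARRAY column.

```python
# Illustrative sketch: how 'field.delimiter' and 'collection.items.delimiter'
# conceptually split one record from product_line_array.csv.
FIELD_DELIM = "|"   # 'field.delimiter'
ITEMS_DELIM = ","   # 'collection.items.delimiter'

def parse_row(line):
    # Split the record into top-level fields, then split the collection field
    # into its individual array elements.
    code, items = line.rstrip("\n").split(FIELD_DELIM)
    return int(code), items.split(ITEMS_DELIM)

code, descriptions = parse_row("995|Golf Equipment,Golfausrüstung,Matériel de golf")
print(code)             # 995
print(descriptions[0])  # Golf Equipment
```

One difference worth noting: Python lists are 0-based, while Big SQL array elements are referenced with 1-based subscripts, so descriptions[0] here corresponds to description[1] in the queries that follow.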
__7. Inspect the informational message that appears after the LOAD operation completes. Verify that the number of rows loaded into this table was 5 and that 0 rows were skipped.

__8. Query the table, retrieving only the product line code, English description, and German description for product lines with English descriptions that begin with the word "Personal":

select product_line_code, description[1] as English, description[2] as German
from sls_product_line_lookup_array
where description[1] like 'Personal%';

Note that this query effectively "flattens" the array's content by identifying specific elements of interest and including each in a separate column of the result set. In this case, the query retrieves the first element of the array (which contains the product line's English description) and the second element of the array (which contains the product line's German description).

__9. Inspect the results:

__10. Attempt to query the table using improper syntax for the array column:

select * from sls_product_line_lookup_array;

__11. Inspect the error message that appears. When you include a complex column in the SELECT list or WHERE clause of a query, you must specify the individual element or field of interest.

5.2. Working with the MAP type

Now that you know how to work with ARRAYs, it's time to explore the MAP type. In this lesson, you will revisit the design of the product line table that you just created and implement a slightly different model. Instead of storing descriptions in an array, you'll use key-value pairs in a MAP column. One advantage of this design is that you won't need to remember the order in which the language descriptions are stored in your column. Instead of querying for description[1] to get data in English, you'll be able to query the data using a more intuitive key value.

__1. In your local file system, create a sample data file named product_line_map.csv using your favorite editor. Instructions in this lab are based on the vi editor.

vi product_line_map.csv

In vi, enter insert mode. Type

i

Cut and paste the following content into the file:

991|EN#Camping Equipment,DE#Campingausrüstung,FR#Matériel de camping
992|EN#Mountaineering Equipment,DE#Bergsteigerausrüstung,FR#Matériel de montagne
993|EN#Personal Accessories,DE#Accessoires,FR#Accessoires personnels
994|EN#Outdoor Protection,DE#Outdoor-Schutzausrüstung,FR#Articles de protection
995|EN#Golf Equipment,DE#Golfausrüstung,FR#Matériel de golf

Note that the data values shown here are the same as those used in the previous example. However, an additional delimiter appears. Specifically, the "#" symbol separates the key from the data value of each key-value pair.

To exit the file, press Esc and then enter

:wq

__2. Inspect the contents of your file, ignoring any special language-specific characters that do not display properly from the shell:

cat product_line_map.csv

__3. Launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

__4. Create the sls_product_line_lookup_map table, specifying a MAP column for the product line descriptions:

create hadoop table if not exists sls_product_line_lookup_map
(product_line_code INT NOT NULL,
description MAP<VARCHAR(2),VARCHAR(50)>);

__5. Load the sample data from your product_line_map.csv file into this table. Adjust the FILE URL specification below as needed to match your environment.
LOAD HADOOP USING FILE URL 'sftp://yourID:yourPassword@myhost.ibm.com:22/yourDir/product_line_map.csv' WITH SOURCE PROPERTIES ('field.delimiter'='|', 'collection.items.delimiter'=',', 'map.keys.delimiter'='#') INTO TABLE sls_product_line_lookup_map OVERWRITE;
Note that the SOURCE PROPERTIES clause in this example sets the map.keys.delimiter property to the "#" symbol. If you inspect the contents of the sample data file, you'll note that this symbol separates the key and data values of the key-value pairs to be included in the MAP column. The field.delimiter and collection.items.delimiter properties are also set to values appropriate for the sample file.

__6. Inspect the informational message that appears after the LOAD operation completes. Verify that the number of rows loaded into this table was 5 and that 0 rows were skipped.

__7. Query the table, retrieving only the product line code, English description, and German description for product lines with English descriptions that begin with the word "Personal":

select product_line_code, description['EN'] as English, description['DE'] as German
from sls_product_line_lookup_map
where description['EN'] like 'Personal%';

Compare this query to its equivalent in the previous exercise, which modeled the description data in an ARRAY column. Here, you retrieve the English description of a product line by referencing the key contained in the complex column.

__8. Inspect the results.

__9. Attempt to query the table using improper syntax.

select * from sls_product_line_lookup_map;

Once again, an error message appears. Recall that a MAP is an associative array, so you must specify the map key of interest when referencing the complex column in the SELECT list or WHERE clause of your query.
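As with the ARRAY example, a short sketch may clarify how the three delimiters interact. This is plain illustrative Python, not Big SQL internals; parse_row is a hypothetical helper used only for demonstration.

```python
# Illustrative sketch: splitting one record from product_line_map.csv using
# the three delimiters named in the SOURCE PROPERTIES clause.
FIELD_DELIM = "|"   # 'field.delimiter'
ITEMS_DELIM = ","   # 'collection.items.delimiter'
KEY_DELIM   = "#"   # 'map.keys.delimiter'

def parse_row(line):
    code, pairs = line.rstrip("\n").split(FIELD_DELIM)
    # Each item is a key#value pair; build a dict from them.
    descriptions = dict(pair.split(KEY_DELIM, 1) for pair in pairs.split(ITEMS_DELIM))
    return int(code), descriptions

code, descriptions = parse_row("995|EN#Golf Equipment,DE#Golfausrüstung,FR#Matériel de golf")
print(descriptions["EN"])  # Golf Equipment
```

The dict lookup mirrors description['EN'] in the query above: the key, rather than an ordinal position, selects the language.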
5.3. Working with the STRUCT type

In this lesson, you will explore Big SQL's support for the STRUCT or ROW type. As the name implies, this type enables you to store a structured collection of different data types in a single column. In this lab module, you will create a variation of the go_region_dim table that you created in an earlier lab. Your new table will include 3 columns: an integer column for the country key, an integer column for the country code, and a STRUCT column for the collection of ISO codes that were represented as separate columns in the original table.

__1. If desired, examine the definition of the go_region_dim table that you created in the Querying Structured Data lab to get some context for the table you will create in this exercise.

__2. In your local file system, create a sample data file named go_region_dim_struct.csv using your favorite editor. Instructions in this lab are based on the vi editor.

vi go_region_dim_struct.csv

In vi, enter insert mode. Type

i

Cut and paste the following content into the file:

90001,1003,USA|US|840
90002,1004,CAN|CA|124
90003,1020,MEX|MX|484

To exit the file, press Esc and then enter

:wq

__3. Inspect the contents of your file.

cat go_region_dim_struct.csv

__4. Launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

__5. Create the go_region_dim_struct table, specifying a single STRUCT column for all the ISO codes. Note that the first 2 elements of this column are VARCHARs, while the final element is an INT.

create hadoop table if not exists go_region_dim_struct
(country_key INT NOT NULL,
country_code INT NOT NULL,
isocodes STRUCT<letter3: VARCHAR(3), letter2: VARCHAR(2), digit: INT> );

__6. Load the sample data from your go_region_dim_struct.csv file into this table. Adjust the FILE URL specification below as needed to match your environment.
LOAD HADOOP USING FILE URL
'sftp://yourID:yourPassword@myhost.ibm.com:22/yourDir/go_region_dim_struct.csv'
WITH SOURCE PROPERTIES ('field.delimiter'=',','collection.items.delimiter'='|')
INTO TABLE go_region_dim_struct OVERWRITE;

__7. Inspect the informational message that appears after the LOAD operation completes. Verify that the number of rows loaded into this table was 3 and that 0 rows were skipped.

__8. Query the table, retrieving only the country key, 2-letter ISO code, and ISO digit for the country with a 3-letter ISO code of "CAN" (Canada):

select country_key, isocodes.letter2 as l2, isocodes.digit as digit
from go_region_dim_struct
where isocodes.letter3 = 'CAN';

__9. Inspect the results.

__10. Attempt to query the table using improper syntax for the STRUCT column.

select * from go_region_dim_struct;

__11. Inspect the error message that appears. As you've probably learned by now, when you query a complex column, you must specify the individual element or field of interest.

5.4. Optional: Exploring additional scenarios

This optional module explores two additional scenarios involving complex types. The first scenario involves a multi-table join that retrieves data from columns containing complex data types. The second scenario involves using a query result set to populate a table that contains a complex type column.

Before beginning this lab, verify that you have completed previous exercises for creating and populating the following tables: go_region_dim_struct, sls_product_line_lookup_map, mrk_promotion_fact, and sls_product_dim. The first two tables just mentioned are part of this lab on complex data types. The final two tables are part of an earlier lab on Querying Structured Data.

__1. If necessary, launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

__2. Execute the following query, which joins data from multiple tables to obtain information about the results of marketing promotions and their associated product lines in countries with an ISO digit code of < 500. (Given the sample data that you loaded earlier into the go_region_dim_struct table, those countries are Canada and Mexico.)

select rs.country_key, rs.isocodes.letter3 as iso3code,
m.order_day_key, m.retailer_key, m.product_key,
pl.product_line_code, pl.description['EN'] as Eng_Desc,
m.promotion_key, m.sales_order_key, m.sale_total, m.gross_profit
from go_region_dim_struct rs, mrk_promotion_fact m, sls_product_dim p, sls_product_line_lookup_map pl
where rs.country_key = m.rtl_country_key
and p.product_key = m.product_key
and p.product_line_code = pl.product_line_code
and rs.isocodes.digit < 500
fetch first 10 rows only;

Note that the SELECT list and WHERE clause of this query reference columns of both primitive and complex data types.

__3. Inspect the results.

__4. Next, explore your ability to use an INSERT … SELECT statement to populate a table containing columns of primitive and complex types. To begin, create a sample table named mycomplextest.
create hadoop table mycomplextest
(country_key int,
isocodes struct<letter3: varchar(3), letter2: varchar(2), digit: int>,
order_day_key int,
retailer_key int,
product_key int,
product_line_code int,
product_EN_desc varchar(50) );

Note that this table’s definition is very similar to the structure of the query result set you just retrieved. For simplicity, it omits some of the primitive columns. However, it does include one complex column of type STRUCT for all forms of ISO codes.

__5. Populate this table with data from a query.

insert into mycomplextest
select rs.country_key, rs.isocodes,
m.order_day_key, m.retailer_key, m.product_key,
pl.product_line_code, pl.description['EN']
from go_region_dim_struct rs, mrk_promotion_fact m, sls_product_dim p, sls_product_line_lookup_map pl
where rs.country_key = m.rtl_country_key
and p.product_key = m.product_key
and p.product_line_code = pl.product_line_code
and rs.isocodes.digit < 500
fetch first 10 rows only;

Examine this INSERT statement, noting the columns in the SELECT list. As you can see, rs.isocodes refers to a column of type STRUCT in the go_region_dim_struct table. Thus, the entire contents of this complex column will be inserted into the mycomplextest.isocodes column. (Because the SELECT list is used in the context of an INSERT statement, it's acceptable to reference the entire rs.isocodes column without further qualification -- i.e., you don't need to reference only a specific element or field of this column, as you would need to do if this SELECT statement wasn't part of a larger INSERT statement.)

Note also that the SELECT list references the pl.description column, which was defined as type MAP in the sls_product_line_lookup_map table. In this case, rather than inserting the entire contents of this MAP column into the mycomplextest table, the query will extract only the key value for ‘EN’ (English) descriptions. This was done simply to illustrate that Big SQL’s INSERT … SELECT support enables you to work with the full content of a complex data type column or only specific fields or elements of a complex column.

__6. Query the table:
select country_key, isocodes.letter3 as iso3code,
order_day_key, retailer_key, product_key,
product_line_code, product_EN_desc
from mycomplextest
order by country_key, iso3code;

__7. Inspect the results:

As you might imagine, there’s more to Big SQL complex types than this lab covers. Consult the product documentation (Knowledge Center) for further details.
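To round out the complex-type picture, here is the same style of sketch for the STRUCT column from section 5.3. Again, this is illustrative Python only, not Big SQL code; IsoCodes and parse_row are hypothetical names chosen for the example.

```python
from collections import namedtuple

# Illustrative sketch: splitting one record from go_region_dim_struct.csv,
# where ',' is the field delimiter and '|' separates the STRUCT's elements.
IsoCodes = namedtuple("IsoCodes", ["letter3", "letter2", "digit"])

def parse_row(line):
    country_key, country_code, codes = line.rstrip("\n").split(",")
    letter3, letter2, digit = codes.split("|")
    return int(country_key), int(country_code), IsoCodes(letter3, letter2, int(digit))

key, code, iso = parse_row("90002,1004,CAN|CA|124")
print(iso.letter2, iso.digit)  # CA 124
```

Dot notation on the named tuple (iso.letter2) parallels the isocodes.letter2 syntax used in the Big SQL queries above.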
Lab 6 Working with Non-Traditional Data using SerDes

While data structured in CSV and TSV files is often stored in Hadoop and loaded into Big SQL tables, you may also need to work with other types of data – data that might require the use of a serializer / deserializer (SerDe). SerDes are common in Hadoop environments. You’ll find a number of SerDes available in the public domain, or you can write your own following typical Hadoop practices.

Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need, register the JAR file so that it becomes available for use. Then you can specify the SerDe class name when you create your table. (Note: If needed, look in your JAR file to determine the class name of the SerDe you’ll be using. The CREATE TABLE statement requires the class name, not the JAR file name.)

In this lab exercise, you will use a SerDe to define a table for blog data collected in JSON (JavaScript Object Notation) format. JSON files have a nested, varied structure defined by the user or application that created them. The JSON file for this exercise includes blog data about IBM Watson extracted from various social media sites. The sample data is included in a .zip file available for free download as part of a developerWorks article on Analyzing Social Media and Structured Data with InfoSphere BigInsights. (The URL for this article is http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html?ca=dat). Before beginning this lab, download the .zip file and extract the blogs-data.txt file to your local file system.

After you complete the lessons in this module, you will understand how to:

• Register a SerDe with Big SQL
• Create a Big SQL table that uses a SerDe for processing JSON data
• Populate a Big SQL table with JSON data
• Query this Big SQL table

Allow 15 - 30 minutes to complete this lab.

6.1. Registering a SerDe

In this exercise, you will provide a JSON-based SerDe to Big SQL and Hive so that you can later create a table that relies on this SerDe.

__1. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system, such as /home/yourDir. (As of this writing, the full URL for this SerDe is https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar)

__2. Copy the SerDe .jar file to the …/lib directory in which your Hive server was installed. (By default, this directory is in /usr/iop/current/hive-server2 on the node in which you installed the Hive server.) Adjust the path specification below as needed to match your environment.

cp /yourDir/hive-json-serde-0.2.jar /usr/iop/current/hive-server2/lib

__3. On each Big SQL node in your cluster, copy the SerDe .jar file to the Big SQL …/userlib directory. (By default, this directory is in /usr/ibmpacks/bigsql/4.1/bigsql on each node on which you installed Big SQL.) Adjust the path specification below as needed to match your environment.
cp /yourDir/hive-json-serde-0.2.jar /usr/ibmpacks/bigsql/4.1/bigsql/userlib

__4. Restart all Big SQL services using one of these two approaches:

__a. Use the Ambari console to restart the service. (See instructions in a previous lab for details on how to administer services through Ambari.)

__b. Use the command line to stop and then start the service. On the Big SQL Head node, switch to the bigsql user ID (su bigsql). Stop Big SQL with this command: /usr/ibmpacks/bigsql/4.1/bigsql/bin/bigsql stop. After the command successfully completes, start Big SQL again: /usr/ibmpacks/bigsql/4.1/bigsql/bin/bigsql start.

6.2. Creating, populating, and querying a table that uses a SerDe

Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that relies on the SerDe you just registered. For simplicity, this will be an externally managed table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory will contain all the table's data in files. As part of this exercise, you will upload the sample blogs-data.txt file into the target DFS directory.

Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you created the table and consider all files in that directory to be the table’s contents. This is consistent with the Hive concept of an externally managed table. Once the table is created, you'll query that table. In doing so, you'll note that the presence of a SerDe is transparent to your queries.

__1. If necessary, download the .zip file containing the sample data from the bottom half of the article referenced in the introduction. Unzip the file into a directory of your choice on your local file system. You will be working with the blogs-data.txt file.

__2. From a terminal window, create a DFS directory for the source data file.

hdfs dfs -mkdir /user/bigsql_lab/blogs

__3. Copy the source data file (blogs-data.txt) from your local file system into this new DFS directory. (Change the path information below as needed to match your environment.)

hdfs dfs -copyFromLocal /yourDir/blogs-data.txt /user/bigsql_lab/blogs

__4. List the contents of the DFS directory into which you copied the file to validate your work.

hdfs dfs -ls /user/bigsql_lab/blogs

__5. Allow public access to the file you just uploaded.

hdfs dfs -chmod -R 777 /user/bigsql_lab/blogs
__6. From your query execution environment (such as JSqsh), create an external Big SQL table for the blogs data. Execute the following statement, which creates a TESTBLOGS table that includes a LOCATION clause that specifies the DFS directory containing your sample blogs-data.txt file:

create hadoop table if not exists testblogs (
Country String,
Crawled String,
FeedInfo String,
Inserted String,
IsAdult int,
Language String,
Postsize int,
Published String,
SubjectHtml String,
Tags String,
Type String,
Url String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/user/bigsql_lab/blogs';

__7. Finally, query the table using the following statement.

select * from testblogs
where subjecthtml is not null
fetch first 5 rows only;

A subset of the output is shown below.

About this code . . .

The CREATE HADOOP TABLE statement specifies the class in the SerDe .jar file that is responsible for processing the input record into a "row" that Big SQL (and Hive) can understand. Because you registered the SerDe .jar file following the instructions in an earlier exercise, the Big SQL runtime engine will be able to locate this class in the .jar file and successfully execute the CREATE HADOOP TABLE statement.

Quite commonly, new users will specify the .jar file in the CREATE HADOOP TABLE statement instead of the class file. Doing so will result in a runtime error. Another common runtime error is SQLCODE=-5199, SQLSTATE=57067: The statement failed because a connection to a Hadoop I/O component could not be established or maintained. Hadoop I/O component name: "Java DFSIO". Reason code: "1". This error typically indicates a problem with the SerDe's registration. Verify that you (1) copied the SerDe .jar file to the proper Hive directory and the proper Big SQL directories on all nodes and (2) restarted Big SQL before attempting to use the SerDe that you just registered.
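Conceptually, a JSON SerDe's job at query time is simple: turn each JSON record in the file into a row whose values line up with the table's columns. The sketch below is plain Python, not the SerDe's actual code; the deserialize helper and the column list are hypothetical names used only to capture the idea.

```python
import json

# Illustrative sketch of what a JSON SerDe does per record: match JSON fields
# to table columns by (case-insensitive) name and emit a row, with None
# (SQL NULL) for any column the record doesn't supply.
COLUMNS = ["country", "language", "subjecthtml", "url"]

def deserialize(record):
    doc = {key.lower(): value for key, value in json.loads(record).items()}
    return tuple(doc.get(col) for col in COLUMNS)

row = deserialize('{"Country": "US", "Language": "en", "SubjectHtml": "IBM Watson news"}')
print(row)  # ('US', 'en', 'IBM Watson news', None)
```

This is why the SerDe is transparent to your queries: by the time the SELECT runs, each JSON record already looks like an ordinary row.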
Lab 7 Understanding and influencing data access plans

As you may already know, query optimization significantly influences runtime performance. In this lab, you’ll learn how to use the ANALYZE TABLE command to collect statistics about your data so that the Big SQL query optimizer can make well-informed decisions when choosing between various options for data access. Collecting and maintaining accurate statistics is highly recommended when dealing with large volumes of data (but less critical for this sample lab). After collecting statistics for several sample tables that you created earlier, you will use the Big SQL EXPLAIN feature to examine the data access plan the query optimizer chose for a given query. Performance specialists often consult data access plans to help them tune their environments.

After completing this lab, you will know how to:

• Collect metadata (statistics) about your data.
• Collect and review data access plans for your queries.

Prior to beginning this lab, you must have created and populated several tables with data as described in an earlier lab on Querying Structured Data. Allow 30 minutes to complete this lab.

Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop. Special thanks to Raanon Reutlinger for his contributions to an earlier version of this lab.

7.1. Collecting statistics with the ANALYZE TABLE command

The ANALYZE TABLE command collects statistics about your Big SQL tables. These statistics influence query optimization, enabling the Big SQL query engine to select an efficient data access path to satisfy your query.

__1. If needed, launch JSqsh and connect to your Big SQL database.

__2. Collect statistics for the tables you created in an earlier lab on Querying Structured Data. Issue each of the following commands individually, allowing the operation to complete.
Depending on your machine resources, this may take several minutes or more.

ANALYZE TABLE sls_sales_fact COMPUTE STATISTICS FOR COLUMNS product_key, order_method_key;

ANALYZE TABLE sls_product_dim COMPUTE STATISTICS FOR COLUMNS product_key, product_number, product_line_code, product_brand_code;

ANALYZE TABLE sls_product_lookup COMPUTE STATISTICS FOR COLUMNS product_number, product_language;

ANALYZE TABLE sls_order_method_dim COMPUTE STATISTICS FOR COLUMNS order_method_key, order_method_en;

ANALYZE TABLE sls_product_line_lookup COMPUTE STATISTICS FOR COLUMNS product_line_code, product_line_en;

ANALYZE TABLE sls_product_brand_lookup COMPUTE STATISTICS FOR COLUMNS product_brand_code;

7.2. Understanding your data access plan (EXPLAIN)

The EXPLAIN feature enables you to inspect the data access plan selected by the Big SQL optimizer for your query. Such information is highly useful for performance tuning. This exercise introduces you to EXPLAIN, a Big SQL feature that stores metadata in a set of EXPLAIN tables.

__1. If necessary, launch your query execution environment and connect to your Big SQL database.

__2. To create the necessary EXPLAIN tables to hold information about your query plans, call the SYSINSTALLOBJECTS procedure. The example below shows one way in which this procedure can be invoked. Casting a NULL in the last parameter causes a single set of EXPLAIN tables to be created in schema SYSTOOLS, which can be made accessible to all users.

CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C', CAST (NULL AS VARCHAR(128)), CAST (NULL AS VARCHAR(128)));

__3. Authorize all Big SQL users to read data from the SYSTOOLS.EXPLAIN_INSTANCE table created by the stored procedure you just executed.

grant select on systools.explain_instance to public;

__4. Capture the data access plan for a query. One way to do this is by prefixing the query with the command EXPLAIN PLAN WITH SNAPSHOT FOR. In this way, the query isn't executed, but the access plan is saved in the EXPLAIN tables. Run this command:

explain plan with snapshot for
select distinct product_key, introduction_date
from sls_product_dim;

Information about the data access strategy for this query is stored in the EXPLAIN tables, which you’ll explore shortly. There are various tools to view the "explained" access plan. In this lab, you will use a DB2 utility called db2exfmt, executed from the bash shell.

__5. If necessary, open a terminal window.
ANALYZE TABLE syntax

It’s best to include FOR COLUMNS and a list of columns in the ANALYZE TABLE command. Choose those columns found in your WHERE, ORDER BY, GROUP BY and DISTINCT clauses.
__6. Invoke db2profile to set up your environment.

. ~bigsql/sqllib/db2profile

__7. Invoke db2exfmt to retrieve the plan. Supply appropriate input parameters, including the name of your Big SQL database (e.g., -d bigsql), the user ID and password under which you executed the explain plan query (e.g., -u bigsql bigsql), and an output file for the plan itself (e.g., -o query1.exp). Adjust the example below as needed for your environment.

db2exfmt -d bigsql -u bigsql bigsql -o query1.exp

__8. When prompted for additional information (such as an EXPLAIN timestamp), accept defaults and hit Enter. Allow the operation to complete.

__9. Investigate the contents of the query1.exp file. For example, type

more query1.exp

Use the space bar to scroll forward through the output one page at a time, and enter b to page backward.

__10. Inspect the Original Statement and Optimized Statement sections of the plan. Sometimes, the optimizer will decide to rewrite the query in a more efficient manner -- for example, replacing IN lists with JOINs. In this case, the Optimized Statement shows that no further optimization has been done.

__11. Scroll to the Access Plan section. Notice the SORT operation and the total number of operations for this plan.
__12. Next, alter the table to include an informational primary key constraint on one of the table’s columns. From your query execution environment (e.g., JSqsh), execute the following alter command:

alter table sls_product_dim
add constraint newPK primary key (product_key)
not enforced;

This will alter the table to have a non-enforced PK constraint.

__13. Now collect the plan information for the same query on the altered table:

explain plan with snapshot for
select distinct product_key, introduction_date
from sls_product_dim;

__14. From a terminal window, invoke db2exfmt again, providing a different output file name (such as query2.exp):

db2exfmt -d bigsql -u bigsql bigsql -o query2.exp

__15. When prompted for additional information (such as an EXPLAIN timestamp), accept defaults and hit Enter. Allow the operation to continue.
__16. Investigate the contents of the query2.exp file. For example, type

more query2.exp

__17. Compare the original and optimized statements. Observe that the DISTINCT clause was removed from the optimized query. Because of the primary key constraint that you added, the optimizer determined that the DISTINCT clause (to eliminate duplicates) was unnecessary.

__18. Similarly, inspect the new Access Plan. Observe that no SORT operation is included and that there are fewer operations in total.
Lab 8 Developing and executing SQL user-defined functions

Big SQL enables users to create their own SQL functions that can be invoked in queries. User-defined functions (UDFs) promote code re-use and reduce query complexity. They can be written to return a single (scalar) value or a result set (table). Programmers can write UDFs in SQL or any supported programming languages (such as Java and C). For simplicity, this lab focuses on SQL UDFs.

After you complete this lab, you will understand how to:

• Create scalar and table UDFs written in SQL.
• Incorporate procedural logic in your UDFs.
• Invoke UDFs in Big SQL queries.
• Drop UDFs.

Allow 1 - 1.5 hours to complete this lab. Please note that this lab discusses only some of the capabilities of Big SQL scalar and table UDFs. For an exhaustive list of all the capabilities, please see the BigInsights knowledge center (http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.1.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html).

Prior to starting this lab, you must be familiar with how to use the Big SQL command line (JSqsh), and you must have created the sample GOSALESDW tables. If necessary, work through earlier lab exercises on these topics.

This UDF lab was originally developed by Uttam Jain (uttam@us.ibm.com) with contributions from Cynthia M. Saracco. Please post questions or comments to the forum on Hadoop Dev at https://developer.ibm.com/hadoop/support/.

8.1. Understanding UDFs

Big SQL provides many built-in functions to perform common computations. An example is dayname(), which takes a date/timestamp and returns the corresponding day name, such as Friday. Often, organizations need to perform some customized or complex operation on their data that’s beyond the scope of any built-in function. Big SQL allows users to embed their customized business logic inside a user-defined function (UDF) and write queries that call these UDFs.
As mentioned earlier, Big SQL supports two types of UDFs:

1. Scalar UDF: These functions take one or more values as input and return a single value as output. For example, a scalar UDF can take three values (price of an item, percent discount on that item, and percent sales tax) to compute the final price of that item.

2. Table UDF: These functions take one or more values as input and return a whole table as output. For example, a table UDF can take a single value (department-id) as input and return a table of employees who work in that department. This result set could have multiple columns, such as employee-id, employee-first-name, employee-last-name.
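The scalar/table distinction above can be sketched in a few lines of ordinary Python (illustrative only; Big SQL UDFs are written in SQL, Java, or C, not Python, and the function and field names here are hypothetical). The first function uses the same arithmetic as the new_final_price UDF you'll build shortly; the second returns a set of rows.

```python
# Scalar-style function: several inputs, one value out.
# Same arithmetic as the lab's gosalesdw.new_final_price UDF.
def final_price(quantity, unit_sale_price, discount_pct, tax_pct):
    return (quantity * unit_sale_price) * (1 - discount_pct / 100.0) * (1 + tax_pct / 100.0)

# Table-style function: one input plus a data set, a whole "table"
# (list of rows) out.
def employees_in(dept_id, employees):
    return [e for e in employees if e["dept"] == dept_id]

print(round(final_price(1, 10, 20, 8.75), 5))  # 8.7 -- matches the lab's VALUES test later on
staff = [{"id": 1, "dept": "A"}, {"id": 2, "dept": "B"}, {"id": 3, "dept": "A"}]
print(len(employees_in("A", staff)))  # 2
```

Queries invoke scalar UDFs wherever a single value is expected (SELECT list, WHERE clause), while table UDFs appear in the FROM clause, standing in for a table.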
Once created, UDFs can be incorporated into queries in a variety of ways, as you’ll soon see. In this lab, you will first set up your environment for UDF development and then explore how to create and invoke UDFs through various exercises. Ready to get started?

8.2. Preparing JSqsh to create and execute UDFs

In this section, you will set up your JSqsh environment for UDF development.

__1. If necessary, launch JSqsh using the connection to your bigsql database. (This was covered in an earlier lab.)

__2. Reset the default SQL terminator character to “@”:

set terminator = @;

Because some of the UDFs you will be developing involve multiple SQL statements, you must reset the JSqsh default termination character so that the semi-colon following each SQL statement in your UDF is not interpreted as the end of the CREATE FUNCTION statement for your UDF.

__3. Validate that the terminator was effectively reset:

set @

__4. Inspect the output from the command (a subset of which is shown below), and verify that the terminator property is set to @. You’re now ready to create your first Big SQL UDF.

8.3. Creating and executing a scalar UDF

In this section, you will create a scalar SQL UDF to compute the final price of a particular item that was sold. Your UDF will require several input parameters:

• unit sale price: Price of one item
• quantity: Number of units of this item being sold in this transaction
• % discount: Discount on the item (computed before tax)
• % sales-tax: Sales tax (computed after discount)
As you might expect, your UDF will return a single value – the final price of the item. After creating and registering the UDF, you will invoke it using some test values to ensure that it behaves correctly. Afterwards, you will invoke it in a query, passing in values from columns in a table as input to your function.

__1. Create a UDF named new_final_price in the gosalesdw schema:

CREATE OR REPLACE FUNCTION gosalesdw.new_final_price (
    quantity INTEGER,
    unit_sale_price DOUBLE,
    discount_in_percent DOUBLE,
    sales_tax_in_percent DOUBLE
)
RETURNS DOUBLE
LANGUAGE SQL
RETURN (quantity * unit_sale_price)
    * DOUBLE(1 - discount_in_percent / 100.0)
    * DOUBLE(1 + sales_tax_in_percent / 100.0) @

__2. Review the logic of this function briefly. The first line creates the function, which is defined to take four input parameters. The RETURNS clause indicates that a single (scalar) value of type DOUBLE will be returned. The function’s language is SQL. Finally, the last lines contain the function’s logic, which simply performs the arithmetic operations needed to calculate the final sales price of an item.

__3. After creating the function, test it using some sample values. A simple way to do this is with the VALUES clause shown here:

VALUES gosalesdw.new_final_price (1, 10, 20, 8.75) @

__4. Verify that the result returned by your test case is 8.70000.

__5. Next, use the UDF in a query to compute the final price for items listed in sales transactions in the SLS_SALES_FACT table. Note that this query uses values from two columns in the table as input for the quantity and unit price, and two user-supplied values as input for the discount rate and sales tax rate.

SELECT sales_order_key, quantity, unit_sale_price,
    gosalesdw.new_final_price(quantity, unit_sale_price, 20, 8.75) AS final_price
FROM sls_sales_fact
ORDER BY sales_order_key
FETCH FIRST 10 ROWS ONLY @

__6. Inspect the results.
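As a quick sanity check on the 8.70000 result, you can evaluate the same arithmetic inline with a VALUES statement. (This hand-check is an optional extra, not one of the original lab steps.)

```sql
-- (quantity * unit price) * (1 - discount%) * (1 + tax%)
-- = (1 * 10) * 0.80 * 1.0875
-- = 8.70
VALUES (1 * 10) * (1 - 20 / 100.0) * (1 + 8.75 / 100.0) @
```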
__7. Now invoke your UDF in the WHERE clause of a query. (Scalar UDFs can be included anywhere in a SQL statement that a scalar value is expected.) This query is similar to your previous query except that it includes a WHERE clause to restrict the result set to items with a final price greater than 7000.

-- A scalar UDF can be used wherever a scalar value is expected,
-- for example in a WHERE clause
SELECT sales_order_key, quantity, unit_sale_price,
    gosalesdw.new_final_price(quantity, unit_sale_price, 20, 8.75) AS final_price
FROM sls_sales_fact
WHERE gosalesdw.new_final_price(quantity, unit_sale_price, 20, 8.75) > 7000
ORDER BY sales_order_key
FETCH FIRST 10 ROWS ONLY @

__8. Note that your results no longer include rows with items priced at 7000 or below.

8.4. Optional: Invoking UDFs without providing a fully qualified name

In the previous exercises, you used the fully qualified UDF name (GOSALESDW.NEW_FINAL_PRICE) in your VALUES and SELECT statements. (GOSALESDW is the schema name and NEW_FINAL_PRICE is the function name.) A UDF with the same name and input parameters can be specified in more than one schema, so providing Big SQL with the fully qualified function name identifies the function you want to execute. With
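One common way to omit the schema qualifier, assuming Big SQL honors the DB2-style CURRENT PATH special register (an assumption for this sketch, not a step confirmed by this lab), is to add your schema to the function resolution path:

```sql
-- Add the GOSALESDW schema to the function resolution path
-- (assumes DB2-style CURRENT PATH support in Big SQL)
SET CURRENT PATH = CURRENT PATH, GOSALESDW @

-- The schema qualifier can then be omitted:
VALUES new_final_price(1, 10, 20, 8.75) @
```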
