Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL

This is the hands-on-lab document I created accompanying my presentation at the Information On Demand 2013 conference for Session Number 1687 - Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL.

*Contact me for data files*

This lab has three independent parts:
Part I - Creating Big SQL Tables and Loading Data
(exploring different ways to create and load HBase tables with Big SQL; includes an optional section on HBase access via JAQL)
Part II - Query Handling
(how to query HBase tables with Big SQL)
Part III - Connecting to Big SQL Server via JDBC
(using BIRT, a business intelligence and reporting tool, to run a simple report on a TPC-H orders table, showcasing the use of the Big SQL JDBC driver)


Hands on Lab - Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Session Number 1687
Piotr Pruski, IBM, piotr.pruski@ca.ibm.com (@ppruski)
Benjamin Leonhardi, IBM
Table of Contents
Lab Setup
Getting Started
Administering the Big SQL and HBase Servers
Part I - Creating Big SQL Tables and Loading Data
  Background
  One-to-one Mapping
    Adding New JDBC Drivers
  One-to-one Mapping with UNIQUE Clause
  Many-to-one Mapping (Composite Keys and Dense Columns)
    Why do we need many-to-one mapping?
    Data Collation Problem
  Many-to-one Mapping with Binary Encoding
  Many-to-one Mapping with HBase Pre-created Regions and External Tables
  Load Data: Error Handling
  [OPTIONAL] HBase Access via JAQL
Part II - A - Query Handling
  The Data
  Projection Pushdown
  Predicate Pushdown
    Point Scan
    Partial Row Scan
    Range Scan
    Full Table Scan
  Automatic Index Usage
  Pushing Down Filters into HBase
  Table Access Hints
    Accessmode
Part II - B - Connecting to Big SQL Server via JDBC
  Business Intelligence and Reporting via BIRT
Communities
Thank You!
Acknowledgements and Disclaimers
Lab Setup
This lab exercise uses the IBM InfoSphere BigInsights Quick Start Edition, v2.1. The Quick Start Edition uses a non-warranted program license and is not for production use. Its purpose is to let you experiment with the features of InfoSphere BigInsights while being able to use real data and run real applications. The Quick Start Edition puts no data limit on the cluster, and there is no time limit on the license.
The following table outlines the users and passwords that are pre-configured on the image:

  username   password
  --------   --------
  root       password
  biadmin    biadmin
  db2inst1   password

Getting Started
To prepare for the contents of this lab, you must go through the following process to start all of the Hadoop components.
1. Start the VMware image by clicking the "Power on this virtual machine" button in VMware Workstation if the VM is not already on.
2. Log into the VMware virtual machine using the following information: user: biadmin, password: biadmin.
3. Double-click the BigInsights Shell folder icon on the desktop of the Quick Start VM. This view provides quick links to the following functions that will be used throughout this exercise: Big SQL Shell, HBase Shell, Jaql Shell, and a Linux gnome-terminal.
4. Open the Terminal (gnome-terminal) and start the Hadoop components (daemons).
Linux Terminal
start-all.sh
Note: This command may take a few minutes to finish. Once all components have started successfully, as shown below, you may move to the next section.
...
[INFO] Progress - 100%
[INFO] DeployManager - Start; SUCCEEDED components: [zookeeper, hadoop, derby, hive, hbase, bigsql, oozie, orchestrator, console, httpfs]; Consumes : 174625ms

Administering the Big SQL and HBase Servers
BigInsights provides both command-line tools and a user interface to manage the Big SQL and HBase servers. In this section, we will briefly go over the user interface, which is part of the BigInsights Web Console.
1. Bring up the BigInsights web console by double-clicking the BigInsights WebConsole icon on the desktop of the VM and open the Cluster Status tab. Select HBase to view the status of the HBase master and region servers.
2. Similarly, click on Big SQL from the same tab to view its status.
3. Use the hbase-master and hbase-regionserver web interfaces to visualize tables, regions and other metrics. Go to the BigInsights Welcome tab and select "Access Secure Cluster Servers." You may need to enable pop-ups from the site when prompted. Alternatively, point your browser to the bottom two URLs noted in the image below.
Some interesting information available from the web interfaces:
- HBase root directory: this can be used to find the size of an HBase table.
- List of tables with descriptions.
- Per-table lists of regions with start and end keys: this information can be used to compact or split tables as needed.
- Metrics for each region server: these can be used to determine if there are hot regions which are serving the majority of requests to a table; such regions can be split. The metrics also help determine the effects and effectiveness of block cache, bloom filters and memory settings.
4. Perform a health check of HBase and Big SQL. This is different from the status checks done above: it verifies the health of the functionality. From the Linux gnome-terminal, issue the following commands.
Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh hbase
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check hbase
[INFO] Deployer - Try to start hbase if hbase service is stopped...
[INFO] Deployer - Double check whether hbase is started successfully...
[INFO] @bivm - hbase-master(active) started, pid 6627
[INFO] @bivm - hbase-regionserver started, pid 6745
[INFO] Deployer - hbase service started
[INFO] Deployer - hbase service is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [hbase]; Consumes : 26335ms
Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh bigsql
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check bigsql
[INFO] @bivm - bigsql-server already running, pid 6949
[INFO] Deployer - Ping Check Success: bivm/192.168.230.137:7052
[INFO] @bivm - bigsql is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [bigsql]; Consumes : 1121ms

Part I - Creating Big SQL Tables and Loading Data
In this part of the lab, our main goal is to demonstrate a migration of a table from a relational database to BigInsights using Big SQL over HBase. We will look at how HBase handles row keys and at some pitfalls that users may encounter when moving data from a relational database to HBase tables. We will also try some useful options, such as pre-creating regions, to see how they can help with data loading and queries, and we will explore various ways to load data.
Background
In this lab, we will use one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW), SLS_SALES_FACT. The details of the table, along with its primary key information, are depicted in the figure below.
[Figure: SLS_SALES_FACT table. Primary key columns (PK): ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PROMOTION_KEY, ORDER_METHOD_KEY. Other columns: SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY, QUANTITY, UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT.]
There is an instance of DB2 on this image containing this table, with data already loaded, which we will use in our migration. From the Linux gnome-terminal, switch to the DB2 instance user as shown below.
Linux Terminal
su - db2inst1
Note: The password for db2inst1 is password. Enter this when prompted.
As db2inst1, connect to the pre-created database, gosales.
Linux Terminal
db2 CONNECT TO gosales
Upon successful connection, you should see the following output on the terminal.

   Database Connection Information

 Database server        = DB2/LINUXX8664 10.5.0
 SQL authorization ID   = DB2INST1
 Local database alias   = GOSALES

Issue the following command to list all of the tables contained in this database.
Linux Terminal
db2 LIST TABLES

Table/View                Schema      Type  Creation time
------------------------- ----------- ----- --------------------------
SLS_SALES_FACT            DB2INST1    T     2013-08-22-14.51.27.228148
SLS_SALES_FACT_10P        DB2INST1    T     2013-08-22-14.54.01.622569
SLS_SALES_FACT_25P        DB2INST1    T     2013-08-22-14.55.46.416787

  3 record(s) selected.

Note: Here you will see three tables. Each one is essentially the same, with one key difference - the amount of data contained within them. The remaining instructions in this lab exercise use the SLS_SALES_FACT_10P table simply because it holds a smaller amount of data and will be faster to work with for demonstration purposes. If you would like to use the larger tables with more data, feel free to do so, but remember to change the names appropriately.
Examine how many rows we have in this table, so that later we can ensure everything is migrated properly. Issue the following select statement.
Linux Terminal
db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"
You should expect 44603 rows in this table.

1
-----------
      44603

  1 record(s) selected.

Use the following describe command to view all of the columns and data types contained within this table.
Linux Terminal
db2 "DESCRIBE TABLE sls_sales_fact_10p"
Column name          Data type schema  Data type name  Length  Scale  Nulls
-------------------- ----------------- --------------- ------- ------ -----
ORDER_DAY_KEY        SYSIBM            INTEGER         4       0      Yes
ORGANIZATION_KEY     SYSIBM            INTEGER         4       0      Yes
EMPLOYEE_KEY         SYSIBM            INTEGER         4       0      Yes
RETAILER_KEY         SYSIBM            INTEGER         4       0      Yes
RETAILER_SITE_KEY    SYSIBM            INTEGER         4       0      Yes
PRODUCT_KEY          SYSIBM            INTEGER         4       0      Yes
PROMOTION_KEY        SYSIBM            INTEGER         4       0      Yes
ORDER_METHOD_KEY     SYSIBM            INTEGER         4       0      Yes
SALES_ORDER_KEY      SYSIBM            INTEGER         4       0      Yes
SHIP_DAY_KEY         SYSIBM            INTEGER         4       0      Yes
CLOSE_DAY_KEY        SYSIBM            INTEGER         4       0      Yes
QUANTITY             SYSIBM            INTEGER         4       0      Yes
UNIT_COST            SYSIBM            DECIMAL         19      2      Yes
UNIT_PRICE           SYSIBM            DECIMAL         19      2      Yes
UNIT_SALE_PRICE      SYSIBM            DECIMAL         19      2      Yes
GROSS_MARGIN         SYSIBM            DOUBLE          8       0      Yes
SALE_TOTAL           SYSIBM            DECIMAL         19      2      Yes
GROSS_PROFIT         SYSIBM            DECIMAL         19      2      Yes

  18 record(s) selected.

One-to-one Mapping
In this section, we will use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach; however, the goal of this exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping.
Big SQL supports both one-to-one and many-to-one mappings. In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single SQL column. In the following example, the HBase row key is mapped to the SQL column id. Similarly, the cq_name column within the cf_data column family is mapped to the SQL column name, and so on.
To begin, first create a schema to keep our tables organized. Open the BigSQL Shell from the BigInsights Shell folder on the desktop and use the create schema command to create a schema named gosalesdw.
BigSQL Shell
CREATE SCHEMA gosalesdw;
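The shape of a one-to-one mapping can be sketched in a few lines of Python. This is purely illustrative (the function and sample values are hypothetical, not Big SQL's implementation): the key column becomes the HBase row key, and every other SQL column becomes its own cf_data:cq_* cell.

```python
# Sketch of a one-to-one mapping: each non-key SQL column becomes its
# own HBase cell under a single column family (illustrative only).
def map_one_to_one(row, key_col, family="cf_data"):
    """Map a SQL row (dict) to an HBase (rowkey, cells) pair."""
    rowkey = str(row[key_col])
    cells = {
        f"{family}:cq_{col}": str(val)
        for col, val in row.items()
        if col != key_col
    }
    return rowkey, cells

# Hypothetical sample row with a subset of the table's columns.
sql_row = {"ORDER_DAY_KEY": 20070720, "ORGANIZATION_KEY": 11171, "QUANTITY": 338}
rowkey, cells = map_one_to_one(sql_row, "ORDER_DAY_KEY")
print(rowkey)         # 20070720
print(sorted(cells))  # one cell per non-key column
```

Note that every cell carries the full family:qualifier name; this is the verbosity that later sections measure.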
Issue the following command in the same BigSQL shell. This DDL statement creates the SQL table with a one-to-one mapping of what we have in our relational DB2 source. Notice all the column names are the same, with the same data types. The column mapping section requires a mapping for the row key. HBase columns are identified using family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2),
  UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double,
  SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key mapped by (ORDER_DAY_KEY),
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);
Big SQL supports a load from source command that can be used to load data from warehouse sources, which we will use first. It also supports loading data from delimited files using a load hbase command, which we will use later.
Adding New JDBC Drivers
The load from source command uses Sqoop internally to do the load. Therefore, before using the load command from a BigSQL shell, we first need to add the driver for the JDBC source to 1) the Sqoop library directory, and 2) the JSQSH terminal shared directory.
From a Linux gnome-terminal, issue the following command (as biadmin) to add the JDBC driver JAR file used to access the database to the $SQOOP_HOME/lib directory.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib
From the BigSQL shell, examine the drivers currently loaded for the JSQSH terminal.
BigSQL Shell
drivers
Terminate the BigSQL shell with the quit command.
BigSQL Shell
quit
Copy the same DB2 driver to the JSQSH share directory with the following command.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/
When a user adds new drivers, the Big SQL server must be restarted. You can do this either from the web console, or with the following command from the Linux gnome-terminal.
Linux Terminal
stop.sh bigsql && start.sh bigsql
Open the BigSQL Shell from the BigInsights Shell folder on the desktop once again (it was closed in an earlier step with the quit command) and check that the driver was in fact loaded into JSQSH.
BigSQL Shell
drivers
Now that the drivers have been set up, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table. Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we defined in BigSQL.
BigSQL Shell
LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
  WITH PARAMETERS (user = 'db2inst1', password = 'password')
  FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
  INTO HBASE TABLE gosalesdw.sls_sales_fact APPEND;
You should expect to load 44603 rows, which is the same number of rows that the select count statement on the original DB2 table verified earlier.
44603 rows affected (total: 1m37.74s)
Try to verify this with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;
Notice there is a discrepancy between the results from the load operation and the select count statement.
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)
Also verify from an HBase shell. Open the HBase Shell from the BigInsights Shell folder on the desktop and issue the following count command to verify the number of rows.
HBase Shell
count 'gosalesdw.sls_sales_fact'
It should be apparent that the results from the Big SQL statement and the HBase command conform to one another.
33 row(s) in 0.7000 seconds
However, this does not yet explain why there is a mismatch between the number of loaded rows and the number of retrieved rows when we query the table. The load (and insert, to be examined later) command behaves like an upsert: if a row with the same row key exists, HBase writes the new value as a new version for that column/cell. When querying the table, only the latest value is returned by Big SQL. In many cases, this behaviour can be confusing. In our case, we loaded data with repeating values for the row key from a DB2 table with 44603 rows, and the load reported 44603 rows affected; select count(*), however, showed far fewer rows, 33 to be exact. No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as we did.
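The upsert behaviour described above can be modelled with a short sketch. This toy model (not actual HBase code) keeps a list of versions per (row key, column); loading several rows that share a row key leaves only one queryable row, whose visible value is the newest version:

```python
from collections import defaultdict

# Toy model of HBase cell versioning: each (rowkey, column) pair keeps a
# stack of versions; a query sees only the newest one.
store = defaultdict(list)  # (rowkey, column) -> [oldest, ..., newest]

def put(rowkey, column, value):
    store[(rowkey, column)].append(value)  # upsert: new version, never an error

# "Load" three rows that all share the same ORDER_DAY_KEY-based row key
# (hypothetical organization keys, for illustration).
for org in (11169, 11170, 11171):
    put("20070720", "cf_data:cq_ORGANIZATION_KEY", org)

distinct_rows = {rk for rk, _ in store}
latest = store[("20070720", "cf_data:cq_ORGANIZATION_KEY")][-1]
print(len(distinct_rows), latest)  # only 1 row is visible; latest version wins
```

Three puts succeed, yet a count sees a single row, which is exactly the 44603-versus-33 discrepancy in miniature.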
Now that we understand that all the rows are actually versioned in HBase, we can examine a possible way to retrieve all versions of a particular row.
First, from the BigSQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = 20070720;
As expected, we retrieve only one row, which is the latest (newest) version of the row inserted into HBase with the specified order day key.
+------------------+
| organization_key |
+------------------+
| 11171            |
+------------------+
1 row in results
Using the HBase shell, we can retrieve previous versions for a row key. Use the following get command to get the top 4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Since the previous command specified only 4 versions (VERSIONS => 4), we retrieve only 4 rows in the output.
COLUMN                       CELL
cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546430, value=11171
cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546429, value=11171
cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546428, value=11171
cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds
Optionally, try the same command again specifying a larger version number, for example VERSIONS => 100. Either way, this is most likely not the behaviour users expect when performing such a migration: they probably wanted to get all the data into the HBase table without versioned cells. There are a couple of solutions for this. One is to define the table with a composite row key to enforce uniqueness, which will be explored later in this lab. Another option, outlined in the next section, is to force each row key to be unique by appending a UUID.
One-to-one Mapping with UNIQUE Clause
Another option while performing such a migration is to use the force key unique option when creating the table using BigSQL syntax. This option forces the load to add a UUID to the row key, which prevents versioning of cells. However, this method is quite inefficient, as it stores more data and also makes queries slower.
Issue the following command in the BigSQL shell. This statement creates the SQL table with the one-to-one mapping of what we have in our relational DB2 source. The DDL is almost identical to that of the previous section, with one exception: the force key unique clause is specified in the column mapping of the row key.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2),
  UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double,
  SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key mapped by (ORDER_DAY_KEY) force key unique,
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);
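What force key unique buys us can be sketched as follows. Appending a UUID suffix makes every incoming row key unique, so nothing is versioned away; the trade-off is that an exact lookup on the bare key no longer matches, and a range or prefix scan is needed instead. The \x00 separator here is an assumption for illustration, not Big SQL's actual suffix encoding:

```python
import uuid

# Three rows sharing the same ORDER_DAY_KEY (hypothetical values).
rows = [("20070720", 11169), ("20070720", 11170), ("20070720", 11171)]

# Append a UUID so every row key is unique (illustrative suffix scheme).
table = {f"{day}\x00{uuid.uuid4().hex}": org for day, org in rows}

assert len(table) == 3          # no upsert collisions: all rows survive
assert "20070720" not in table  # an exact get on the bare key misses

# A scan by key prefix still recovers all the rows for that day.
matches = [v for k, v in table.items() if k.startswith("20070720")]
print(sorted(matches))
```

This mirrors the lab's observation below: the get on '20070720' returns zero rows, while the prefix scan returns every row for that date.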
In the previous section, we used the load from source command to get the data from our table on the DB2 source into HBase. This may not always be feasible, which is why in this section we explore another loading statement, load hbase. This loads data into HBase from flat files - perhaps an export of the data from the relational source. Issue the following statement, which loads data from a file into an InfoSphere BigInsights HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
  DELIMITED FIELDS TERMINATED BY '\t'
  INTO TABLE gosalesdw.sls_sales_fact_unique;
Note: The load hbase command can take an optional list of columns. If no column list is specified, it uses the column ordering in the table definition. The input file can be on DFS or on the local file system where the Big SQL server is running.
Once again, you should expect to load 44603 rows, the same number of rows that the select count statement on the original DB2 table verified.
44603 rows affected (total: 26.95s)
Verify the number of rows loaded with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;
This time there is no discrepancy between the results from the load operation and the select count statement.
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)
Issue the same count from the HBase shell to be sure.
HBase Shell
count 'gosalesdw.sls_sales_fact_unique'
The counts are consistent across the load, the select count, and the HBase count.
...
44603 row(s) in 6.8490 seconds
As in the previous section, from the BigSQL shell, issue the following select query with a predicate on the order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = 20070720;
In the previous section, only one row was returned for the specified date. This time, expect to see 1405 rows: the row keys are now forced to be unique by the clause in the create statement, so no versioning is applied.
1405 rows in results(first row: 0.47s; total: 0.58s)
Once again, as in the previous section, we can check from the HBase shell whether there are multiple versions of the cells. Issue the following get statement to attempt to retrieve the top 4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Zero rows are returned because the row key 20070720 does not exist; we have appended a UUID to each row key (20070720 + UUID).
COLUMN   CELL
0 row(s) in 0.0850 seconds
Therefore, instead, issue the following HBase command to do a scan rather than a get. This scans the table using the first part of the row key. We also indicate scanner specifications of start and stop row values, to return only the results we are interested in retrieving.
HBase Shell
scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720', STOPROW => '20070721'}
Notice there are no discrepancies between the results from the Big SQL select and the HBase scan.
1405 row(s) in 12.1350 seconds
Many-to-one Mapping (Composite Keys and Dense Columns)
This section is dedicated to the other option for enforcing uniqueness of the cells: defining a table with a composite row key (a.k.a. many-to-one mapping). In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (a row key or a column). Two terms will be used frequently: composite key and dense column. A composite key is an HBase row key that is mapped to multiple SQL columns.
A dense column is an HBase column that is mapped to multiple SQL columns. In the following example, the row key contains two parts - userid and account number - and each part corresponds to a SQL column. Similarly, the HBase columns are mapped to multiple SQL columns. Note that we can have a mix: for example, a composite key, a dense column and a non-dense column, or any combination of these.
[Figure: example many-to-one mapping. The HBase row key '11111_ac11' maps to the SQL columns userid and acc_no. The dense column cf_data:cq_names holds 'fname1_lname1', mapping to the SQL columns first_name and last_name. The dense column cf_data:cq_acct holds '11111#11#0.25', mapping to the SQL columns balance, min_balance and interest.]
Issue the following DDL statement from the BigSQL shell, which represents all entities from our relational table using a many-to-one mapping. Take notice of the column mapping section, where multiple columns can be mapped to a single family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2),
  UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double,
  SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                 RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                 PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_OTHER_KEYS    mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY      mapped by (QUANTITY),
  cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                      GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);
Why do we need many-to-one mapping?
HBase stores a lot of information for each value: alongside every value, a key consisting of the row key, column family name, column qualifier and timestamp is also stored. This means a lot of duplicate information is kept. HBase is very verbose, and it is primarily intended for sparse data; in most cases, data in the relational world is not sparse. If we were to store each SQL column individually in HBase, as in our previous two sections, the required storage space would grow dramatically.
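The composite key and dense column ideas just defined can be pictured as delimiter-joined encodings of several SQL values into one HBase key or cell. The \x00 separator and the sample values below are illustrative assumptions; Big SQL's on-disk encoding differs, but the round trip is the same idea:

```python
SEP = "\x00"  # hypothetical separator between the packed parts

def encode(parts):
    """Pack several SQL values into one HBase key or cell value."""
    return SEP.join(str(p) for p in parts)

def decode(blob):
    """Recover the individual SQL values from the packed form."""
    return blob.split(SEP)

# Composite row key: several key columns packed into one HBase row key
# (sample ORDER_DAY_KEY / ORGANIZATION_KEY / EMPLOYEE_KEY values).
rowkey = encode([20070720, 11171, 4428])

# Dense column: several non-key columns packed into one HBase cell
# (hypothetical SALES_ORDER_KEY / SHIP_DAY_KEY / CLOSE_DAY_KEY values).
dense = encode([651512, 20070723, 20070731])

print(decode(rowkey))  # ['20070720', '11171', '4428']
print(decode(dense))   # ['651512', '20070723', '20070731']
```

One packed key or cell pays the HBase per-cell metadata cost once, instead of once per SQL column.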
When querying that data back, the query also returns the entire key (the row key, column family, and column qualifier) for each value. As an example, after loading data into this table, we will examine the storage space for each of the three tables created thus far.
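Before measuring, it helps to put rough numbers on the per-cell overhead. Every stored value carries the row key, family name, qualifier and timestamp again, so the one-to-one table pays that overhead 17 times per row while the dense table pays it only 3 times. The byte sizes below are illustrative assumptions, not measurements from the lab image:

```python
# Rough per-row metadata estimate (illustrative byte sizes, not measured).
ROWKEY, FAMILY, QUALIFIER, TIMESTAMP = 8, 7, 12, 8
per_cell_overhead = ROWKEY + FAMILY + QUALIFIER + TIMESTAMP  # repeated per cell

one_to_one_cells = 17  # 18 SQL columns, one of which maps to the row key
dense_cells = 3        # cq_OTHER_KEYS, cq_QUANTITY, cq_DOLLAR_VALUES

print(one_to_one_cells * per_cell_overhead)  # 595 bytes of repeated metadata per row
print(dense_cells * per_cell_overhead)       # 105 bytes per row
```

Even with these made-up sizes, the dense layout cuts the repeated metadata by more than 5x per row, which is consistent with the directory sizes observed later in this section.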
As in the previous section, issue the following statement, which loads data from a file into the InfoSphere BigInsights HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
  DELIMITED FIELDS TERMINATED BY '\t'
  INTO TABLE gosalesdw.sls_sales_fact_dense;
Notice that the number of rows loaded into a table with many-to-one mapping remains the same, even though we are storing less data. This statement also executes much faster than the previous load, for exactly this reason.
44603 rows affected (total: 3.42s)
Issue the same statements and commands from both the BigSQL and HBase shells as in the previous two sections to verify that the number of rows is the same as in the original dataset. All of the results should match those of the previous section.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}
1405 row(s) in 4.3830 seconds
As noted earlier, one-to-one mapping uses far more storage space than the same data mapped using composite keys or dense columns, where the HBase row key or HBase column(s) are made up of multiple relational table columns. This is because HBase repeats the row key, column family name, column name and timestamp for each column value. For relational data, which is usually dense, this causes an explosion in the required storage space. Issue the following command as biadmin from a Linux gnome-terminal to check the directory sizes for the three tables created thus far.
Linux Terminal
hadoop fs -du /hbase/
…
17731926   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188       hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
…

Notice that the dense table is significantly smaller than the others. The table in which we forced uniqueness is the largest, since it needs to append a UUID to each row key.

Data Collation Problem

All data represented thus far has been stored as strings, which is the default encoding on HBase tables created by Big SQL. Therefore, numeric data is not collated correctly: HBase uses lexicographic byte ordering, so you may run into cases where a query returns wrong results. The following scenario walks through a situation where data is not collated correctly.

Using the Big SQL insert into hbase statement, add the following row to the sls_sales_fact_dense table we previously defined and loaded data into. Notice that the date we are specifying as part of the ORDER_DAY_KEY column (which has data type int) is a larger numerical value and does not conform to any date standard, since it contains an extra digit.

BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (200707201, 11171, 4428, 7109, 5588, 30265, 5501, 605);

Note: The insert command is available for HBase tables. However, it is not a supported feature.

Issue a scan on the table with the following start and stop criteria.

HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}

Take notice of the last three rows/cells returned from the output of this scan. The newly added row shows up in the scan even though its integer value is not between 20070720 and 20070721.
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
1406 row(s) in 4.2400 seconds

Now insert another row into the table with the following command. This time we are conforming to the date format of YYYYMMDD and incrementing the day by 1 from the last value returned in the table; i.e., 20070721.

BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (20070721, 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue another scan on the table. Keep in mind to increase the stoprow criterion by 1 day.

HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070722'}

Now notice that the newly added row is included as part of the result set, and that the row with ORDER_DAY_KEY of 200707201 appears before the row with ORDER_DAY_KEY of 20070721. This is an example of numeric data that is not collated properly: the rows are not stored in numerical order as one might expect, but rather in byte-lexicographic order.
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692480966, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692480966, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692480966, value=
1407 row(s) in 2.8840 seconds

Many-to-one Mapping with Binary Encoding
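The collation behavior above, and why a sortable binary encoding fixes it, can be reproduced in a few lines. This is a simplified model for illustration, not Big SQL's actual encoder; the sign-bit flip is one common way to make signed integers sort correctly as unsigned bytes.

```python
import struct

keys = [20070720, 200707201, 20070721, -5]

# String encoding sorts byte-wise: the 9-digit 200707201 lands between
# 20070720 and 20070721 because '0' < '1' at the eighth character.
string_order = [int(k) for k in sorted(str(k) for k in keys if k > 0)]

def sortable(n):
    """Fixed-width big-endian bytes with the sign bit flipped, so negative
    numbers sort below positive ones (a common sortable int encoding)."""
    return struct.pack(">I", (n + 2**31) & 0xFFFFFFFF)

binary_order = sorted(keys, key=sortable)

print(string_order)  # [20070720, 200707201, 20070721] <- wrong numeric order
print(binary_order)  # [-5, 20070720, 20070721, 200707201] <- correct order
```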
Big SQL supports two types of data encodings: string and binary. Each HBase entity can also have its own encoding. For example, a row key can be encoded as a string, one HBase column can be encoded as binary, and another as string. String is the default encoding used in Big SQL HBase tables: the value is converted to a string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit the data. The default separator is the null byte; as it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly. Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates properly. It handles separators internally and avoids the issue of separators existing within the data by escaping them.

Issue the following DDL statement from the BigSQL shell to create a dense table as we did in the previous section, but this time overriding the default encoding to binary.

BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY ( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) ) COLUMN MAPPING ( key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY), cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY), cf_data:cq_QUANTITY mapped by (QUANTITY), cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) ) default encoding binary;

Once again, use the load hbase data command to load the data into the table. This time we are adding the DISABLE WAL clause.
By using the option to disable the WAL (write-ahead log), writes into HBase can be sped up. However, this is not a safe option: turning off the WAL can result in data loss if a region server crashes. Another possible option to speed up the load is to increase the write buffer size.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt' DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE gosalesdw.sls_sales_fact_dense_binary DISABLE WAL;
44603 rows affected (total: 5.54s)

Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.

BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense_binary go -m discard;

Note: The "go -m discard" option is used so that the results of the command will not be displayed in the terminal.

44603 rows in results(first row: 0.35s; total: 2.89s)

Issue the same select statement on the previous table that has string encoding, sls_sales_fact_dense.

BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense go -m discard;

44605 rows in results(first row: 0.31s; total: 3.1s)

One main point to see here is that the query against the binary-encoded table can return faster. (Numeric types are also collated properly.)

Note: You will probably not see much, if any, performance difference in this lab exercise, since we are working with such a small dataset.

There is no custom serialization/deserialization logic required for string encoding. This makes it portable in case one wants to use another application to read the data in HBase tables. A main use case for string encoding is mapping existing data: delimited data is a very common form of storing data, and it can be easily mapped using Big SQL string encoding. However, parsing strings is expensive, and queries over data encoded as strings are slow. Also, as seen, numeric data is not collated correctly. Queries on data encoded as binary have faster response times, and numeric data, including negative numbers, is collated correctly with binary encoding. The downside is that the data is encoded by Big SQL logic and may not be portable as-is.

Many-to-one Mapping with HBase Pre-created Regions and External Tables

HBase automatically handles splitting regions when they reach a set size limit. In some scenarios, like bulk loading, it is more efficient to pre-create regions so that the load operation can take place in parallel.
The sales data covers 4 months, April through July of the year 2007. We can pre-create regions by specifying splits in the create table command.
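How split points partition the key space can be sketched as follows (`region_for` is a hypothetical helper for illustration, not HBase client code):

```python
# Sketch of how four split points carve the row-key space into five regions.
import bisect

splits = ["200704", "200705", "200706", "200707"]

def region_for(row_key):
    """Region 0 holds keys below '200704', region 1 keys in
    ['200704', '200705'), ..., region 4 keys from '200707' upward."""
    return bisect.bisect_right(splits, row_key)

# April rows land in region 1 and July rows in region 4, so a bulk load of
# April-July data is spread across region servers instead of hitting one.
print(region_for("20070418-11114"), region_for("20070720-11171"))
```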
In this section, we will create a table within the HBase shell with pre-defined splits, not using any Big SQL features at first. Then we will showcase how users can map existing data in HBase to Big SQL, which can prove to be a very common practice. This is made possible by creating what are called external tables.

Start by issuing the following statement in the HBase shell. This will create the sls_sales_fact_dense_split table with pre-defined region splits for April through July in 2007.

HBase Shell
create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'}, {SPLITS => ['200704', '200705', '200706', '200707']}

Issue the following list command in the HBase shell to verify the newly created table.

HBase Shell
list

Note that if we were to list the tables from the Big SQL shell, we would not see this table, because we have not yet made any association to Big SQL. Open and point a browser to the following URL: http://bivm:60010/. Scroll down and click on the table we just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split.
Examine the pre-created regions for this table, as we defined them when creating the table. Execute the following create external hbase command to map the existing table we have just created in HBase to Big SQL. Some things to note about the command: The create table statement allows specifying a different name for the SQL table through the hbase table name clause. Using external tables, you can also create multiple views of the same HBase table; for example, one table can map to a few columns and another table to another set of columns. Notice that the column mapping section of the create table statement allows specifying a different separator for each column and for the row key. Another place where external tables can be used is to map tables created using the Hive HBase storage handler; these cannot be directly read using the Big SQL storage handler.

BigSQL Shell
CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT ( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) ) COLUMN MAPPING ( key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) SEPARATOR '-', cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY) SEPARATOR '/', cf_data:cq_QUANTITY mapped by (QUANTITY), cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR '|' ) HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

The data in external tables is not validated at creation time. For example, if a column in the external table contains data with incorrectly defined separators, the query results will be unpredictable.

Note: External tables are not owned by Big SQL and hence cannot be dropped via Big SQL. Also, secondary indexes cannot be created via Big SQL on external tables.

Use the following command to load the external table we have defined.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt' DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE gosalesdw.external_sls_sales_fact_dense_split;

44603 rows affected (total: 1m57.2s)

Verify that the number of rows loaded is also the number of rows returned by querying the external SQL table.

BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)

Verify the same from the HBase shell directly on the underlying HBase table.

HBase Shell
count 'gosalesdw.sls_sales_fact_dense_split'
...
44603 row(s) in 9.1620 seconds

Issue a get command from the HBase shell, specifying the row key as follows. Notice that the separator between each part of the row key is a "-", which is what we defined when originally creating the external table.

HBase Shell
get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'

In the following output you can also see the other separators we defined for the external table: "|" for cq_DOLLAR_VALUES, and "/" for cq_OTHER_KEYS.

COLUMN CELL
cf_data:cq_DOLLAR_VALUES timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
cf_data:cq_OTHER_KEYS timestamp=1376690502630, value=481896/20070723/20070723
cf_data:cq_QUANTITY timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds

Of course, in Big SQL we don't need to specify separators such as "-" when querying the table, as with the command below.

BigSQL Shell
SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY = 4428 AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = 30263 AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;

Load Data: Error Handling

In this final section of this part of the lab, we will examine how to handle errors during the load operation. The load hbase command has an option to continue past errors. The LOG ERROR ROWS IN FILE clause can be used to specify a file name to log any rows that could not be loaded
because of errors. Some of the common errors are invalid numeric types, and a separator existing within the data for string encoding.

Linux Terminal
hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a    11171  … … …
b0070720    11171  … … …
2007-07-20  11171  … … …
20070720    11-71  … … …
20070721    11171  … … …

Note that a separator appearing within the data is an issue with string encoding. Knowing there are errors in the input data, proceed to issue the following load command, specifying a directory and file where the "bad" rows should be put.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt' DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE gosalesdw.external_sls_sales_fact_dense_split LOG ERROR ROWS IN FILE '/tmp/SLS_SALES_FACT_load.err';

In this example, 4 rows did not get loaded because of errors. Note that load reports only the rows that passed through it.

1 row affected (total: 2.74s)

Examine the file specified in the load command to view the rows which were not loaded.

Linux Terminal
hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"b0070720","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"2007-07-20","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"20070720","11-71","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"

[OPTIONAL] HBase Access via JAQL

Jaql has an HBase module that can be used to create HBase tables, insert data into them, and query them efficiently using multiple modes: a local mode that accesses HBase directly, as well as a map reduce mode. It allows specifying query optimization options similar to what is available in the hbase shell. The capability to transparently use map reduce jobs makes it work well with bigger tables.
At the same time, users can force local mode when they run point or
range queries. It allows use of a SQL language subset termed Jaql SQL, which provides the capability to join tables and to perform grouping and other aggregations on them. It also provides access to data from different sources, such as relational DBMSs, and different formats, like delimited files, Avro, and anything else that is supported by Jaql. The results of a query can be written in different formats to HDFS and read by other BigInsights applications, like BigSheets, for further analysis. In this section, we'll first pull information from our relational DBMS and then go over use of the Jaql HBase module, specifically the additional features that it provides.

Start by opening a Jaql shell. You can open the same (JSQSH) terminal that was used for Big SQL by adding the "--jaql" option as shown below. This is a much better environment to work with than the standard Jaql shell, as it provides features like recalling the previous command with the up arrow key, and you can also traverse through your commands using the left/right arrow keys.

Linux Terminal
/opt/ibm/biginsights/bigsql/jsqsh/bin/jsqsh --jaql;

Once in the JSQSH shell with the Jaql option, load the dbms::jdbc module with the following command.

BigSQL/JAQL Shell
import dbms::jdbc;

Add the JDBC driver JAR file to the classpath.

BigSQL/JAQL Shell
addRelativeClassPath(getSystemSearchPath(), '/opt/ibm/db2/V10.5/java/db2jcc.jar');

Supply the connection information.

BigSQL/JAQL Shell
db := jdbc::connect( driver = 'com.ibm.db2.jcc.DB2Driver', url = 'jdbc:db2://localhost:50000/gosales', properties = {user: "db2inst1", password: "password"} );

Specify the rows to be retrieved with a SQL select statement.

BigSQL/JAQL Shell
DESC := jdbc::prepare( db, query = "SELECT * FROM db2inst1.sls_sales_fact_10p");

In the many-to-one mapping for the row key, we went over creation of a composite key. In the next few steps, we will use Jaql to load the same data using a composite key and dense columns.
We'll pack all the columns that make up the primary key of the relational table into an HBase row key, and we'll also pack the other columns into dense HBase columns. Define a variable to read the original data from the relational JDBC source. This converts each tuple of the table into a JSON record.
BigSQL/JAQL Shell
ssf = localRead(DESC);

Transform the record into the required format. Essentially we are doing the same procedure as when we defined the many-to-one mapping in the previous sections. For the first element, which we will use as the HBase row key, concatenate the values of the columns that form the primary key of the sales fact table using a "-" separator. Pack the remaining columns into the other dense HBase columns: cq_OTHER_KEYS (using a "/" separator), cq_QUANTITY, and cq_DOLLAR_VALUES (using a "|" separator).

BigSQL/JAQL Shell
ssft = ssf -> transform [$."ORDER_DAY_KEY", $."ORGANIZATION_KEY", $."EMPLOYEE_KEY", $."RETAILER_KEY", $."RETAILER_SITE_KEY", $."PRODUCT_KEY", $."PROMOTION_KEY", $."ORDER_METHOD_KEY", $."SALES_ORDER_KEY", $."SHIP_DAY_KEY", $."CLOSE_DAY_KEY", $."QUANTITY", $."UNIT_COST", $."UNIT_PRICE", $."UNIT_SALE_PRICE", $."GROSS_MARGIN", $."SALE_TOTAL", $."GROSS_PROFIT"] -> transform { key: strcat($[0],"-",$[1],"-",$[2],"-",$[3],"-",$[4],"-",$[5],"-",$[6],"-",$[7]), cf_data: { cq_OTHER_KEYS: strcat($[8],"/",$[9],"/",$[10]), cq_QUANTITY: strcat($[11]), cq_DOLLAR_VALUES: strcat($[12],"|",$[13],"|",$[14],"|",$[15],"|",$[16],"|",$[17]) } };

Verify the data is in the correct format by querying the first record.

BigSQL/JAQL Shell
ssft -> top 1;
{ "key": "20070418-11114-4415-7314-5794-30124-5501-605", "cf_data": { "cq_OTHER_KEYS": "254121/20070423/20070423", "cq_QUANTITY": "60", "cq_DOLLAR_VALUES": "610.00m|1359.72m|1291.73m|0.5278|77503.80m|40903.80m" } }
(1 row in 2.40s)

Now we have the data ready to be written into HBase. First import the hbase module, which prepares Jaql by loading the required jars and setting up the environment using the HBase configuration files.

BigSQL/JAQL Shell
import hbase(*);

Use hbaseString to define a schema for the HBase table. The HBase table does not get created until something is written into it. An array of records that match the specified schema
should be used to write into the HBase table. The data types correspond to how Jaql will interpret the data.

BigSQL/JAQL Shell
SSFHT = hbaseString('sales_fact2', schema { key: string, cf_data?: {*: string}}, create=true, replace=true, rowBatchSize=10000, colBatchSize=200 );

Note: As this could be a big table, specify rowBatchSize and colBatchSize, which will be used for scanner caching and the column batch size by the internal HBase scan object. The column batch size is useful when rows contain a huge number of columns.

Write to the table using the previously created ssft array, which matches the specified schema.

BigSQL/JAQL Shell
ssft -> write(SSFHT);

A write operation will create the HBase table and populate it with the input data. To confirm, use the HBase shell to count (or scan) the table and verify the data was written with the right number of rows.

HBase Shell
count 'sales_fact2'
44603 row(s) in 3.6230 seconds

To read the contents of the HBase table using Jaql, use read on the hbaseString. In the following command we also pass the read directly into a count function to verify the right number of rows.

BigSQL/JAQL Shell
count(read(SSFHT));
44603

To query for rows matching a particular order day key, 20070720, use setKeyRange for a partial range query. Use localRead for point and range queries, as Jaql is tuned for local execution and performs them efficiently.

BigSQL/JAQL Shell
localRead(SSFHT -> setKeyRange('20070720', '20070721'));

Perform the same query using the HBase shell. Both complete in a similar amount of time.

HBase Shell
scan 'sales_fact2', {STARTROW => '20070720', STOPROW => '20070721', CACHE => 10000}
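Why the range ['20070720', '20070721') selects exactly one day's rows can be sketched with plain string comparison: the scan range is half-open (STARTROW inclusive, STOPROW exclusive) and the composite key begins with the 8-digit order day. The keys below are illustrative.

```python
# Every real key for 20070720 sorts at or above the bare prefix '20070720'
# and below the next day's prefix '20070721', so the half-open range
# captures exactly one day.
keys = [
    "20070719-11171-4428-7109-5588-30263-5501-605",
    "20070720-11114-4415-7314-5794-30124-5501-605",
    "20070720-11171-4428-7109-5588-30263-5501-605",
    "20070721-11171-4428-7109-5588-30265-5501-605",
]

start, stop = "20070720", "20070721"
hits = [k for k in keys if start <= k < stop]
print(hits)  # only the two 20070720-* keys
```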
To query for a row when we have the values for all primary key columns, we can construct the entire row key and perform a point query.

BigSQL/JAQL Shell
localRead(SSFHT -> setKey('20070720-11171-4428-7109-5588-30263-5501-605'));

Equivalently, this is what the statement would look like from the HBase shell.

HBase Shell
get 'sales_fact2', '20070720-11171-4428-7109-5588-30263-5501-605'

To use a filter from Jaql, use the setFilter function along with addFilter. In the case below, the predicate is on the sales order key, which is the leading part of the dense column cq_OTHER_KEYS and hence can be used in a predicate.

BigSQL/JAQL Shell
read(SSFHT -> setFilter([addFilter(filterType.SingleColumnValueFilter, HBaseKeyArrayToBinary(["481896/"]), compareOp.equal, comparators.BinaryPrefixComparator, "cf_data", "cq_OTHER_KEYS", true ) ]) );

PART II – A – Query Handling

Efficiently querying HBase requires pushing as much work to the server(s) as possible. This includes projection pushdown, or fetching the minimal set of columns that are required by the query. It also includes pushing down query predicates to the server as scan limits, filters, index lookups, etc. Setting scan limits is extremely powerful, as it can help narrow down the regions we need to scan. With a full row key, HBase can quickly pinpoint the region and the row. With partial keys and key ranges (upper limits, lower limits, or both), HBase can narrow down regions or eliminate regions which fall outside the range. Indexes help to leverage this key lookup, but they use two tables to achieve it. Filters cannot eliminate regions, but some have the capability to skip within a region; they help narrow down the data set returned to the client. With limited metadata/statistics about HBase tables, supporting a variety of hints helps improve query efficiency.

The Data

This section describes the schema which the sample data will use to demonstrate the effects of pushdown from Big SQL.
We will use a tpch table: the orders table, with 150,000 rows, defined using the mapping shown below. Issue the following command from a Big SQL shell to create the orders table. Notice this table has a many-to-one mapping, meaning there is a composite key and dense columns.

BigSQL Shell
CREATE HBASE TABLE ORDERS ( O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1), O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15), O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) ) column mapping ( key mapped by (O_CUSTKEY,O_ORDERKEY), cf:d mapped by (O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT), cf:od mapped by (O_ORDERDATE) ) default encoding binary;

Load the sample data into the newly created table by issuing the following command.

Note: As in Part I, there are three sample sets provided for you. Each one is essentially the same, with one key difference: the amount of data contained within them. The remaining instructions in this lab exercise will use the orders.10p.tbl dataset, simply because it has a smaller amount of data and will be faster to work with for demonstration purposes. If you would like to use the larger tables with more data, feel free to do so, but remember to change the names appropriately.

BigSQL Shell
LOAD HBASE DATA INPATH 'tpch/orders.10p.tbl' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE ORDERS;

150000 rows affected (total: 21.52s)

In the next set of sections, we examine the output from the Big SQL log files to point out what you can check to confirm pushdown from Big SQL. To view log messages, you may have to first change logging levels using the commands below.

BigSQL Shell
log com.ibm.jaql.modules.hcat.mapred.JaqlHBaseInputFormat info;

BigSQL Shell
log com.ibm.jaql.modules.hcat.hbase info;

Note that columns are pushed down at the HBase level.
So in many-to-one mappings, if the query requires only one part of a dense column with many parts, the entire value of the dense
column will be returned. Therefore, it is efficient to pack together columns that are usually queried together.

Use the following command to tail the Big SQL log file. Keep this open in a terminal throughout this entire part of the lab; we will refer to it quite often to see what is going on behind the scenes when running certain commands.

Linux Terminal
tail -f /var/ibm/biginsights/bigsql/logs/bigsql.log

Projection Pushdown

The first query here does a SELECT * and requests all HBase columns used in the table mapping. The original HBase table could have many more columns; we may have defined an external table mapping to just a few of them. In such cases, only the HBase columns used in the mapping will be retrieved.

BigSQL Shell
SELECT * FROM orders go -m discard;

150000 rows in results(first row: 1.73s; total: 10.69s)

In the Big SQL log file, we can see that data was returned from both columns.

BigSQL Log
… …HBase scan details:{…, families={cf=[d, od]}, …, stopRow=, startRow=, totalColumns=2, …}

The second query requests only one HBase column:

BigSQL Shell
SELECT o_totalprice FROM orders go -m discard;

Notice that the query returns much faster, since we are returning much less data.

150000 rows in results(first row: 0.27s; total: 2.83s)

Verify from the log file that this query only executed against one column.

BigSQL Log
… …HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, totalColumns=1, …}

The third query also requests only one HBase column.
BigSQL Shell
SELECT o_orderdate FROM orders go -m discard;

Although this query returns less data, it has a higher response time, because serialization/deserialization of the timestamp type is expensive.

150000 rows in results(first row: 0.37s; total: 4.5s)

BigSQL Log
… …HBase scan details:{…, families={cf=[od]}, …, stopRow=, startRow=, totalColumns=1, …}

Predicate Pushdown

Point Scan

Identifying and using point scans is the most effective optimization for queries into HBase. To convert to a point scan, we need predicate values covering the full row key. These could come in as multiple predicates, since Big SQL supports composite keys. The query analyzer in Big SQL is capable of combining multiple predicates to identify a full row scan. Currently, this analysis happens at run time in the storage handler. At that point, the decision of whether or not to use map reduce has already been made, so to bypass map reduce a user currently has to provide explicit local mode access hints. In the example below, the command "set force local on" makes sure all queries executing in the session do not use map reduce.

BigSQL Shell
set force local on;

Issue the following select statement, which provides predicates for the columns that make up the full row key: custkey and orderkey.

BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4 and o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5612065    | 71845.25781  |
+------------+--------------+
1 row in results(first row: 0.18s; total: 0.18s)

If we check the logs, we can see that Big SQL successfully took both predicates specified and combined them to do a row scan using all parts of the composite key.
BigSQL Log
… … Found a row scan by combining all composite key parts. … Found a row scan from row key parts … HBase filter list created using AND. … HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1): [PrefixFilter \x01\x80\x00\x00\x04], …, stopRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!, startRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!, totalColumns=1, …}

Partial Row Scan

This section shows the capability of the Big SQL server to process predicates on leading parts of the row key, and not necessarily the full row key as in the previous section. Issue the following example query, which provides a predicate for the first part of the row key, custkey.

BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5453440    | 17938.41016  |
| 5612065    | 71845.25781  |
+------------+--------------+
2 rows in results(first row: 0.19s; total: 0.19s)

Checking the logs, you can see the predicate on the first part of the row key is converted to a range scan. The stop row in the scan is non-inclusive, so it is internally appended with the highest possible byte to cover the partial range.

BigSQL Log
… … Found a row scan that uses the first 1 part(s) of composite key. … Found a row scan from row key parts … HBase filter list created using AND. … HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1): [PrefixFilter \x01\x80\x00\x00\x04], …, stopRow=\x01\x80\x00\x00\x04\xFF, startRow=\x01\x80\x00\x00\x04, totalColumns=1, …}

Range Scan

When there are range predicates, we can set the start row, the stop row, or both. In our example query below we have a "less than" predicate; therefore we only know the stop row. However, even setting this will help eliminate regions with row keys that fall above the stop row. Issue the following command.

BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey < 15;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5453440    | 17938.41016  |
| 5612065    | 71845.25781  |
| 5805349    | 255145.51562 |
| 5987111    | 97765.57812  |
| 5692738    | 143292.53125 |
| 5885190    | 125285.42969 |
| 5693440    | 117319.15625 |
| 5880160    | 198773.68750 |
| 5414466    | 149205.60938 |
| 5534435    | 136184.51562 |
| 5566567    | 56285.71094  |
+------------+--------------+
11 rows in results(first row: 0.22s; total: 0.22s)

Notice in the log file that, similarly to the previous section, we are only using the first part of the composite key, since we specify custkey as the predicate. However, in this case, since we only know the stop row (less than 15), there is no value for the start row portion of the scan.

BigSQL Log
… … Found a row scan that uses the first 1 part(s) of composite key. … Found a row scan from row key parts … … HBase scan details:{…, families={cf=[d]}, …, stopRow=\x01\x80\x00\x00\x0F, startRow=, totalColumns=1, …}

Full Table Scan

This section simply shows an example of what happens when none of the predicates can be pushed down to HBase. In this example query, the predicate (orderkey) is on a non-leading part of the row key and therefore is not pushed down. Issue the command to see that this results in a full table scan.

BigSQL Shell
select o_orderkey, o_totalprice from orders where o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5612065    | 71845.25781  |
+------------+--------------+
1 row in results(first row: 1.90s; total: 1.90s)

As can be determined by examining the logs, in cases where none of the predicates can be pushed to HBase, a full table scan is required; there are no specified values for either the start or the stop row.

BigSQL Log
  37. … HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, …}

Automatic Index Usage

This section demonstrates the benefits of an index lookup. Before creating an index, let's first execute a query that invokes a full table scan, so we can later compare it against the performance we achieve with an index on the relevant column(s). Notice that we are specifying a predicate on the clerk column, which is the middle part of a defined dense column.

BigSQL Shell
SELECT * FROM orders WHERE o_clerk='Clerk#000000999' go -m discard;

154 rows in results(first row: 2.40s; total: 4.32s)

As you can see below in the log file, no index is used.

BigSQL Log
… indexScanInfo: [isIndexScan: false], valuesInfo: [minValue: undefined, minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo: [numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false], indexScanCandidateInfo: [hasIndexScanCandidate: false]]

Issue the following command to create an index on the clerk column, the middle part of a dense column in the table. This creates a new table to store the index data; the index table stores each column value and the row key it appears in.

BigSQL Shell
CREATE INDEX ix_clerk ON TABLE orders (o_clerk) AS 'hbase';

 Note: The create index statement creates the new index table, which uses <base_table_name>_<index_name> as its name, deploys the coprocessor, and populates the index table using the MapReduce index builder. The "as 'hbase'" clause indicates the type of index handler to use; for HBase, there is a separate index handler.

0 rows affected (total: 1m17.47s)

Re-issue the exact same command as before.

BigSQL Shell
SELECT * FROM orders WHERE o_clerk='Clerk#000000999' go -m discard;
  38. After creating the index and issuing the same select statement, Big SQL automatically takes advantage of the index and avoids a full table scan, which results in a much faster response time.

154 rows in results(first row: 0.73s; total: 0.74s)

You can verify in the log file that Big SQL used the index. The index table is scanned for all rows matching the value of the clerk predicate, in this case Clerk#000000999. From the matching row(s), the row key(s) of the base table are extracted, and get requests are batched and sent to the data table.

BigSQL Log
… indexScanInfo: [isIndexScan: true, keyLookupType: point_query, indexDetails: JaqlHBaseIndex[indexName: ix_clerk, indexSpec: {"bin_terminator": "#","columns": [{"cf": "cf","col": "o_clerk","cq": "d","from_dense": "true"}],"comp_seperator": "%","composite": "false","key_seperator": "/","name": "ix_clerk"}, numColumns: 1, columns: [Ljava.lang.String;@3ced3ced, startValue: \x01Clerk#000000999\x00, stopValue: \x01Clerk#000000999\x00]], valuesInfo: [minValue: [B@4b834b83, minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo: [numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false], indexScanCandidateInfo: [hasIndexScanCandidate: true, indexScanCandidate: IndexScanCandidate[columnName: o_clerk,indexColValue: [B@4cda4cda,[operator: =,isVariableLength: false,type: null,encoding: BINARY]]]
… Found an index scan from index scan candidates. Details:
… Index name: ix_clerk
… Index query details: [indexSpec:ix_clerk, startValueBytes: #Clerk#000000999, stopValueBytes: #Clerk#000000999,baseTableScanStart:,baseTableScanStop:]
… Index query successful.

 Note: For a composite index, where multiple columns are used to define an index, predicates are handled and pushed down similarly to composite row keys. Without the index, the predicate could not be pushed down, as it is on the non-leading part of a dense column.
In such cases, a full table scan is required, as seen at the beginning of this section.

Pushing Down Filters into HBase

Although HBase filters do not avoid a full table scan, they limit the rows and data returned to the client. HBase filters have a skip facility that lets them skip over certain portions of data; many of the built-in filters implement this and are therefore more efficient than a raw table scan. Some filters can also limit the data returned within a row. For example, when only the row key portion of the data is needed, filters like FirstKeyOnlyFilter and KeyOnlyFilter can be applied to return just a single instance of the row key part of the data. The sample query below demonstrates a case where Big SQL pushes down both a row scan and a column filter.

BigSQL Shell
  39. SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND o_orderstatus='P' go -m discard;

1278 rows in results(first row: 0.37s; total: 0.38s)

Notice that the predicate on the custkey column triggers the row scan. The column filter, SingleColumnValueFilter, is triggered because there is a predicate on the leading part of a dense column (cf:d).

BigSQL Log
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], …, stopRow=, startRow=\x01\x80\x01\x86\xA1, totalColumns=1, …}

In this way, Big SQL can automatically convert predicates into many of these filters and handle queries more efficiently.

Table Access Hints

Access hints affect the strategy used to read the table, identify the source of the data, and optimize a query. For example, the strategy determines behaviour such as whether MapReduce or an in-memory (hash) join is employed to implement a join. These hints can also control how data is accessed from specific sources. The table access hint we will explore here is accessmode.

Accessmode

The accessmode hint is very important for HBase because it avoids MapReduce overhead. Combined with point queries, it ensures sub-second response times that are unaffected by the total data size. There are multiple ways to specify the accessmode hint – as a query hint or at the session level. Note that session-level hints take precedence: if "set force local off;" is run in a session, all subsequent queries will always use MapReduce, even if an explicit accessmode='local' hint is specified on the query. You can check the state of accessmode, if it was explicitly set on the session, with the following command in the Big SQL shell.
BigSQL Shell
set;

If you kept the same shell open throughout this part of the lab, you will see the following output. This is because we used "set force local on" earlier in one of the previous sections.
  40. +--------------------+-------+
| key                | value |
+--------------------+-------+
| bigsql.force.local | true  |
+--------------------+-------+
1 row in results(first row: 0.0s; total: 0.0s)

To change the setting back to the default, set the value to automatic with the following command.

BigSQL Shell
set force local auto;

Issue the following select query.

BigSQL Shell
select o_orderkey from orders where o_custkey=4 and o_orderkey=5612065;

Notice how long the query takes.

+------------+
| o_orderkey |
+------------+
| 5612065    |
+------------+
1 row in results(first row: 7.2s; total: 7.2s)

Issue the same query, this time with an accessmode hint.

BigSQL Shell
select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=4 and o_orderkey=5612065;

Notice how much faster the query returns its results. This is because of the local access mode: no MapReduce job is employed.

+------------+
| o_orderkey |
+------------+
| 5612065    |
+------------+
1 row in results(first row: 0.32s; total: 0.32s)

PART II – B – Connecting to Big SQL Server via JDBC

Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools — and perhaps reuse portions of existing applications — can be quite appealing to organizations new to Hadoop.
  41. Therefore, Big SQL supports a JDBC driver that conforms to the JDBC 3.0 specification to provide connectivity to Java™ applications. (Big SQL also supports a 32-bit or 64-bit ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database Connectivity 3.0.0 specification, to provide connectivity to C and C++ applications.)

In this part of the lab, we will explore how to use Big SQL's JDBC driver with BIRT, an open source business intelligence and reporting tool that plugs into Eclipse. We will use this tool to run some very simple reports using SQL queries on data stored in HBase in our Hadoop environment.

Business Intelligence and Reporting via BIRT

To start, open Eclipse from the Desktop of the virtual machine by clicking on the Eclipse icon. When prompted to do so, leave the default workspace as is. Once Eclipse has loaded, switch to the 'Report Design' perspective so that we can work with BIRT. To do so, from the menu bar click on: Window -> Open Perspective -> Other.... Then click on: Report Design -> OK as shown below. Once in the Report Design perspective, double-click on Orders.rptdesign from the Navigator pane (on the bottom left-hand side) to open the pre-created report.
  42.  Note: A report has been created on your behalf to illustrate the functionality and usage of the Big SQL drivers more quickly, while removing the tedious steps of designing a report in BIRT.

Expand 'Data Sets' in the Data Explorer. You will notice that the data sets (or report queries) have a red 'X' beside them. This is because the pre-created report queries are not yet associated with a data source. All that is necessary before running the report is to set up the JDBC connection to Big SQL. To obtain the client drivers, open the BigInsights web console from the Desktop of the VM, or point your browser to: http://bivm:8080. From the Welcome tab, in the Quick Links section, select Download the Big SQL Client drivers. Save the file to /home/biadmin/Desktop/IBD-1687A/.
  43. Open the folder where you saved the file and extract the contents of the client package into the same directory. Back in Eclipse, add Big SQL as a source: right-click on Data Sources -> New Data Source in the Data Explorer pane on the top left-hand side. In the New Data Source window, select JDBC Data Source and specify "Big SQL" for the Data Source Name. Click Next.
  44. In the New JDBC Data Source Profile window, click on Manage Drivers…. Once the Manage JDBC Drivers window appears, click on Add…. Point to the location where the client drivers were extracted, then click OK. Once added, you should have an entry for the BigSQLDriver in the Driver Class dropdown field list. Select it, and complete the fields with the following information:
• Database URL: jdbc:bigsql://localhost:7052
• User Name: biadmin
• Password: biadmin
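The Database URL follows the common jdbc:<subprotocol>://<host>:<port> shape, with 7052 being the port the lab's Big SQL server listens on. A small Python sketch (the helper name is mine, not part of the lab materials or the client package) makes the pieces of the URL explicit:

```python
def parse_bigsql_url(url: str):
    """Split a URL of the form jdbc:bigsql://host:port into (host, port).
    Illustrative helper only -- not part of the Big SQL client package."""
    prefix = "jdbc:bigsql://"
    if not url.startswith(prefix):
        raise ValueError("not a Big SQL JDBC URL: " + url)
    host, _, port = url[len(prefix):].partition(":")
    return host, int(port)

print(parse_bigsql_url("jdbc:bigsql://localhost:7052"))  # ('localhost', 7052)
```

If you connect from outside the VM, only the host part changes (e.g. bivm instead of localhost); the port stays the same.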
  45. Click on 'Test Connection...' to ensure we can connect to Big SQL using the JDBC driver. Double-click 'Orders per year' and add the Big SQL connection that was just defined. Examine the query:

WITH test (order_year, order_date) AS
  (SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS ONLY)
SELECT order_year, COUNT(*) AS cnt
FROM test
GROUP BY order_year
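The 'Orders per year' query's WITH clause restricts the input to the first 20 rows, and the outer query then counts orders per year. The same aggregation can be sketched in plain Python over a few illustrative rows (the dates below are invented for the sketch, not taken from the lab's TPC-H data):

```python
from collections import Counter
from datetime import date

# Invented sample order dates; the real o_orderdate values live in the HBase table.
order_dates = [date(1995, 3, 1), date(1995, 7, 9), date(1996, 1, 2),
               date(1996, 5, 30), date(1996, 11, 11), date(1997, 2, 14)]

# Mirrors: SELECT YEAR(o_orderdate), COUNT(*) ... GROUP BY order_year
counts = Counter(d.year for d in order_dates)
for year in sorted(counts):
    print(year, counts[year])  # 1995 2, then 1996 3, then 1997 1
```

This is exactly the shape of result the report chart visualizes: one count per distinct year.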
  46. Carry out the same procedure to add the Big SQL connection for the 'Top 5 salesmen' data set, and examine the query.

WITH base (o_clerk, tot) AS
  (SELECT o_clerk, SUM(o_totalprice) AS tot FROM orders GROUP BY o_clerk ORDER BY tot DESC)
SELECT o_clerk, tot
FROM base
FETCH FIRST 5 ROWS ONLY

 Note: Disregard the red 'X' that may still appear on the Data Sets. This is a bug and can safely be ignored.

Now that we have defined the Data Source and configured the Data Sets, run the report in the Web Viewer as shown in the diagram below. The output from the Web Viewer against the orders table in Big SQL should be as follows.
  47. As seen in this part of the lab, a variety of IBM and non-IBM software that supports JDBC and ODBC data sources can be configured to work with Big SQL. We used BIRT here, but as another example, Cognos Business Intelligence can use Big SQL's JDBC interface to query data, generate reports, and perform other analytical functions. Similarly, other tools like Tableau can leverage Big SQL's ODBC drivers to work with data stored in a BigInsights cluster.
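As a closing aside on Part II-A: the lab relied on reading scan ranges out of the Big SQL logs, where custkey values appeared as byte strings like \x01\x80\x00\x00\x04. A short sketch can model that layout. The \x01 type-marker byte and the sign-bit-flipped big-endian integer encoding are inferences from the log excerpts in this lab, not a documented format, so treat this purely as an illustration:

```python
import struct

def encode_int_key(value: int) -> bytes:
    """Model of the 4-byte integer key encoding inferred from the lab's logs:
    a 0x01 type-marker byte, then the value big-endian with the sign bit
    flipped so that negative numbers sort before positive ones."""
    return b"\x01" + struct.pack(">I", (value ^ (1 << 31)) & 0xFFFFFFFF)

# Partial scan on o_custkey=4: start at the encoded prefix; the exclusive
# stop row is the prefix padded with 0xFF to cover the whole key range.
start_row = encode_int_key(4)
stop_row = start_row + b"\xff"

# Range scan on o_custkey < 15: no start row, stop row is the encoded bound.
range_stop = encode_int_key(15)

print(start_row.hex(), stop_row.hex(), range_stop.hex())
```

Under this assumed encoding, the computed bytes line up with the log excerpts seen earlier (startRow for custkey 4, stopRow padded with 0xFF for the partial scan, and stopRow \x01\x80\x00\x00\x0F for the range scan below 15).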
  48. Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  o Find the community that interests you …
    • Information Management – bit.ly/InfoMgmtCommunity
    • Business Analytics – bit.ly/AnalyticsCommunity
    • Enterprise Content Management – bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to the Information Management, Business Analytics, and Enterprise Content Management communities
    • ibm.com/champion

Thank You! Your Feedback is Important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite
  49. Acknowledgements and Disclaimers:

Availability: References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.
If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.
