Explore Big Data with Hadoop and
InfoSphere BigInsights
Cynthia M. Saracco (saracco@us.ibm.com)
August 15, 2014
Contents

Lab 1 Overview
  1.1 About your environment
  1.2 Getting started
Lab 2 Issuing basic Hadoop commands
  2.1 Creating a directory in your distributed file system
  2.2 Copying data into HDFS
  2.3 Running a sample MapReduce application
Lab 3 Exploring and administering your cluster with the BigInsights Web console
  3.1 Getting started with the Web console
  3.2 Administering BigInsights
  3.3 Working with the distributed file system (HDFS)
  3.4 Managing and launching pre-built applications from the Web catalog
Lab 4 Analyzing social media data with BigSheets
  4.1 Creating a workbook
  4.2 Analyzing and customizing your workbook
  4.3 Creating charts
  4.4 Creating a Big SQL table based on your workbook
  4.5 Optional: Exporting your workbook data
Lab 5 Querying data with Big SQL
  5.1 Creating a project and executing Big SQL statements
  5.2 Creating sample tables and loading sample data
  5.3 Querying tables with joins, aggregations and more
  5.4 Optional: Using SerDes for non-traditional data
  5.5 Optional: Developing a JDBC client application with Big SQL
Lab 6 Summary
Lab 1 Overview
In this hands-on lab, you'll learn how to work with Big Data using Apache Hadoop and InfoSphere
BigInsights, IBM's Hadoop-based platform. In particular, you'll learn the basics of working with the
Hadoop Distributed File System (HDFS) and see how to administer your Hadoop-based environment
using the BigInsights Web console. After launching a sample MapReduce application, you'll explore a
more sophisticated scenario involving social media data. In doing so, you'll learn how to use a
spreadsheet-style interface to discover insights about the global coverage of a popular brand without
writing any code. Finally, you'll learn how to apply industry standard SQL to data managed by
BigInsights through IBM's Big SQL technology. Indeed, you'll have a chance to create tables and
execute complex queries over data in HDFS, including data derived from a relational data warehouse.
Ready to get started?
After completing this hands-on lab, you’ll be able to:
• Work directly with Apache Hadoop through file system commands
• Inspect and administer your cluster through the BigInsights Web Console
• Explore big data using a spreadsheet-style tool
• Use Big SQL to create tables and issue complex queries
Allow 2 ½ - 3 hours to complete this lab.
This lab was developed by Cynthia M. Saracco, IBM Silicon Valley Lab. Please post questions or
comments about this lab or the technologies it describes to the forum on Hadoop Dev at
https://developer.ibm.com/hadoop/.
1.1. About your environment
This lab was developed for the InfoSphere BigInsights 3.0 Quick Start Edition VMware image. If necessary, download and install the single-node cluster VMware image from this site: http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html
The VMware image is set up in the following manner:

  Account                      User      Password
  VM image root account        root      password
  VM image lab user account    biadmin   biadmin
  BigInsights Administrator    biadmin   biadmin
  Big SQL Administrator        bigsql    bigsql
  Lab user                     biadmin   biadmin
  Property                     Value
  Host name                    bivm.ibm.com
  BigInsights Web Console URL  http://bivm.ibm.com:8080
  Big SQL database name        bigsql
  Big SQL port number          51000
About the screen captures, sample code, and environment configuration
Screen captures in this lab depict examples and results that may vary from
what you see when you complete the exercises. In addition, some code
examples may need to be customized to match your environment. For
example, you may need to alter directory path information or user ID
information.
1.2. Getting started
To get started with the lab exercises, you need to install and launch the VMware image as well as start
the required services.
__1. If necessary, obtain a copy of the BigInsights 3.0 Quick Start Edition VMware image from IBM's external download site (http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html). Use the image for the single-node cluster.
__2. Follow the instructions provided to decompress (unzip) the file and install the image on your
laptop. Note that there is a README file with additional information.
__3. If necessary, install VMware player or other required software to run VMware images. Details
are in the README file provided with the BigInsights VMware image.
__4. Launch the VMware image. When logging in for the first time, use the root ID (with a password
of password). Follow the instructions to configure your environment, accept the licensing
agreement, and enter the passwords for the root and biadmin IDs (root/password and
biadmin/biadmin) when prompted. This is a one-time only requirement.
__5. When the one-time configuration process is completed, you will be presented with a SUSE Linux
log in screen. Log in as biadmin with a password of biadmin.
__6. Verify that your screen appears similar to this:
__7. Click Start BigInsights to start all required services. (Alternatively, you can open a terminal
window and issue this command: $BIGINSIGHTS_HOME/bin/start-all.sh)
Wait until the operation completes. This may take several minutes, depending on your
machine's resources.
__8. Verify that all required BigInsights services are up and running. From a terminal window, issue
this command: $BIGINSIGHTS_HOME/bin/status.sh.
__9. Inspect the results, a subset of which are shown below. Verify that, at a minimum, the following
components started successfully: hdm, zookeeper, hadoop, catalog, hive, bigsql, oozie,
console, and httpfs.
Now you're ready to start working with big data!
If you have any questions or need help getting your environment up and running, visit Hadoop
Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post
a message to the forum.
You cannot proceed with subsequent lab exercises until you've logged into the VMware
image and launched the necessary BigInsights services.
Lab 2 Issuing basic Hadoop commands
In this exercise, you’ll work directly with Apache Hadoop to perform some basic tasks involving the
Hadoop Distributed File System (HDFS) and launching a sample application. All the work you’ll perform
here involves commands and interfaces provided with Hadoop from http://hadoop.apache.org. As
mentioned earlier, Hadoop is part of IBM’s InfoSphere BigInsights platform.
Allow 15 minutes to complete this lab module.
2.1. Creating a directory in your distributed file system
__1. Click the BigInsights Shell icon.
__2. Select the Terminal icon to open a terminal window.
__3. Execute the following Hadoop file system command to create a directory in HDFS for your work:
hadoop fs -mkdir /user/biadmin/test
Note that HDFS is distinct from your local Unix/Linux file system; working with HDFS requires hadoop fs commands.
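The hadoop fs command supports many subcommands that mirror familiar Unix utilities. A few examples you may find handy throughout this lab (paths are illustrative, and exact options can vary slightly by Hadoop version):

hadoop fs -ls /user/biadmin             # list the contents of an HDFS directory
hadoop fs -mkdir -p /user/biadmin/a/b   # create nested directories in one step
hadoop fs -du /user/biadmin             # show space consumed by files
hadoop fs -help                         # display usage for the available subcommands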
2.2. Copying data into HDFS
__1. Using standard Unix/Linux file system commands, list the contents of the /home/biadmin/licenses
directory.
ls /home/biadmin/licenses
Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a
sample data file for a future exercise.
__2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS.
hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test
__3. List the contents of your target HDFS directory to verify that the file was successfully copied.
hadoop fs -ls /user/biadmin/test
2.3. Running a sample MapReduce application
WordCount is one of several sample MapReduce applications provided for Apache Hadoop. Written in
Java, it simply scans through input document(s) and, for each word, returns the total number of
occurrences found. You can read more about WordCount on the Apache wiki
(http://wiki.apache.org/hadoop/WordCount).
Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you'll explore how to
do that with WordCount.
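If you're curious about what WordCount actually computes, its result is conceptually similar to what the following Unix pipeline produces for a local file; Hadoop simply distributes the same scan-and-tally work across a cluster (this is an illustration only, not how the MapReduce job is implemented):

cat /home/biadmin/licenses/BIlicense_en.txt | tr -s ' ' '\n' | sort | uniq -c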
__1. Execute the following command to launch the sample WordCount application provided with your
Hadoop distribution.
hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount
/user/biadmin/test WordCount_output
This command specifies that the wordcount application contained in the specified .jar file is to be
launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of
this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this
command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This
directory will be created automatically as a result of executing this application.
NOTE: If the output folder already exists or if you try to rerun a successful
MapReduce job with the same parameters, you will receive an error message. This
is the default behavior of the sample WordCount application.
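If you do need to rerun the job, first remove the previous output directory with an HDFS command such as the one below (adjust the path if you chose a different output directory; on older Hadoop versions the equivalent subcommand is -rmr):

hadoop fs -rm -r /user/biadmin/WordCount_output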
__2. Inspect the output of your job.
hadoop fs -ls WordCount_output
In this case, the output was small and was written to a single file. If you had run WordCount
against a larger volume of data, its output would have been split into multiple files (e.g., part-r-00001,
part-r-00002, and so on).
__3. To view the contents of the part-r-00000 file, issue this command:
hadoop fs -cat WordCount_output/*00
Partial output is shown here:
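Because the job's output is plain tab-delimited text, you can also combine HDFS commands with standard Unix utilities. For example, this sketch lists the ten most frequent words (it assumes the reducer output file is named part-r-00000, the usual default):

hadoop fs -cat WordCount_output/part-r-00000 | sort -k2 -nr | head -10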
__4. Optionally, inspect details about your job. Open a Web browser, or click on the web console
icon on your desktop and open a new tab. Access the URL for Hadoop's Job Tracker
(http://bivm.ibm.com:50030/jobtracker.jsp). Scroll to the Completed Jobs section to
locate the Job ID associated with the Word Count application. Click on the Job ID link to review
details, such as the number of Map and Reduce tasks launched for your application, the number
of bytes read and written, etc. Partial output is shown in the second image that follows.
Lab 3 Exploring and administering your cluster with the
BigInsights Web console
As you saw in the previous lab, Apache Hadoop users typically work through a command line interface to
perform many common tasks. This lab introduces you to the BigInsights Web console, which enables
you to administer your cluster, work with HDFS, launch jobs, and perform many other tasks using a
graphical interface.
After completing this hands-on lab, you’ll be able to:
• Launch the Web console.
• Work with popular resources accessible through the Welcome page.
• Administer BigInsights by inspecting the status of your cluster and accessing tools for open
source components provided with BigInsights.
• Work with the distributed file system. In particular, you'll explore the HDFS directory structure,
create subdirectories, and upload files to HDFS.
• Manage and launch pre-built applications from a Web catalog.
• Inspect the status of previously launched applications (jobs) and review their output.
Allow 30 minutes to complete this section of lab.
This lab is an introduction to a subset of console functions. Real-time monitoring, dashboards, alerts,
and application linking are among the more advanced console functions that are beyond this lab's scope.
3.1. Getting started with the Web Console
In this exercise, you will launch the console and inspect its Welcome page.
__1. Launch the BigInsights Web console. Direct your browser to http://bivm.ibm.com:8080 or click
the Web Console icon on your desktop.
__2. Log in with your user name and password (biadmin / biadmin).
__3. Verify that your Web console appears similar to this:
__4. Briefly skim through the links provided in these sections to become familiar with resources
available to you:
Tasks: Quick access to popular BigInsights tasks
Quick Links: Shortcuts to internal and external resources and downloads that enhance your environment
Learn More: Online resources available to learn more about BigInsights
3.2. Administering BigInsights
The Web console allows administrators to inspect the overall health of the system as well as perform
basic functions, such as starting and stopping specific servers or components, adding nodes to the
cluster, and so on. You’ll explore a subset of these capabilities here.
__5. Click on the Cluster Status tab at the top of the page.
__6. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster
that had several services running. One service, Monitoring, was unavailable. Your display
may differ somewhat. It’s not necessary for all BigInsights services to be running to complete
the exercises in this lab.
__7. Click on the Hive service and note the detailed information provided for this service in the pane
at right. For example, you can see the URL for Hive's Web interface and its process ID. In
addition, note that you can start and stop services (such as the Hive service) from the Cluster
Status page of the console.
__8. Optionally, cut-and-paste the URL for Hive’s Web interface into a new tab of your browser.
You'll see an open source tool provided with Hive for administration purposes, as shown below.
Other open source tools provided with Apache Hadoop are also available through IBM's
packaged distribution (BigInsights), as you'll see shortly. Close this browser tab.
__9. Click on the Welcome page of your Web console.
__10. Click on the Access secure cluster servers button in the Quick Links section at right.
If nothing appears, verify that the pop-up blocker of your browser is disabled; a prompt should
appear at the top of the page if pop-ups are blocked.
__11. Inspect the list of server components for which there are additional Web-based tools. The
BigInsights console displays the URLs you can use to access each of these Web sites directly.
(This information will only appear if the pop-up blocker is disabled in your browser.)
__12. Click on the jobtracker alias. The display should be familiar to you -- it's the same one you saw
in the previous lab that introduced you to some basic Hadoop facilities.
3.3. Working with the distributed file system (HDFS)
In this section, you'll learn how to use the Web console to create directories in HDFS, navigate the file
system, and upload small files -- tasks you performed earlier through a command-line interface. In
addition, you'll perform a few other file-related tasks as well. Many people find the console's graphical
interface to be easier to use than the command-line interface.
__1. Click on the Files tab at the top of the page.
__2. Expand the DFS directory tree in the left pane to display the contents of /user/biadmin. Note
the presence of the WordCount_output and test subdirectories, which you created in an
earlier lab. If desired, expand each directory and inspect its contents.
__3. Become familiar with the functions provided through the icons at the top of this pane, as we'll
refer to some of these in subsequent sections of this module. Simply position your cursor on
each icon to learn its function. From left to right, the icons enable you to copy a file or directory,
move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file
from HDFS to your local file system, remove a file or directory from HDFS, set permissions, open
a command window to launch HDFS shell commands, and refresh the Web console page.
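For reference, most of these console functions have HDFS shell equivalents that you could run from a terminal instead; a few illustrative examples (paths are placeholders):

hadoop fs -cp /user/biadmin/a.txt /user/biadmin/test2    # copy a file
hadoop fs -mv /user/biadmin/a.txt /user/biadmin/test2    # move a file
hadoop fs -mkdir /user/biadmin/newdir                    # create a directory
hadoop fs -put localfile.txt /user/biadmin/newdir        # upload a local file
hadoop fs -get /user/biadmin/newdir/a.txt .              # download a file
hadoop fs -rm /user/biadmin/newdir/a.txt                 # remove a file
hadoop fs -chmod 644 /user/biadmin/newdir/a.txt          # set permissions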
__4. Delete the /user/biadmin/test directory and its contents. Position your cursor on this
directory, click the red X icon, and click Yes when prompted.
__5. Create a new subdirectory in /user/biadmin. With your cursor positioned on /user/biadmin,
click the create directory icon.
__6. When a pop-up window appears, specify test2 as the new directory's name and click OK.
__7. Expand the directory hierarchy to verify that your new subdirectory was created.
__8. Upload a file into this directory from your local file system. Click the upload icon.
__9. When a pop-up window appears, click the Browse button to navigate through your local file
system to /home/biadmin/licenses. Select the BIlicense_en.txt file and click Open.
__10. Expand the /user/biadmin/test2 directory and verify that the BIlicense_en.txt file was
successfully copied into HDFS. Note that the right pane of the Web console previews the file's
contents.
3.4. Managing and launching pre-built applications from the Web catalog
The Web console includes a catalog of ready-made applications that users can launch through a
graphical interface. Each application's status, execution history, and output are easy to monitor from this
page as well. In this exercise, you'll first manage the catalog’s contents, selecting one of more than 20
pre-built applications provided with BigInsights to deploy on your cluster. Once deployed, the application
will be visible to all authorized users. You'll then launch the application, monitor its execution status, and
inspect its output.
As you might have guessed, the sample application used in this lab is Word Count -- the same
application you ran from a command line earlier.
__1. Click the Applications tab of the Web console. No applications are deployed on a new cluster,
so there won't be much to see yet.
__2. In the upper left corner, click Manage. A list of applications available for deployment is displayed.
__3. Expand the Test category and click on the Word Count application.
__4. Click Deploy.
__5. When a pop-up window appears, accept the defaults for all settings and click Deploy.
__6. After the application has been deployed, you're ready to run it. Click Run in the upper left pane.
__7. Verify that the Word Count application appears in the catalog. (Any other applications that were
previously deployed to the Web catalog will also appear.)
__8. Click on the Word Count icon. The pane at right prompts you to enter appropriate information.
For this application, you need to specify an execution name for your application's run, the HDFS
directory containing the input document(s) for the Word Count application, and an output
directory in HDFS.
__9. For the Execution name, enter My Test Run 1.
__10. For the Input path, click Browse and navigate to /user/biadmin/test2. Click OK.
__11. For the Output path, type /user/biadmin/WordCount_console_output. (Recall that the Word
Count application creates this output directory at run time. If you specify an existing HDFS
directory for the output, the application will fail.)
__12. Verify that your display appears similar to this and click Run.
__13. As your application executes, monitor its status through the Application History pane at lower
right.
__14. When the application completes successfully, click the link provided in the Output column to see
the application's output.
__15. Optionally, return to the Applications page of the console and click on the link provided in the
Details column for your application's run.
__16. Note that the console displays the Application Status page, which contains information about the
Oozie workflow for your application as well as the application itself. If desired, click on one or
more available links to explore details available for your review.
Lab 4 Analyzing social media data with BigSheets
To help business analysts and those without a programming background analyze big data, IBM provides
a spreadsheet-style tool called BigSheets. In this lab, you'll learn how you can explore big data through
this tool without writing any scripts or MapReduce applications. The sample data for this lab consists of
social media posts about a popular brand (IBM Watson) that were collected using a sample application
provided with BigInsights. For background information, you may want to read the article on Analyzing
social media and structured data with InfoSphere BigInsights at
http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html
After completing this hands-on lab, you’ll be able to:
• Create a BigSheets workbook
• Analyze and customize a workbook
• Visualize your workbook's data in a chart
• Create a Big SQL table based on your workbook
• Export your workbook's data into one of several popular formats
Allow 45 – 60 minutes to complete this lab.
4.1. Creating a workbook
To get started, copy the sample blogs-data.txt file to HDFS and create a master workbook for it.
__1. Obtain the blogs-data.txt file. You’ll find this in the sampleData.zip file provided with the article
mentioned earlier.
__2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in
HDFS for your sample data. Under /user/biadmin, create a /sampleData directory. Beneath
/user/biadmin/sampleData, create the /IBMWatson subdirectory.
Where did this data come from?
For time efficiency, social media data about "IBM Watson" was already collected using the
Boardreader sample application, which collects social media data from various global sites
and writes the output in JSON array format to files. This lab focuses on blog data collected
about IBM Watson for a six-month interval.
Boardreader is an IBM business partner that offers a social media content aggregation and
provisioning service based on multilingual data dating back to 2001. The service searches
message boards / forums, social networks, blogs/comments, microblogs, reviews,
videos/comments and online news. Customers who want to use the Boardreader service
should contact the firm directly to obtain a license key.
If you forgot how to create a subdirectory in HDFS, consult the earlier labs on Issuing Basic
Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web
Console.
__3. Upload the blogs-data.txt file to the /user/biadmin/sampleData/IBMWatson directory. You
can use Hadoop file system commands or the BigInsights Web console to do this. (If you forgot
how to copy a file to HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or
Exploring and Administering Your Cluster with the BigInsights Web Console.)
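If you prefer the command line for steps 2 and 3, a sketch like the following does the same work; it assumes you extracted blogs-data.txt to /home/biadmin/sampleData on your local file system, so adjust that path to wherever you unzipped the file:

hadoop fs -mkdir /user/biadmin/sampleData
hadoop fs -mkdir /user/biadmin/sampleData/IBMWatson
hadoop fs -put /home/biadmin/sampleData/blogs-data.txt /user/biadmin/sampleData/IBMWatson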
__4. From the Files page of the Web console, position your cursor on the
/user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous
image.
__5. Click the Sheet radio button to preview this data in a spreadsheet-style format.
__6. Because the sample blog data for this lab uses a JSON array structure, you must click the pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON Array reader and click the green check mark.
__7. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click
Save.
__8. Note that the BigSheets page of the Web console will open and your new workbook will be
displayed.
Now you're ready to begin exploring this data using BigSheets.
4.2. Analyzing and customizing your workbook
BigSheets offers analysts a variety of macros, functions, and built-in analytical features. You'll learn
about a few here.
__1. To make it easier to search and manage your workbooks, add a few tags to the Watson Blogs
master workbook you just created. In the upper right corner, click the icon to toggle the
workbook display to show additional fields.
Depending on the size of your browser, an additional scroll bar may appear at right.
__2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign
(+) , enter a tag for Watson, and click the green check mark. Repeat the process to add
separate tags for IBM and blogs.
__3. Click on the Workbooks link in the upper left corner of your open workbook.
__4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-
down Tags menu to select the blogs tag or type tag: blogs into the box.
__5. Open the Watson Blogs master workbook again. (Double click on it.)
__6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a “base” workbook that offers only limited editing, so to manipulate the data it contains, you create a new workbook derived from the master.
__a. Click the Build new Workbook button.
__b. When the new Workbook appears, change its default name. Click the pencil icon next to
the name, enter Watson Blogs Revised as the new name, and click the green check
mark.
__c. Click the Fit column(s) button to more easily see columns A through H on your screen.
__7. Remove the IsAdult column from your workbook. This is currently column E. Click the triangle next to the IsAdult column name and select Remove.
__8. In this case, you want to keep only a few columns. To easily remove several columns, click the triangle again (on any column) and select Organize Columns.
__a. Click the red X button next to each column you want to remove.
In this case, KEEP the following columns:
__i. Country
__ii. FeedInfo
__iii. Language
__iv. Published
__v. SubjectHtml
__vi. Tags
__vii. Type
__viii. Url
__b. Click the green check mark button when you are ready to remove the columns you
selected.
Did I lose data?
Deleting a column does not remove data. Deleting a column in a workbook just
removes the mapping to this column.
__9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen
appears similar to this:
__10. From the Save menu at upper left, select Save. Provide a description for your workbook if you’d
like.
__11. Apply a built-in function to further investigate the contents of this workbook. Click the Add
Sheets button in the lower left corner.
__12. From the pop-up menu, select Function. You're going to apply a built-in function that extracts
the URL Host information from the full URL links associated with the blog data that was
captured. Doing so will enable you to identify and chart sites with greatest blog coverage of IBM
Watson.
__13. From the Function menu, click Categories and Url.
__14. Select the URLHOST function.
__15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url
column as the source of input to the URLHOST function.
__16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook
you'd like to retain. Select Add All and click the green check mark.
__17. Verify that your workbook contains a new URLHOST column and all previously existing columns.
(Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview
the results of your work against a small sample of the data represented by your workbook.) If
desired, click the Fit Column button to show more columns on your screen.
__18. Click Save > Save & Exit.
__19. When prompted to Run or Close the workbook, click Run. "Running" a workbook instructs
BigSheets to apply the logic you specified graphically against all data associated with your
workbook. You can monitor the progress of your request by watching the status bar indicator in
the upper right-hand side of the page.
__20. When the operation completes, verify that your workbook appears similar to this:
__21. If desired, use the Next button in the lower right corner to page through the content a few
times, noting the various URLHOST values. If desired, you could use built-in BigSheets features
to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the
English language), etc. But perhaps the quickest way to see which sites published the most
blogs about IBM Watson during this time period is to chart the results. You'll do that next.
4.3. Creating charts
Now that you've customized your workbook to eliminate some unwanted columns and generate a new
column containing URL host information, it's time to visualize the results. In this short exercise, you'll
create two simple charts that identify the top 10 global sites with the most blog posts about IBM Watson.
__1. If necessary, open the Watson Blogs Revised workbook.
__2. Click on the Add chart link in the lower left.
__3. Select chart > Bar as the chart type.
__4. Specify appropriate properties for the bar chart, paying close attention to these fields:
__a. Title: Top 10 Blog Sites for IBM Watson
__b. X Axis: URLHOST
__c. Sort By: Y Axis
__d. Occurrence Order: Descending
__e. Limit: 10
__5. Click the green check mark.
__6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire
data set.
__7. Inspect the results. Are you surprised that ibm.com wasn’t the top site for blog posts about IBM
Watson?
__8. If desired, hover over each bar to see the URL host name and the number of blogs posted at that
site.
__9. Next, create a new chart of a different type to visualize the information in a different format.
Select Add Chart > Categories > cloud > Bubble Cloud.
__10. Provide appropriate values for the following fields:
__a. Title: Top 10 Blog Sites for IBM Watson
__b. Tags: URLHOST
__c. Occurrence Order: Descending
__d. Sort By: Count
__e. Limit: 10
__11. Click the green check mark.
__12. When prompted, Run the chart.
__13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that
site.
4.4. Creating a Big SQL table based on your workbook
BigSheets offers a wide range of built-in features, including the ability to create a Big SQL table from
your workbook. This is quite handy if you have SQL-based tools or applications that you'd like to use
with data you've customized in BigSheets.
__1. If necessary, open your Watson Blogs Revised workbook.
__2. Click the Create Table button just above the columns of your workbook. When prompted, accept
sheets as the target schema name and type mywatsonblogs as the target table name.
__3. Click Confirm.
__4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window
and expand the sheets folder.
__5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right.
__6. Click the Welcome tab of the Web console. In the Quick Links section, click the Run Big SQL
queries link.
__7. A new tab will appear in your Web browser.
__8. In the box where you're prompted to enter your Big SQL query, type this statement:
select urlhost, language, subjecthtml from sheets.mywatsonblogs
fetch first 10 rows only;
__9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button).
__10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons.
Click Run.
__11. Inspect the results.
__12. Close the Big SQL browser tab.
4.5. Optional: Exporting your workbook data
In this optional exercise, you'll see how easy it is to export data in your workbook to one of several
popular formats so that other applications can easily access the data.
__1. If necessary, open your Watson Blogs Revised workbook.
__2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format
type.
__3. Click the File radio button to export the data to a file in your distributed file system.
Querying tables with Big SQL
While the Web console's Big SQL query interface is handy for executing test
queries that return a small amount of data, it's best to use other facilities provided
by IBM or third parties to execute Big SQL queries that return larger volumes of
data to avoid memory constraints imposed by your browser. In a subsequent lab,
you'll learn how to execute Big SQL queries from Eclipse.
__4. Use the Browse button to navigate to the directory in HDFS where you would like to export this
workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the
directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv.
Click OK.
__5. Click OK again to initiate the data export operation.
__6. When a message appears indicating that the operation has finished, click OK.
__7. On the Files page of the Web console, navigate to the directory you specified for the export
(/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.
__8. Optionally, click the download icon to copy the file from HDFS to a directory of your choice in
your local file system.
Lab 5 Querying data with Big SQL
Now that you know how to work with HDFS and analyze your data with a spreadsheet-style tool, it’s a
good time to explore how you can query your data with Big SQL. Big SQL provides broad SQL support
based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data
that is stored in InfoSphere BigInsights in the same way that you access relational databases from your
enterprise applications. The SQL query engine supports joins, unions, grouping, common table
expressions, windowing functions, and other familiar SQL expressions.
This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-party retailer stores as well as directly to consumers through its online store. It maintains its data in a
series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In
this lab, you will explore how to create, populate, and query a subset of the star schema database to
investigate the company’s performance and offerings. Note that BigInsights provides scripts to create
and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use
fewer than 10 of these tables in this lab.
To execute the queries in this lab, you will use the open source Eclipse environment provided with the
BigInsights Quick Start Edition VMware image. Of course, you can use other tools or interfaces to
invoke Big SQL, such as the Java SQL Shell (JSqsh), a command-line facility provided with BigInsights. However, Eclipse is a good choice for this lab, as it formats query results in a manner that’s
easy to read and encourages you to collect your SQL statements into scripts for editing and testing.
After you complete the lessons in this module, you will understand how to:
• Connect to the Big SQL server from Eclipse
• Execute individual or multiple Big SQL statements
• Create Big SQL tables in Hadoop
• Populate Big SQL tables with data from local files
• Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular
expressions.
• Create and query a view based on multiple Big SQL tables.
• Create and run a JDBC client application for Big SQL using Eclipse.
Allow 45 – 60 minutes to complete this lab.
5.1. Creating a project and executing Big SQL statements
To begin, create a BigInsights project and Big SQL script.
__1. Launch Eclipse using the icon on your desktop. Accept the default workspace when prompted.
__2. Create a BigInsights project for your work. From the Eclipse menu bar, click File > New > Other.
Expand the BigInsights folder, select BigInsights Project, and then click Next.
__3. Type myBigSQL in the Project name field, and then click Finish.
__4. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective
window opens. Click Yes to switch to the BigInsights perspective.
__5. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the
BigInsights folder, select SQL script, and then click Next.
__6. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL.
Your new SQL file is stored in this project folder.
__7. In the File name field, type aFirstFile. The .sql extension is added automatically. Click Finish.
In the Select Connection Profile window, locate the Big SQL JDBC connection, which is the
pre-defined connection to Big SQL 3.0 provided with the VMware image. Inspect the properties
displayed in the Properties field. Verify that the connection uses the JDBC driver and database
name shown in the Properties pane here.
About the driver selection
You may be wondering why you are using a connection that employs the
com.ibm.db2.jcc.DB2Driver class. In 2014, IBM released a common SQL
query engine as part of its DB2 and BigInsights offerings. Doing so provides
for greater SQL commonality across its relational DBMS and Hadoop-based
offerings. It also brings a greater breadth of SQL function to Hadoop
(BigInsights) users. This common query engine is accessible through the DB2
driver. The Big SQL driver remains operational and offers connectivity to an
earlier, BigInsights-specific SQL query engine. This lab focuses on using the
common SQL query engine.
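Based on the environment described in Lab 1, the JDBC URL behind this connection should look something like the line below (the host name, port, and database name may differ if you customized your image):

jdbc:db2://bivm.ibm.com:51000/bigsql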
__8. Click Edit to edit this connection's log in information.
__9. Change the user name and password properties to match your user ID and password (e.g.,
biadmin / biadmin). Leave the remaining property values intact.
__10. Click Test Connection to verify that you can successfully connect to the server.
__11. Check the Save password box and click OK.
__12. Click Finish to close the connection window. Your empty SQL script will be displayed.
__13. Copy the following statement into your SQL script:
create hadoop table test1 (col1 int, col2 varchar(5));
Because you didn't specify a schema name for the table, it will be created in your default schema,
which is your user name (biadmin). Thus, the previous statement is equivalent to
create hadoop table biadmin.test1 (col1 int, col2 varchar(5));
In some cases, the Eclipse SQL editor may flag certain Big SQL statements as
containing syntax errors. Ignore these false warnings and continue with your lab
exercises.
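Incidentally, because the common query engine follows DB2 conventions, you should be able to switch your default schema for the current connection with a statement along these lines (the schema name here is hypothetical):

set schema testschema;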
__14. Save your file (press Ctrl + S or click File > Save).
__15. Right-click anywhere in the script to display a menu of options.
__16. Select Run SQL or press F5. This causes all statements in your script to be executed.
__17. Inspect the SQL Results pane that appears towards the bottom of your display. (If desired,
double click on the SQL Results tab to enlarge this pane. Then double click on the tab again to
return the pane to its normal size.) Verify that the statement executed successfully. Your Big
SQL database now contains a new table named BIADMIN.TEST1. Note that your schema and
table name were folded into upper case.
For the remainder of this lab, you should execute each SQL statement
individually. To do so, highlight the statement with your cursor and press F5.
When you’re developing a SQL script with multiple statements, it’s generally a
good idea to test each statement one at a time to verify that each is working as
expected.
__18. From your Eclipse project, query the system for meta data about your test1 table:
select tabschema, colname, colno, typename, length
from syscat.columns where tabschema = USER and tabname= 'TEST1';
In case you're wondering, syscat.columns is one of a number of views supplied over system
catalog data automatically maintained for you by the Big SQL service.
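Other catalog views work the same way. For example, a query such as this (a sketch) lists the tables in your current schema:

select tabschema, tabname, create_time
from syscat.tables
where tabschema = USER;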
__19. Inspect the SQL Results to verify that the query executed successfully, and click on the Result1
tab to view its output.
__20. Finally, clean up the object you created in the database.
drop table test1;
__21. Save your file. If desired, leave it open to execute statements for subsequent exercises.
Now that you’ve set up your Eclipse environment and know how to create SQL scripts and execute
queries, you’re ready to develop more sophisticated scenarios using Big SQL. In the next lab, you will
create a number of tables in your schema and use Eclipse to query them.
5.2. Creating sample tables and loading sample data
In this lesson, you will create several sample tables and load data into these tables from local files.
__1. Determine the location of the sample data in your local file system and make a note of it. You
will need to use this path specification when issuing LOAD commands later in this lab.
Subsequent examples in this section presume your sample data is in the
/opt/ibm/biginsights/bigsql/samples/data directory. This is the location
of the data on the BigInsights VMware image, and it is the default location in
typical BigInsights installations.
Furthermore, the /opt/ibm/biginsights/bigsql/samples/queries
directory contains SQL scripts that include the CREATE TABLE, LOAD, and
SELECT statements used in this lab, as well as other statements.
__2. Create several tables to track information about sales. Issue each of the following CREATE
TABLE statements one at a time, and verify that each completed successfully:
-- dimension table for region info
CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
( country_key INT NOT NULL
, country_code INT NOT NULL
, flag_image VARCHAR(45)
, iso_three_letter_code VARCHAR(9) NOT NULL
, iso_two_letter_code VARCHAR(6) NOT NULL
, iso_three_digit_code VARCHAR(9) NOT NULL
, region_key INT NOT NULL
, region_code INT NOT NULL
, region_en VARCHAR(90) NOT NULL
, country_en VARCHAR(90) NOT NULL
, region_de VARCHAR(90), country_de VARCHAR(90), region_fr VARCHAR(90)
, country_fr VARCHAR(90), region_ja VARCHAR(90), country_ja VARCHAR(90)
, region_cs VARCHAR(90), country_cs VARCHAR(90), region_da VARCHAR(90)
, country_da VARCHAR(90), region_el VARCHAR(90), country_el VARCHAR(90)
, region_es VARCHAR(90), country_es VARCHAR(90), region_fi VARCHAR(90)
, country_fi VARCHAR(90), region_hu VARCHAR(90), country_hu VARCHAR(90)
, region_id VARCHAR(90), country_id VARCHAR(90), region_it VARCHAR(90)
, country_it VARCHAR(90), region_ko VARCHAR(90), country_ko VARCHAR(90)
, region_ms VARCHAR(90), country_ms VARCHAR(90), region_nl VARCHAR(90)
, country_nl VARCHAR(90), region_no VARCHAR(90), country_no VARCHAR(90)
, region_pl VARCHAR(90), country_pl VARCHAR(90), region_pt VARCHAR(90)
, country_pt VARCHAR(90), region_ru VARCHAR(90), country_ru VARCHAR(90)
, region_sc VARCHAR(90), country_sc VARCHAR(90), region_sv VARCHAR(90)
, country_sv VARCHAR(90), region_tc VARCHAR(90), country_tc VARCHAR(90)
, region_th VARCHAR(90), country_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
-- dimension table tracking method of order for the sale (e.g., Web, fax)
CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key INT NOT NULL
, order_method_code INT NOT NULL
, order_method_en VARCHAR(90) NOT NULL
, order_method_de VARCHAR(90), order_method_fr VARCHAR(90)
, order_method_ja VARCHAR(90), order_method_cs VARCHAR(90)
, order_method_da VARCHAR(90), order_method_el VARCHAR(90)
, order_method_es VARCHAR(90), order_method_fi VARCHAR(90)
, order_method_hu VARCHAR(90), order_method_id VARCHAR(90)
, order_method_it VARCHAR(90), order_method_ko VARCHAR(90)
, order_method_ms VARCHAR(90), order_method_nl VARCHAR(90)
, order_method_no VARCHAR(90), order_method_pl VARCHAR(90)
, order_method_pt VARCHAR(90), order_method_ru VARCHAR(90)
, order_method_sc VARCHAR(90), order_method_sv VARCHAR(90)
, order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
-- look up table with product brand info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en VARCHAR(90) NOT NULL
, product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90)
, product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90)
, product_brand_da VARCHAR(90), product_brand_el VARCHAR(90)
, product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90)
, product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90)
, product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90)
, product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90)
, product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90)
, product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90)
, product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90)
, product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
-- product dimension table
CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
-- look up table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- look up table for products
CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number INT NOT NULL
, product_language VARCHAR(30) NOT NULL
, product_name VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
-- fact table for marketing promotions
CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
( organization_key INT NOT NULL
, order_day_key INT NOT NULL
, rtl_country_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, sales_order_key INT NOT NULL
, quantity SMALLINT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Let’s briefly explore some aspects of the CREATE TABLE statements shown here. If
you have a SQL background, the majority of these statements should be familiar to
you. However, after the column specification, there are some additional clauses
unique to Big SQL – clauses that enable it to exploit Hadoop storage mechanisms (in
this case, Hive). The ROW FORMAT clause specifies that fields are to be terminated
by tabs ("\t") and lines are to be terminated by newline characters ("\n"). The table
will be stored in a TEXTFILE format, making it easy for a wide range of applications to
work with. For details on these clauses, refer to the Apache Hive documentation.
__3. Load data into each of these tables using sample data provided in files. One at a time, issue
each of the following LOAD statements and verify that each completed successfully. Remember
to change the file path shown (if needed) to the appropriate path for your environment. The
statements will return a warning message providing details on the number of rows loaded, etc.
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE GO_REGION_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_ORDER_METHOD_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_SALES_FACT overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MRK_PROMOTION_FACT overwrite;
Let’s explore the LOAD syntax shown in these examples briefly. The first line of each
example loads data into your table using a file URL specification and then specifies the
full path to the data source file on your local file system. Note that the path is local to the
Big SQL server (not your Eclipse client). The WITH SOURCE PROPERTIES clause
specifies that fields in the source data are delimited by tabs ("\t"). The INTO TABLE
clause identifies the target table for the LOAD operation. The OVERWRITE keyword
indicates that any existing data in the table will be replaced by data contained in the
source file. (If you wanted to simply add rows to the table’s content, you could specify
APPEND instead.)
Note that loading data from a local file is only one of several available options. You can
also load data using FTP or SFTP. This is particularly handy for loading data from
remote file systems, although you can practice using it against your local file system, too.
For example, the following statement for loading data into the
GOSALESDW.GO_REGION_DIM table using SFTP is equivalent to the syntax shown
earlier for loading data into this table from a local file:
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/opt/ibm/biginsights/bigsql/
samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE
gosalesdw.GO_REGION_DIM overwrite;
Big SQL supports other LOAD options, including loading data directly from a remote
relational DBMS via a JDBC connection. See the product documentation for details.
__4. Query the tables to verify that the expected number of rows was loaded into each table. Execute
each query that follows individually and compare the results with the number of rows specified in
the comment line preceding each query.
-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;
-- total rows in sls_order_method_dim = 7
select count(*) from sls_order_method_dim;
-- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
select count(*) from SLS_PRODUCT_BRAND_LOOKUP;
-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from SLS_PRODUCT_DIM;
-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from SLS_PRODUCT_LINE_LOOKUP;
-- total rows in SLS_PRODUCT_LOOKUP = 6302
select count(*) from SLS_PRODUCT_LOOKUP;
-- total rows in SLS_SALES_FACT = 446023
select count(*) from SLS_SALES_FACT;
-- total rows in MRK_PROMOTION_FACT = 11034
select count(*) from MRK_PROMOTION_FACT;
5.3. Querying tables with joins, aggregations and more
Now you're ready to query your tables. Based on earlier exercises, you've already seen that you can
perform basic SQL operations, including projections (to extract specific columns from your tables) and
restrictions (to extract specific rows meeting certain conditions you specified). Let's explore a few
examples that are a bit more sophisticated.
In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as
perform aggregations and other SQL operations. Note that the queries included in this section are based
on queries shipped with BigInsights as samples. Some of these queries return hundreds of thousands of
rows; however, the Eclipse SQL Results page limits output to only 500 rows. Although you can change
that value in the Data Management preferences section, retain the default setting for this lab.
__1. Join data from multiple tables to return the product name, quantity and order method of goods
that have been sold. To do so, execute the following query.
-- Fetch the product name, quantity, and order method
-- of products sold.
-- Query 1
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key;
Let’s review a few aspects of this query briefly:
• Data from four tables drives the results of this query (see the tables referenced in
the FROM clause). Relationships between these tables are resolved through three join
predicates specified as part of the WHERE clause; each is an equi-join. (Predicates
such as prod.product_number=pnumb.product_number narrow the results to rows whose
product numbers match in both tables.)
• For improved readability, this query uses aliases in the SELECT and FROM clauses when
referencing tables. For example, pnumb.product_name refers to “pnumb,” which is the alias for
the gosalesdw.sls_product_lookup table. Once defined in the FROM clause, an alias can be used
in the WHERE clause so that you do not need to repeat the complete table name.
• The predicate pnumb.product_language='EN' helps to further narrow the results to
English output only. This database contains thousands of rows of data in various languages, so
restricting the language provides some optimization.
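Incidentally, Big SQL also accepts standard explicit join syntax. The following
equivalent form of Query 1 (not part of the lab steps) makes the join predicates
easier to spot:
-- Query 1 rewritten with explicit JOIN ... ON clauses; results are identical
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales
JOIN sls_product_dim prod
ON sales.product_key = prod.product_key
JOIN sls_product_lookup pnumb
ON prod.product_number = pnumb.product_number
JOIN sls_order_method_dim meth
ON meth.order_method_key = sales.order_method_key
WHERE pnumb.product_language = 'EN';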
__2. Modify the query to restrict the order method to one type – those involving a Sales visit. To
do so, add the following query predicate just before the semi-colon:
AND order_method_en='Sales visit'
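For reference, the complete modified query looks like this:
-- Query 1 with the order method restricted to 'Sales visit'
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
AND order_method_en='Sales visit';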
__3. Inspect the results, a subset of which is shown below:
__4. To find out which order method accounts for the greatest quantity of orders, add a
GROUP BY clause (group by pll.product_line_en, md.order_method_en) and
invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and
method. Finally, the query cleans up the output a bit by using column aliases (e.g., AS Product)
to provide more readable column headers.
-- Query 3
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;
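Optionally, to make the highest-volume combinations appear first, you can append an
ORDER BY clause to the same query. This minor variation (not required by the lab)
returns the same rows, ranked by total quantity:
-- Optional variation of Query 3: rank results by total quantity
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en
ORDER BY total DESC;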
__5. Inspect the results, which should contain 35 rows. A portion is shown below.
5.4. Optional: Using SerDes for non-traditional data
While comma- and tab-separated data files (CSV and TSV) are often stored in BigInsights and
loaded into Big SQL tables, you may also need to work with other types of data – data that might
require a serializer / deserializer (SerDe). SerDes are common in the Hadoop environment. You'll
find a number of SerDes available in the public domain, or you can write your own following
typical Hadoop practices.
Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need,
just add its JAR file to the appropriate BigInsights subdirectories. Then stop and restart the Big SQL
service, and specify the SerDe class name when you create your table.
In this lab exercise, you will use a SerDe to define a table for JSON-based blog data. The sample blog
file for this exercise is the same blog file you used as input to BigSheets in a prior lab.
__1. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system,
such as /home/biadmin/sampleData. (As of this writing, the full URL for this SerDe is
https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar)
__2. Register the SerDe with BigInsights.
__a. Stop the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/stop.sh bigsql
__b. Copy the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib
directories.
__c. Restart the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/start.sh bigsql
Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that
relies on the SerDe you just registered. For simplicity, this will be an externally managed table – i.e., a
table created over a user directory that resides outside of the Hive warehouse. This user directory will
contain the table's data in files. As part of this exercise, you will upload the sample blogs-data.txt file into
the target DFS directory.
Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the
data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you
created the table and consider all files in that directory to be the table’s contents. This is consistent with
the Hive concept of an externally managed table.
Once the table is created, you'll query that table. In doing so, you'll note that the presence of a SerDe is
transparent to your queries.
__3. If necessary, download the .zip file containing the sample data from the bottom half of the article
referenced in the introduction. Unzip the file into a directory on your local file system, such as
/home/biadmin. You will be working with the blogs-data.txt file.
From the Files tab of the Web console, navigate to the /user/biadmin/sampleData directory
of your distributed file system. Use the create directory button to create a subdirectory named
SerDe-Test.
__4. Upload the blogs-data.txt file into /user/biadmin/sampleData/SerDe-Test.
__5. Return to the Big SQL execution environment of your choice (JSqsh or Eclipse).
__6. Execute the following statement, which creates a TESTBLOGS table that includes a LOCATION
clause that specifies the DFS directory containing your sample blogs-data.txt file:
create hadoop table if not exists testblogs (
Country String,
Crawled String,
FeedInfo String,
Inserted String,
IsAdult int,
Language String,
Postsize int,
Published String,
SubjectHtml String,
Tags String,
Type String,
Url String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/user/biadmin/sampleData/SerDe-Test';
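Once the table exists, you can query it like any other Big SQL table; the SerDe parses
each JSON record behind the scenes. Here is a simple illustrative query (the column
choice and row limit are arbitrary, not lab requirements):
-- Sample query over the SerDe-backed table; the SerDe is transparent here
select country, published, url
from testblogs
fetch first 10 rows only;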
5.5. Optional: Developing a JDBC client application with Big SQL
You can write a JDBC client application that uses Big SQL to open a database connection, execute
queries, and process the results. In this optional exercise, you'll see how writing a client JDBC
application for Big SQL is like writing a client application for any relational DBMS that supports JDBC
access.
__1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File >
New >Project. From the New Project window, select Java Project. Click Next.
__2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.
__3. Open the Libraries tab and click Add External Jars. Add the DB2 JDBC driver for BigInsights,
located at /opt/ibm/biginsights/database/db2/java/db2jcc4.jar.
__4. Click Finish. Click Yes when you are asked if you want to open the Java perspective.
__5. Right-click the MyJavaProject project, and click New > Package. In the Name field, in the New
Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.
__6. Right-click the aJavaPackage4me package, and click New > Class.
__7. In the New Java Class window, in the Name field, type SampApp. Select the public static void
main(String[] args) check box. Click Finish.
__8. Replace the default code for this class by copying or typing the following code into the
SampApp.java file. (You'll also find this code in
/opt/ibm/biginsights/bigsql/samples/data/SampApp.java.)
package aJavaPackage4me;
//a. Import required package(s)
import java.sql.*;
public class SampApp {
/**
* @param args
*/
//b. set JDBC & database info
//change these as needed for your environment
static final String db = "jdbc:db2://YOUR_HOST_NAME:51000/bigsql";
static final String user = "YOUR_USER_ID";
static final String pwd = "YOUR_PASSWORD";
public static void main(String[] args) {
Connection conn = null;
Statement stmt = null;
System.out.println("Started sample JDBC application.");
try{
//c. Register JDBC driver -- not needed for DB2 JDBC type 4 connection
// Class.forName("com.ibm.db2.jcc.DB2Driver");
//d. Get a connection
conn = DriverManager.getConnection(db, user, pwd);
System.out.println("Connected to the database.");
//e. Execute a query
stmt = conn.createStatement();
System.out.println("Created a statement.");
String sql;
sql = "select product_color_code, product_number from sls_product_dim " +
"where product_key=30001";
ResultSet rs = stmt.executeQuery(sql);
System.out.println("Executed a query.");
//f. Obtain results
System.out.println("Result set: ");
while(rs.next()){
//Retrieve by column name
int product_color = rs.getInt("PRODUCT_COLOR_CODE");
int product_number = rs.getInt("PRODUCT_NUMBER");
//Display values
System.out.print("* Product Color: " + product_color + "n");
System.out.print("* Product Number: " + product_number + "n");
}
//g. Close open resources
rs.close();
stmt.close();
conn.close();
}catch(SQLException sqlE){
// Process SQL errors
sqlE.printStackTrace();
}catch(Exception e){
// Process other errors
e.printStackTrace();
}
finally{
// Ensure resources are closed before exiting
try{
if(stmt!=null)
stmt.close();
}catch(SQLException sqle2){
} // nothing we can do
try{
if(conn!=null)
conn.close();
}
catch(SQLException sqlE){
sqlE.printStackTrace();
}// end finally block
}// end try block
System.out.println("Application complete");
}}
__a. After the package declaration, ensure that you include the packages that contain
the JDBC classes that are needed for database programming (import java.sql.*;).
__b. Set up the database information so that you can refer to it. Be sure to change
the user ID, password, and connection information as needed for your environment.
__c. Optionally, register the JDBC driver. The class name is provided here for your
reference. When using the DB2 Type 4.0 JDBC driver, it’s not necessary to specify the
class name.
__d. Open the connection.
__e. Run a query by submitting an SQL statement to the database.
__f. Extract data from the result set.
__g. Clean up the environment by closing all of the database resources.
__9. Save the file and right-click the Java file and click Run > Run as > Java Application.
__10. The results appear in the Console view of Eclipse:
Started sample JDBC application.
Connected to the database.
Created a statement.
Executed a query.
Result set:
* Product Color: 908
* Product Number: 1110
Application complete
Lab 6 Summary
In this lab, you gained hands-on experience using many popular capabilities of InfoSphere BigInsights,
IBM's Hadoop-based platform for analyzing big data. You explored your BigInsights cluster using a
Web-based console and manipulated social media data using a spreadsheet-style interface. You also
created Big SQL tables for your data and executed several complex queries over this data.
To expand your skills even further, visit the HadoopDev web site (https://developer.ibm.com/hadoop/)
for links to free online courses, tutorials, and more.
Now you're ready to get started using BigInsights for your own projects. What will you do with big
data?
© Copyright IBM Corporation 2014.
The information contained in these materials is provided for
informational purposes only, and is provided AS IS without warranty
of any kind, express or implied. IBM shall not be responsible for any
damages arising out of the use of, or otherwise related to, these
materials. Nothing contained in these materials is intended to, nor
shall have the effect of, creating any warranties or representations
from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of
IBM software. References in these materials to IBM products,
programs, or services do not imply that they will be available in all
countries in which IBM operates. This information is based on
current IBM product plans and strategy, which are subject to change
by IBM without notice. Product release dates and/or capabilities
referenced in these materials may change at any time at IBM’s sole
discretion based on market opportunities or other factors, and are not
intended to be a commitment to future product or feature availability
in any way.
IBM, the IBM logo and ibm.com are trademarks of International
Business Machines Corp., registered in many jurisdictions
worldwide. Other product and service names might be trademarks of
IBM or other companies. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.

More Related Content

What's hot

Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data:  Querying complex JSON data with BigInsights and HadoopBig Data:  Querying complex JSON data with BigInsights and Hadoop
Big Data: Querying complex JSON data with BigInsights and HadoopCynthia Saracco
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab Cynthia Saracco
 
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Nicolas Morales
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopWilfried Hoge
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Cynthia Saracco
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Nicolas Morales
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on HadoopSenturus
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudLeons Petražickis
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperScott Gray
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdIBM Analytics
 
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...Stuart Moore
 

What's hot (16)

Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data:  Querying complex JSON data with BigInsights and HadoopBig Data:  Querying complex JSON data with BigInsights and Hadoop
Big Data: Querying complex JSON data with BigInsights and Hadoop
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab
 
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_Whitepaper
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
 

Viewers also liked

Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsightsCynthia Saracco
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase Cynthia Saracco
 
131023 instrumentos-de-los-parques
131023 instrumentos-de-los-parques131023 instrumentos-de-los-parques
131023 instrumentos-de-los-parquesÀlex Brossa Enrique
 
My Background Thierry Verlhiac
My Background Thierry VerlhiacMy Background Thierry Verlhiac
My Background Thierry Verlhiacverlhiac
 
Boletin Degremont octubre 2012
Boletin Degremont octubre 2012Boletin Degremont octubre 2012
Boletin Degremont octubre 2012slidesharedgt
 
EBI Presentation 2011
EBI Presentation 2011 EBI Presentation 2011
EBI Presentation 2011 Rod Kimber
 
Manual de usuario firma de documentos excel 2010
Manual de usuario firma de documentos excel 2010Manual de usuario firma de documentos excel 2010
Manual de usuario firma de documentos excel 2010Security Data
 
Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...
Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...
Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...ebooker97
 
Id digital y seguridad en la red presentacion
Id digital y seguridad en la red  presentacionId digital y seguridad en la red  presentacion
Id digital y seguridad en la red presentacionHebe Gargiulo
 
Software educativo. sebran abc
Software educativo. sebran abcSoftware educativo. sebran abc
Software educativo. sebran abcFabiana Suárez
 
Catálogo de cursos de tantra. Agosto y septiembre 2016
Catálogo de cursos de tantra. Agosto y septiembre 2016Catálogo de cursos de tantra. Agosto y septiembre 2016
Catálogo de cursos de tantra. Agosto y septiembre 2016Tantra y Amor Consciente
 
Productos tradicionales y denominaciones consagradas por el uso
Productos tradicionales y denominaciones consagradas por el usoProductos tradicionales y denominaciones consagradas por el uso
Productos tradicionales y denominaciones consagradas por el usoainia centro tecnológico
 

Viewers also liked (17)

Big Insights v4.1
Big Insights v4.1Big Insights v4.1
Big Insights v4.1
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase
 
Biodanza internacional
Biodanza internacionalBiodanza internacional
Biodanza internacional
 
131023 instrumentos-de-los-parques
131023 instrumentos-de-los-parques131023 instrumentos-de-los-parques
131023 instrumentos-de-los-parques
 
Lil wayne
Lil wayneLil wayne
Lil wayne
 
My Background Thierry Verlhiac
My Background Thierry VerlhiacMy Background Thierry Verlhiac
My Background Thierry Verlhiac
 
Boletin Degremont octubre 2012
Boletin Degremont octubre 2012Boletin Degremont octubre 2012
Boletin Degremont octubre 2012
 
EBI Presentation 2011
EBI Presentation 2011 EBI Presentation 2011
EBI Presentation 2011
 
Vicari Group unleashed
Vicari Group unleashedVicari Group unleashed
Vicari Group unleashed
 
Manual de usuario firma de documentos excel 2010
Manual de usuario firma de documentos excel 2010Manual de usuario firma de documentos excel 2010
Manual de usuario firma de documentos excel 2010
 
Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...
Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...
Massive Social Bookmarking Checklist Regarding Search Engine Optimisation Alo...
 
Id digital y seguridad en la red presentacion
Id digital y seguridad en la red  presentacionId digital y seguridad en la red  presentacion
Id digital y seguridad en la red presentacion
 
Software educativo. sebran abc
Software educativo. sebran abcSoftware educativo. sebran abc
Software educativo. sebran abc
 
Catálogo de cursos de tantra. Agosto y septiembre 2016
Catálogo de cursos de tantra. Agosto y septiembre 2016Catálogo de cursos de tantra. Agosto y septiembre 2016
Catálogo de cursos de tantra. Agosto y septiembre 2016
 
Productos tradicionales y denominaciones consagradas por el uso
Productos tradicionales y denominaciones consagradas por el usoProductos tradicionales y denominaciones consagradas por el uso
Productos tradicionales y denominaciones consagradas por el uso
 

Similar to Big Data: Explore Hadoop and BigInsights self-study lab

IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...Leons Petražickis
 
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLHands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLPiotr Pruski
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopSaurav Sinha
 
Ibm hadoop info sphere biginsights install
Ibm hadoop info sphere biginsights installIbm hadoop info sphere biginsights install
Ibm hadoop info sphere biginsights installDarnette A
 
HDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite ActivityHDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite ActivityIdan Tohami
 
Diff sand box and farm
Diff sand box and farmDiff sand box and farm
Diff sand box and farmRajkiran Swain
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul Divyanshu
 
Installating and Configuring Java, MySQL and BIRT.
Installating and Configuring Java, MySQL and BIRT.Installating and Configuring Java, MySQL and BIRT.
Installating and Configuring Java, MySQL and BIRT.NR Computer Learning Center
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialemedin
 
Hortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on AzureHortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on AzureAnita Luthra
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migrationAmit Sharma
 
LuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDaysLuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDaysLuis Rodríguez Castromil
 
Expanding XPages with Bootstrap Plugins for Ultimate Usability
Expanding XPages with Bootstrap Plugins for Ultimate UsabilityExpanding XPages with Bootstrap Plugins for Ultimate Usability
Expanding XPages with Bootstrap Plugins for Ultimate UsabilityTeamstudio
 
Hadoop and Mapreduce Certification
Hadoop and Mapreduce CertificationHadoop and Mapreduce Certification
Hadoop and Mapreduce CertificationVskills
 
My First Hadoop Program !!!
My First Hadoop Program !!!My First Hadoop Program !!!
My First Hadoop Program !!!Ayapparaj SKS
 
Drupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - DeployDrupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - DeployJohn Smith
 
Free ERP 2BizBox Quick Start Tutorial
Free ERP 2BizBox Quick Start TutorialFree ERP 2BizBox Quick Start Tutorial
Free ERP 2BizBox Quick Start Tutorial253725291
 

Similar to Big Data: Explore Hadoop and BigInsights self-study lab (20)

IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
 
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLHands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Ibm hadoop info sphere biginsights install
Ibm hadoop info sphere biginsights installIbm hadoop info sphere biginsights install
Ibm hadoop info sphere biginsights install
 
HDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite ActivityHDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite Activity
 
Diff sand box and farm
Diff sand box and farmDiff sand box and farm
Diff sand box and farm
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
 
Installating and Configuring Java, MySQL and BIRT.
Installating and Configuring Java, MySQL and BIRT.Installating and Configuring Java, MySQL and BIRT.
Installating and Configuring Java, MySQL and BIRT.
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
hbase lab
hbase labhbase lab
hbase lab
 
Hortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on AzureHortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on Azure
 
Wordpress as a framework
Wordpress as a frameworkWordpress as a framework
Wordpress as a framework
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migration
 
LuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDaysLuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDays
 
Expanding XPages with Bootstrap Plugins for Ultimate Usability
Expanding XPages with Bootstrap Plugins for Ultimate UsabilityExpanding XPages with Bootstrap Plugins for Ultimate Usability
Expanding XPages with Bootstrap Plugins for Ultimate Usability
 
Final White Paper_
Final White Paper_Final White Paper_
Final White Paper_
 
Hadoop and Mapreduce Certification
Hadoop and Mapreduce CertificationHadoop and Mapreduce Certification
Hadoop and Mapreduce Certification
 
My First Hadoop Program !!!
My First Hadoop Program !!!My First Hadoop Program !!!
My First Hadoop Program !!!
 
Drupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - DeployDrupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - Deploy
 
Free ERP 2BizBox Quick Start Tutorial
Free ERP 2BizBox Quick Start TutorialFree ERP 2BizBox Quick Start Tutorial
Free ERP 2BizBox Quick Start Tutorial
 

Recently uploaded

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Big Data: Explore Hadoop and BigInsights self-study lab

  • 1. Explore Big Data with Hadoop and InfoSphere BigInsights Cynthia M. Saracco (saracco@us.ibm.com) August 15, 2014
  • 2. IBM Software Page 2 Explore Hadoop and BigInsights Contents LAB 1 OVERVIEW......................................................................................................................................................... 3 1.1. ABOUT YOUR ENVIRONMENT ..................................................................................................................... 3 1.2. GETTING STARTED ................................................................................................................................... 4 LAB 2 ISSUING BASIC HADOOP COMMANDS .......................................................................................................... 8 2.1. CREATING A DIRECTORY IN YOUR DISTRIBUTED FILE SYSTEM......................................................................... 8 2.2. COPYING DATA INTO HDFS ...................................................................................................................... 8 2.3. RUNNING A SAMPLE MAPREDUCE APPLICATION........................................................................................... 9 LAB 3 EXPLORING AND ADMINISTERING YOUR CLUSTER WITH THE BIGINSIGHTS WEB CONSOLE ........... 13 3.1. GETTING STARTED WITH THE WEB CONSOLE ............................................................................................ 13 3.2. ADMINISTERING BIGINSIGHTS.................................................................................................................. 14 3.3. WORKING WITH THE DISTRIBUTED FILE SYSTEM (HDFS) ............................................................................ 17 3.4. MANAGING AND LAUNCHING PRE-BUILT APPLICATIONS FROM THE WEB CATALOG .......................................... 22 LAB 4 ANALYZING SOCIAL MEDIA DATA WITH BIGSHEETS ............................................................................... 28 4.1. CREATING A WORKBOOK......................................................................................................................... 28 4.2. ANALYZING AND CUSTOMIZING YOUR WORKBOOK ...................................................................................... 30 4.3. CREATING CHARTS................................................................................................................................. 38 4.4. CREATING A BIG SQL TABLE BASED ON YOUR WORKBOOK ......................................................................... 42 4.5. OPTIONAL: EXPORTING YOUR WORKBOOK DATA ....................................................................................... 45 LAB 5 QUERYING DATA WITH BIG SQL .................................................................................................................. 48 5.1. CREATING A PROJECT AND EXECUTING BIG SQL STATEMENTS ................................................................... 48 5.2. CREATING SAMPLE TABLES AND LOADING SAMPLE DATA ............................................................................. 53 5.3. QUERYING TABLES WITH JOINS, AGGREGATIONS AND MORE ........................................................................ 60 5.4. OPTIONAL: USING SERDES FOR NON-TRADITIONAL DATA .......................................................................... 63 5.5. OPTIONAL: DEVELOPING A JDBC CLIENT APPLICATION WITH BIG SQL ....................................................... 65 LAB 6 SUMMARY ....................................................................................................................................................... 70
  • 3. Hands On Lab Page 3 Lab 1 Overview In this hands-on lab, you'll learn how to work with Big Data using Apache Hadoop and InfoSphere BigInsights, IBM's Hadoop-based platform. In particular, you'll learn the basics of working with the Hadoop Distributed File System (HDFS) and see how to administer your Hadoop-based environment using the BigInsights Web console. After launching a sample MapReduce application, you'll explore a more sophisticated scenario involving social media data. In doing so, you'll learn how to use a spreadsheet-style interface to discover insights about the global coverage of a popular brand without writing any code. Finally, you'll learn how to apply industry standard SQL to data managed by BigInsights through IBM's Big SQL technology. Indeed, you'll have a chance to create tables and execute complex queries over data in HDFS, including data derived from a relational data warehouse. Ready to get started? After completing this hands-on lab, you’ll be able to: • Work directly with Apache Hadoop through file system commands • Inspect and administer your cluster through the BigInsights Web Console • Explore big data using a spreadsheet-style tool • Use Big SQL to create tables and issue complex queries Allow 2 ½ - 3 hours to complete this lab. This lab was developed by Cynthia M. Saracco, IBM Silicon Valley Lab. Please post questions or comments about this lab or the technologies it describes to the forum on Hadoop Dev at https://developer.ibm.com/hadoop/. 1.1. About your environment This lab was developed for the InfoSphere BigInsights 3.0 Quick Start Edition VMware image. If necessary, download and install the single-node cluster VMware image from this site: http://www- 01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html The VMware image is set up in the following manner: User Password VM Image root account root password VM Image lab user account biadmin biadmin BigInsights Administrator biadmin biadmin Big SQL Administrator bigsql bigsql Lab user biadmin biadmin
  • 4. IBM Software Page 4 Explore Hadoop and BigInsights Property Value Host name bivm.ibm.com BigInsights Web Console URL http://bivm.ibm.com:8080 Big SQL database name bigsql Big SQL port number 51000 . About the screen captures, sample code, and environment configuration Screen captures in this lab depict examples and results that may vary from what you see when you complete the exercises. In addition, some code examples may need to be customized to match your environment. For example, you may need to alter directory path information or user ID information. 1.2. Getting started To get started with the lab exercises, you need to install and launch the VMware image as well as start the required services. __1. If necessary, obtain a copy of the BigInsights 3.0 Quick Start Edition VMware image from IBM's external download site (http://www-01.ibm.com/software/data/infosphere/biginsights/quick- start/downloads.html). Use the image for the single-node cluster. __2. Follow the instructions provided to decompress (unzip) the file and install the image on your laptop. Note that there is a README file with additional information. __3. If necessary, install VMware player or other required software to run VMware images. Details are in the README file provided with the BigInsights VMware image. __4. Launch the VMware image. When logging in for the first time, use the root ID (with a password of password). Follow the instructions to configure your environment, accept the licensing agreement, and enter the passwords for the root and biadmin IDs (root/password and biadmin/biadmin) when prompted. This is a one-time only requirement.
  • 5. Hands On Lab Page 5 __5. When the one-time configuration process is completed, you will be presented with a SUSE Linux log in screen. Log in as biadmin with a password of biadmin.
  • 6. IBM Software Page 6 Explore Hadoop and BigInsights __6. Verify that your screen appears similar to this: __7. Click Start BigInsights to start all required services. (Alternatively, you can open a terminal window and issue this command: $BIGINSIGHTS_HOME/bin/start-all.sh) Wait until the operation completes. This may take several minutes, depending on your machine's resources.
  • 7. Hands On Lab Page 7 __8. Verify that all required BigInsights services are up and running. From a terminal window, issue this command: $BIGINSIGHTS_HOME/bin/status.sh. __9. Inspect the results, a subset of which are shown below. Verify that, at a minimum, the following components started successfully: hdm, zookeeper, hadoop, catalog, hive, bigsql, oozie, console, and httpfs. Now you're ready to start working with big data! If have any questions or need help getting your environment up and running, visit Hadoop Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to the forum. You cannot proceed with subsequent lab exercises until you've logged into the VMware image and launched the necessary BigInsights services.
  • 8. IBM Software Page 8 Explore Hadoop and BigInsights Lab 2 Issuing basic Hadoop commands In this exercise, you’ll work directly with Apache Hadoop to perform some basic tasks involving the Hadoop Distributed File System (HDFS) and launching a sample application. All the work you’ll perform here involves commands and interfaces provided with Hadoop from http://hadoop.apache.org. As mentioned earlier, Hadoop is part of IBM’s InfoSphere BigInsights platform. Allow 15 minutes to complete this lab module. 2.1. Creating a directory in your distributed file system __1. Click the BigInsights Shell icon. __2. Select the Terminal icon to open a terminal window. __3. Execute the following Hadoop file system command to create a directory in HDFS for your work: hadoop fs -mkdir /user/biadmin/test Note that HDFS is distinct from your Unix/Linux local file system directory, and working with HDFS requires using hadoop fs commands. 2.2. Copying data into HDFS __1. Using standard Unix/Linux file system commands, list the contents of the /home/biadmin/licenses directory. ls /home/biadmin/licenses Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a sample data file for a future exercise. __2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS. hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test
  • 9. Hands On Lab Page 9 __3. List the contents of your target HDFS directory to verify that the file was successfully copied. hadoop fs -ls /user/biadmin/test 2.3. Running a sample MapReduce application WordCount is one of several sample MapReduce applications provided for Apache Hadoop. Written in Java, it simply scans through input document(s) and, for each word, returns the total number of occurrences found. You can read more about WordCount on the Apache wiki (http://wiki.apache.org/hadoop/WordCount). Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you'll explore how to do that with WordCount. __1. Execute the following command to launch the sample WordCount application provided with your Hadoop distribution. hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount /user/biadmin/test WordCount_output This command specifies that the wordcount application contained in the specified .jar file is to be launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This directory will be created automatically as a result of executing this application. NOTE: If the output folder already exists or if you try to rerun a successful MapReduce job with the same parameters, you will receive an error message. This is the default behavior of the sample WordCount application.
  • 10. IBM Software Page 10 Explore Hadoop and BigInsights __2. Inspect the output of your job. hadoop fs -ls WordCount_output In this case, the output was small and contained written to a single file. If you had run WordCount against a larger volume of data, its output would have been split into multiple files (e.g., part-r-00001, part-r-00002, and so on). __3. To view the contents of part-r-0000 file, issue this command: hadoop fs -cat WordCount_output/*00 Partial output is shown here:
  • 11. Hands On Lab Page 11 __4. Optionally, inspect details about your job. Open a Web browser, or click on the web console icon on your desktop and open a new tab. Access the URL for Hadoop's Job Tracker (http://bivm.ibm.com:50030/jobtracker.jsp). Scroll to the Completed Jobs section to locate the Job ID associated with the Word Count application. Click on the Job ID link to review details, such as the number of Map and Reduce tasks launched for your application, the number of bytes read and written, etc. Partial output is shown in the second image that follows.
  • 12. IBM Software Page 12 Explore Hadoop and BigInsights
  • 13. Hands On Lab Page 13 Lab 3 Exploring and administering your cluster with the BigInsights Web console As you saw in the previous lab, Apache Hadoop users typically work through a command line interface to perform many common tasks. This lab introduces you to the BigInsights Web console, which enables you to administer your cluster, work with HDFS, launch jobs, and perform many other tasks using a graphical interface. After completing this hands-on lab, you’ll be able to: • Launch the Web console. • Work with popular resources accessible through the Welcome page. • Administer BigInsights by inspecting the status of your cluster and accessing tools for open source components provided with BigInsights. • Work with the distributed file system. In particular, you'll explore the HDFS directory structure, create subdirectories, and upload files to HDFS. • Manage and launch pre-built applications from a Web catalog. • Inspect the status of previously launched applications (jobs) and review their output. Allow 30 minutes to complete this section of lab. This lab is an introduction to a subset of console functions. Real-time monitoring, dashboards, alerts, and application linking are among the more advanced console functions that are beyond this lab's scope. 3.1. Getting started with the Web Console In this exercise, you will launch the console and inspect its Welcome page. __1. Launch the BigInsights Web console. Direct your browser to http://bivm.ibm.com:8080 or click the Web Console icon on your desktop. __2. Log in with your user name and password (biadmin / biadmin).
  • 14. IBM Software Page 14 Explore Hadoop and BigInsights __3. Verify that your Web console appears similar to this: __4. Briefly skim through the links provided in these sections to become familiar with resources available to you: Tasks: Quick access to popular BigInsights tasks Quick Links: Links to internal and external quick links and downloads to enhance your environment Learn More: Online resources available to learn more about BigInsights 3.2. Administering BigInsights The Web console allows administrators to inspect the overall health of the system as well as perform basic functions, such as starting and stopping specific servers or components, adding nodes to the cluster, and so on. You’ll explore a subset of these capabilities here.
  • 15. Hands On Lab Page 15 __5. Click on the Cluster Status tab at the top of the page. __6. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster that had several services running. One service – Monitoring -- was unavailable. Your display may differ somewhat. It’s not necessary for all BigInsights services to be running to complete the exercises in this lab. __7. Click on the Hive service and note the detailed information provided for this service in the pane at right. For example, you can see the URL for Hive's Web interface and its process ID. In addition, note that you can start and stop services (such as the Hive service) from the Cluster Status page of the console.
  • 16. IBM Software Page 16 Explore Hadoop and BigInsights __8. Optionally, cut-and-paste the URL for Hive’s Web interface into a new tab of your browser. You'll see an open source tool provided with Hive for administration purposes, as shown below. Other open source tools provided with Apache Hadoop are also available through IBM's packaged distribution (BigInsights), as you'll see shortly. Close this browser tab. __9. Click on the Welcome page of your Web console. __10. Click on the Access secure cluster servers button in the Quick Links section at right. If nothing appears, verify that the pop-up blocker of your browser is disabled; a prompt should appear at the top of the page if pop-ups are blocked. __11. Inspect the list of server components for which there are additional Web-based tools. The BigInsights console displays the URLs you can use to access each of these Web sites directly. (This information will only appear if the pop-up blocker is disabled in your browser.) __12. Click on the jobtracker alias. The display should be familiar to you -- it's the same one you saw in the previous lab that introduced you to some basic Hadoop facilities.
  • 17. Hands On Lab Page 17 3.3. Working with the distributed file system (HDFS) In this section, you'll learn how to use the Web console to create directories in HDFS, navigate the file system, and upload small files -- tasks you performed earlier through a command-line interface. In addition, you'll perform a few other file-related tasks as well. Many people find the console's graphical interface to be easier to use than the command-line interface. __1. Click on the Files tab at the top of the page. __2. Expand the DFS directory tree in the left pane to display the contents of /user/biadmin. Note the presence of the WordCount_output and test subdirectories, which you created in an earlier lab. If desired, expand each directory and inspect its contents.
  • 18. IBM Software Page 18 Explore Hadoop and BigInsights __3. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of these in subsequent sections of this module. Simply position your cursor on each icon to learn its function. From left to right, the icons enable you to copy a file or directory, move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file from HDFS to your local file system, remove a file or directory from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page. __4. Delete the /user/biadmin/test directory and its contents. Position your cursor on this directory, click the red X icon, and click Yes when prompted.
  • 19. Hands On Lab Page 19 __5. Create a new subdirectory in /user/biadmin. With your cursor positioned on /user/biadmin, click the create directory icon. __6. When a pop-up window appears, specify test2 as the new directory's name and click OK.
  • 20. IBM Software Page 20 Explore Hadoop and BigInsights __7. Expand the directory hierarchy to verify that your new subdirectory was created. __8. Upload a file into this directory from your local file system. Click the upload icon.
  • 21. Hands On Lab Page 21 __9. When a pop-up window appears, click the Browse button to navigate through your local file system to /home/biadmin/licenses. Select the BIlicense_en.txt file and click Open. __10. Expand the /user/biadmin/test2 directory and verify that the BIlicense_en.txt file was successfully copied into HDFS. Note that the right pane of the Web console previews the file's contents.
  • 22. IBM Software Page 22 Explore Hadoop and BigInsights 3.4. Managing and launching pre-built applications from the Web catalog The Web console includes a catalog of ready-made applications that users can launch through a graphical interface. Each application's status, execution history, and output are easy to monitor from this page as well. In this exercise, you'll first manage the catalog’s contents, selecting one of more than 20 pre-built applications provided with BigInsights to deploy on your cluster. Once deployed, the application will be visible to all authorized users. You'll then launch the application, monitor its execution status, and inspect its output. As you might have guessed, the sample application used in this lab is Word Count -- the same application you ran from a command line earlier. __1. Click the Applications tab of the Web console. No applications are deployed on a new cluster, so there won't be much to see yet. __2. In the upper left corner, click Manage. A list of applications available for deployment is displayed.
  • 23. Hands On Lab Page 23 __3. Expand the Test category and click on the Word Count application. __4. Click Deploy. __5. When a pop-up window appears, accept the defaults for all settings and click Deploy.
  • 24. IBM Software Page 24 Explore Hadoop and BigInsights __6. After the application has been deployed, you're ready to run it. Click Run in the upper left pane. __7. Verify that the Word Count application appears in the catalog. (Any other applications that were previously deployed to the Web catalog will also appear.)
  • 25. Hands On Lab Page 25 __8. Click on the Word Count icon. The pane at right prompts you to enter appropriate information. For this application, you need to specify an execution name for your application's run, the HDFS directory containing the input document(s) for the Word Count application, and an output directory in HDFS. __9. For the Execution name, enter My Test Run 1. __10. For the Input path, click Browse and navigate to /user/biadmin/test2. Click OK. __11. For the Output path, type /user/biadmin/WordCount_console_output. (Recall that the Word Count application creates this output directory at run time. If you specify an existing HDFS directory for the output, the application will fail.) __12. Verify that your display appears similar to this and click Run.
  • 26. IBM Software Page 26 Explore Hadoop and BigInsights __13. As your application executes, monitor its status through the Application History pane at lower right. __14. When the application completes successfully, click the link provided in the Output column to see the application's output. __15. Optionally, return to the Applications page of the console and click on the link provided in the Details column for your application's run.
  • 27. Hands On Lab Page 27 __16. Note that the console displays the Application Status page, which contains information about the Oozie workflow for your application as well as the application itself. If desired, click on one or more available links to explore details available for your review.
  • 28. IBM Software Page 28 Explore Hadoop and BigInsights Lab 4 Analyzing social media data with BigSheets To help business analysts and those without a programming background analyze big data, IBM provides a spreadsheet-style tool called BigSheets. In this lab, you'll learn how you can explore big data through this tool without writing any scripts or MapReduce applications. The sample data for this lab consists of social media posts about a popular brand (IBM Watson) that was collected using a sample application provided with BigInsights. For background information, you may want to read the article on Analyzing social media and structured data with InfoSphere BigInsights at http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html After completing this hands-on lab, you’ll be able to: • Create a BigSheets workbook • Analyze and customize a workbook • Visualize your workbook's data in a chart • Create a Big SQL table based on your workbook • Export your workbook's data into one of several popular formats Allow 45 – 60 minutes to complete this lab. 4.1. Creating a workbook To get started, copy the sample blogs-data.txt file to HDFS and create a master workbook for it. __1. Obtain the blogs-data.txt file. You’ll find this in the sampleData.zip file provided with the article mentioned earlier. __2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in HDFS for your sample data. Under /user/biadmin, create a sampleData directory. Beneath /user/biadmin/sampleData, create the IBMWatson subdirectory. Where did this data come from? For time efficiency, social media data about "IBM Watson" was already collected using the Boardreader sample application, which collects social media data from various global sites and writes the output in JSON array format to files. This lab focuses on blog data collected about IBM Watson for a six-month interval. Boardreader is an IBM business partner that offers a social media content aggregation and provisioning service based on multilingual data dating back to 2001. The service searches message boards / forums, social networks, blogs/comments, microblogs, reviews, videos/comments and online news. Customers who want to use the Boardreader service should contact the firm directly to obtain a license key.
  • 29. Hands On Lab Page 29 If you forgot how to create a subdirectory in HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console. __3. Upload the blogs-data.txt file to the /user/biadmin/sampleData/IBMWatson directory. You can use Hadoop file system commands or the BigInsights Web console to do this. (If you forgot how to copy a file to HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.) __4. From the Files page of the Web console, position your cursor on the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous image. __5. Click the Sheet radio button to preview this data in a spreadsheet-style format. __6. Because the sample blog data for this lab uses a JSON Array structure, you must click on the pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON Array reader and click the green check.
  • 30. IBM Software Page 30 Explore Hadoop and BigInsights __7. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click Save. __8. Note that the BigSheets page of the Web console will open and your new workbook will be displayed. Now you're ready to begin exploring this data using BigSheets. 4.2. Analyzing and customizing your workbook BigSheets offers analysts a variety of macros, functions, and built-in analytical features. You'll learn about a few here.
  • 31. Hands On Lab Page 31 __1. To make it easier to search and manage your workbooks, add a few tags to the Watson Blogs master workbook you just created. In the upper right corner, click the icon to toggle the workbook display to show additional fields. Depending on the size of your browser, an additional scroll bar may appear at right. __2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign (+), enter a tag for Watson, and click the green check mark. Repeat the process to add separate tags for IBM and blogs. __3. Click on the Workbooks link in the upper left corner of your open workbook. __4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-down Tags menu to select the blogs tag or type tag: blogs into the box.
  • 32. IBM Software Page 32 Explore Hadoop and BigInsights __5. Open the Watson Blogs master workbook again. (Double click on it.) __6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a “base” workbook and has a limited set of things you can edit. So, to manipulate the data contained within a workbook, you want to create a new workbook derived from the master. __a. Click the Build new Workbook button. __b. When the new Workbook appears, change its default name. Click the pencil icon next to the name, enter Watson Blogs Revised as the new name, and click the green check mark. __c. Click the Fit column(s) button to more easily see columns A through H on your screen. __7. Remove the column IsAdult from your workbook. This is currently column E. Click on the triangle next to the IsAdult column name and select Remove.
  • 33. Hands On Lab Page 33 __8. In this case, you want to keep only a few columns. To easily remove several columns, click the triangle again (from any column) and select Organize Columns. __a. Click the red X button next to each column you want to remove. In this case, KEEP the following columns __i. Country __ii. FeedInfo __iii. Language __iv. Published __v. SubjectHtml __vi. Tags __vii. Type __viii. Url __b. Click the green check mark button when you are ready to remove the columns you selected. Did I lose data? No. Deleting a column in a workbook does not remove the underlying data; it just removes the mapping to that column.
  • 34. IBM Software Page 34 Explore Hadoop and BigInsights __9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen appears similar to this: __10. From the Save menu at upper left, select Save. Provide a description for your workbook if you’d like. __11. Apply a built-in function to further investigate the contents of this workbook. Click the Add Sheets button in the lower left corner.
  • 35. Hands On Lab Page 35 __12. From the pop-up menu, select Function. You're going to apply a built-in function that extracts the URL Host information from the full URL links associated with the blog data that was captured. Doing so will enable you to identify and chart sites with the greatest blog coverage of IBM Watson. __13. From the Function menu, click Categories and Url. __14. Select the URLHOST function. __15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url column as the source of input to the URLHOST function.
  • 36. IBM Software Page 36 Explore Hadoop and BigInsights __16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook you'd like to retain. Select Add All and click the green check mark. __17. Verify that your workbook contains a new URLHOST column and all previously existing columns. (Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview the results of your work against a small sample of the data represented by your workbook.) If desired, click the Fit Column button to show more columns on your screen.
  • 37. Hands On Lab Page 37 __18. Click Save > Save & Exit. __19. When prompted to Run or Close the workbook, click Run. "Running" a workbook instructs BigSheets to apply the logic you specified graphically against all data associated with your workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page. __20. When the operation completes, verify that your workbook appears similar to this:
  • 38. IBM Software Page 38 Explore Hadoop and BigInsights __21. If desired, use the Next button in the lower right corner to page through the content a few times, noting the various URLHOST values. If desired, you could use built-in BigSheets features to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the English language), etc. But perhaps the quickest way to see which sites published the most blogs about IBM Watson during this time period is to chart the results. You'll do that next. 4.3. Creating charts Now that you've customized your workbook to eliminate some unwanted columns and generate a new column containing URL host information, it's time to visualize the results. In this short exercise, you'll create two simple charts that identify the top 10 global sites with the most blog posts about IBM Watson. __1. If necessary, open the Watson Blogs Revised workbook. __2. Click on the Add chart link in the lower left.
  • 39. Hands On Lab Page 39 __3. Select chart > Bar as the chart type. __4. Specify appropriate properties for the bar chart, paying close attention to these fields: __a. Title: Top 10 Blog Sites for IBM Watson __b. X Axis: URLHOST __c. Sort By: Y Axis __d. Occurrence Order: Descending __e. Limit: 10
  • 40. IBM Software Page 40 Explore Hadoop and BigInsights __5. Click the green check mark. __6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire data set. __7. Inspect the results. Are you surprised that ibm.com wasn’t the top site for blog posts about IBM Watson?
  • 41. Hands On Lab Page 41 __8. If desired, hover over each bar to see the URL host name and the number of blogs posted at that site. __9. Next, create a new chart of a different type to visualize the information in a different format. Select Add Chart > Categories > cloud > Bubble Cloud. __10. Provide appropriate values for the following fields: __a. Title: Top 10 Blog Sites for IBM Watson __b. Tags: URLHOST __c. Occurrence Order: Descending __d. Sort By: Count __e. Limit: 10
  • 42. IBM Software Page 42 Explore Hadoop and BigInsights __11. Click the green check mark. __12. When prompted, Run the chart. __13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that site. 4.4. Creating a Big SQL table based on your workbook BigSheets offers a wide range of built-in features, including the ability to create a Big SQL table from your workbook. This is quite handy if you have SQL-based tools or applications that you'd like to use with data you've customized in BigSheets.
  • 43. Hands On Lab Page 43 __1. If necessary, open your Watson Blogs Revised workbook. __2. Click the Create Table button just above the columns of your workbook. When prompted, accept sheets as the target schema name and type mywatsonblogs as the target table name. __3. Click Confirm. __4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window and expand the sheets folder. __5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right. __6. Click the Welcome tab of the Web console. In the Quick Links section, click the Run Big SQL queries link.
  • 44. IBM Software Page 44 Explore Hadoop and BigInsights __7. A new tab will appear in your Web browser. __8. In the box where you're prompted to enter your Big SQL query, type this statement: select urlhost, language, subjecthtml from sheets.mywatsonblogs fetch first 10 rows only; __9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button). __10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons. Click Run. __11. Inspect the results.
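If you'd like to experiment further before moving on, try a simple aggregate query over the same table. The following is an illustrative sketch, not part of the lab's prescribed steps; it assumes only the columns you saw when you created the workbook, such as language:
select language, count(*) as posts
from sheets.mywatsonblogs
group by language;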
  • 45. Hands On Lab Page 45 __12. Close the Big SQL browser tab. 4.5. Optional: Exporting your workbook data In this optional exercise, you'll see how easy it is to export data in your workbook to one of several popular formats so that other applications can easily access the data. __1. If necessary, open your Watson Blogs Revised workbook. __2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format type. __3. Click the File radio button to export the data to a file in your distributed file system. Querying tables with Big SQL While the Web console's Big SQL query interface is handy for executing test queries that return a small amount of data, it's best to use other facilities provided by IBM or third parties to execute Big SQL queries that return larger volumes of data to avoid memory constraints imposed by your browser. In a subsequent lab, you'll learn how to execute Big SQL queries from Eclipse.
  • 46. IBM Software Page 46 Explore Hadoop and BigInsights __4. Use the Browse button to navigate to the directory in HDFS where you would like to export this workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv. Click OK. __5. Click OK again to initiate the data export operation. __6. When a message appears indicating that the operation has finished, click OK. __7. On the Files page of the Web console, navigate to the directory you specified for the export (/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.
  • 47. Hands On Lab Page 47 __8. Optionally, click the download icon to copy the file from HDFS to a directory of your choice in your local file system.
  • 48. IBM Software Page 48 Explore Hadoop and BigInsights Lab 5 Querying data with Big SQL Now that you know how to work with HDFS and analyze your data with a spreadsheet-style tool, it’s a good time to explore how you can query your data with Big SQL. Big SQL provides broad SQL support based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights in the same way that you access relational databases from your enterprise applications. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions. This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-party retailer stores as well as directly to consumers through its online store. It maintains its data in a series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In this lab, you will explore how to create, populate, and query a subset of the star schema database to investigate the company’s performance and offerings. Note that BigInsights provides scripts to create and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use fewer than 10 of these tables in this lab. To execute the queries in this lab, you will use the open source Eclipse environment provided with the BigInsights Quick Start Edition VMware image. Of course, you can use other tools or interfaces to invoke Big SQL, such as the Java SQL Shell (JSqsh), a command-line facility provided with BigInsights. However, Eclipse is a good choice for this lab, as it formats query results in a manner that’s easy to read and encourages you to collect your SQL statements into scripts for editing and testing. After you complete the lessons in this module, you will understand how to: • Connect to the Big SQL server from Eclipse • Execute individual or multiple Big SQL statements • Create Big SQL tables in Hadoop • Populate Big SQL tables with data from local files • Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular expressions. • Create and query a view based on multiple Big SQL tables. • Create and run a JDBC client application for Big SQL using Eclipse. Allow 45 – 60 minutes to complete this lab. 5.1. Creating a project and executing Big SQL statements To begin, create a BigInsights project and Big SQL script. __1. Launch Eclipse using the icon on your desktop. Accept the default workspace when prompted. __2. Create a BigInsights project for your work. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, select BigInsights Project, and then click Next.
  • 49. Hands On Lab Page 49 __3. Type myBigSQL in the Project name field, and then click Finish. __4. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective. __5. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, select SQL script, and then click Next. __6. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder. __7. In the File name field, type aFirstFile. The .sql extension is added automatically. Click Finish. In the Select Connection Profile window, locate the Big SQL JDBC connection, which is the pre-defined connection to Big SQL 3.0 provided with the VMware image. Inspect the properties displayed in the Properties field. Verify that the connection uses the JDBC driver and database name shown in the Properties pane here.
  • 50. IBM Software Page 50 Explore Hadoop and BigInsights About the driver selection You may be wondering why you are using a connection that employs the com.ibm.db2.jcc.DB2Driver class. In 2014, IBM released a common SQL query engine as part of its DB2 and BigInsights offerings. Doing so provides for greater SQL commonality across its relational DBMS and Hadoop-based offerings. It also brings a greater breadth of SQL function to Hadoop (BigInsights) users. This common query engine is accessible through the DB2 driver. The Big SQL driver remains operational and offers connectivity to an earlier, BigInsights-specific SQL query engine. This lab focuses on using the common SQL query engine. __8. Click Edit to edit this connection's log in information.
  • 51. Hands On Lab Page 51 __9. Change the user name and password properties to match your user ID and password (e.g., biadmin / biadmin). Leave the remaining property values intact. __10. Click Test Connection to verify that you can successfully connect to the server. __11. Check the Save password box and click OK. __12. Click Finish to close the connection window. Your empty SQL script will be displayed. __13. Copy the following statement into your SQL script: create hadoop table test1 (col1 int, col2 varchar(5));
  • 52. IBM Software Page 52 Explore Hadoop and BigInsights Because you didn't specify a schema name for the table, it will be created in your default schema, which is your user name (biadmin). Thus, the previous statement is equivalent to create hadoop table biadmin.test1 (col1 int, col2 varchar(5)); In some cases, the Eclipse SQL editor may flag certain Big SQL statements as containing syntax errors. Ignore these false warnings and continue with your lab exercises. __14. Save your file (press Ctrl + S or click File > Save). __15. Right mouse click anywhere in the script to display a menu of options. __16. Select Run SQL or press F5. This causes all statements in your script to be executed. __17. Inspect the SQL Results pane that appears towards the bottom of your display. (If desired, double click on the SQL Results tab to enlarge this pane. Then double click on the tab again to return the pane to its normal size.) Verify that the statement executed successfully. Your Big SQL database now contains a new table named BIADMIN.TEST1. Note that your schema and table name were folded into upper case.
  • 53. Hands On Lab Page 53 For the remainder of this lab, you should execute each SQL statement individually. To do so, highlight the statement with your cursor and press F5. When you’re developing a SQL script with multiple statements, it’s generally a good idea to test each statement one at a time to verify that each is working as expected. __18. From your Eclipse project, query the system for metadata about your test1 table:
select tabschema, colname, colno, typename, length
from syscat.columns
where tabschema = USER and tabname = 'TEST1';
In case you're wondering, syscat.columns is one of a number of views supplied over system catalog data automatically maintained for you by the Big SQL service. __19. Inspect the SQL Results to verify that the query executed successfully, and click on the Result1 tab to view its output. __20. Finally, clean up the object you created in the database.
drop table test1;
__21. Save your file. If desired, leave it open to execute statements for subsequent exercises. Now that you’ve set up your Eclipse environment and know how to create SQL scripts and execute queries, you’re ready to develop more sophisticated scenarios using Big SQL. In the next lab, you will create a number of tables in your schema and use Eclipse to query them. 5.2. Creating sample tables and loading sample data In this lesson, you will create several sample tables and load data into these tables from local files.
  • 54. IBM Software Page 54 Explore Hadoop and BigInsights __1. Determine the location of the sample data in your local file system and make a note of it. You will need to use this path specification when issuing LOAD commands later in this lab. Subsequent examples in this section presume your sample data is in the /opt/ibm/biginsights/bigsql/samples/data directory. This is the location of the data on the BigInsights VMware image, and it is the default location in typical BigInsights installations. Furthermore, the /opt/ibm/biginsights/bigsql/samples/queries directory contains SQL scripts that include the CREATE TABLE, LOAD, and SELECT statements used in this lab, as well as other statements. __2. Create several tables to track information about sales. Issue each of the following CREATE TABLE statements one at a time, and verify that each completed successfully:
-- dimension table for region info
CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
( country_key INT NOT NULL
, country_code INT NOT NULL
, flag_image VARCHAR(45)
, iso_three_letter_code VARCHAR(9) NOT NULL
, iso_two_letter_code VARCHAR(6) NOT NULL
, iso_three_digit_code VARCHAR(9) NOT NULL
, region_key INT NOT NULL
, region_code INT NOT NULL
, region_en VARCHAR(90) NOT NULL
, country_en VARCHAR(90) NOT NULL
, region_de VARCHAR(90), country_de VARCHAR(90), region_fr VARCHAR(90)
, country_fr VARCHAR(90), region_ja VARCHAR(90), country_ja VARCHAR(90)
, region_cs VARCHAR(90), country_cs VARCHAR(90), region_da VARCHAR(90)
, country_da VARCHAR(90), region_el VARCHAR(90), country_el VARCHAR(90)
, region_es VARCHAR(90), country_es VARCHAR(90), region_fi VARCHAR(90)
, country_fi VARCHAR(90), region_hu VARCHAR(90), country_hu VARCHAR(90)
, region_id VARCHAR(90), country_id VARCHAR(90), region_it VARCHAR(90)
, country_it VARCHAR(90), region_ko VARCHAR(90), country_ko VARCHAR(90)
, region_ms VARCHAR(90), country_ms VARCHAR(90), region_nl VARCHAR(90)
, country_nl VARCHAR(90), region_no VARCHAR(90), country_no VARCHAR(90)
, region_pl VARCHAR(90), country_pl VARCHAR(90), region_pt VARCHAR(90)
, country_pt VARCHAR(90), region_ru VARCHAR(90), country_ru VARCHAR(90)
, region_sc VARCHAR(90), country_sc VARCHAR(90), region_sv VARCHAR(90)
, country_sv VARCHAR(90), region_tc VARCHAR(90), country_tc VARCHAR(90)
, region_th VARCHAR(90), country_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
  • 55. Hands On Lab Page 55
-- dimension table tracking method of order for the sale (e.g., Web, fax)
CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key INT NOT NULL
, order_method_code INT NOT NULL
, order_method_en VARCHAR(90) NOT NULL
, order_method_de VARCHAR(90), order_method_fr VARCHAR(90)
, order_method_ja VARCHAR(90), order_method_cs VARCHAR(90)
, order_method_da VARCHAR(90), order_method_el VARCHAR(90)
, order_method_es VARCHAR(90), order_method_fi VARCHAR(90)
, order_method_hu VARCHAR(90), order_method_id VARCHAR(90)
, order_method_it VARCHAR(90), order_method_ko VARCHAR(90)
, order_method_ms VARCHAR(90), order_method_nl VARCHAR(90)
, order_method_no VARCHAR(90), order_method_pl VARCHAR(90)
, order_method_pt VARCHAR(90), order_method_ru VARCHAR(90)
, order_method_sc VARCHAR(90), order_method_sv VARCHAR(90)
, order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- look up table with product brand info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en VARCHAR(90) NOT NULL
, product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90)
, product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90)
, product_brand_da VARCHAR(90), product_brand_el VARCHAR(90)
, product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90)
, product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90)
, product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90)
, product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90)
, product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90)
, product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90)
, product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90)
, product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- product dimension table
CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
  • 56. IBM Software Page 56 Explore Hadoop and BigInsights
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- look up table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- look up table for products
CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number INT NOT NULL
, product_language VARCHAR(30) NOT NULL
, product_name VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
  • 57. Hands On Lab Page 57
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- fact table for marketing promotions
CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
( organization_key INT NOT NULL
, order_day_key INT NOT NULL
, rtl_country_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, sales_order_key INT NOT NULL
, quantity SMALLINT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Let’s briefly explore some aspects of the CREATE TABLE statements shown here. If you have a SQL background, the majority of these statements should be familiar to you. However, after the column specification, there are some additional clauses unique to Big SQL -- clauses that enable it to exploit Hadoop storage mechanisms (in this case, Hive). The ROW FORMAT clause specifies that fields are to be terminated by tabs ('\t') and lines are to be terminated by new line characters ('\n'). The table will be stored in a TEXTFILE format, making it easy for a wide range of applications to work with. For details on these clauses, refer to the Apache Hive documentation.
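To make the ROW FORMAT clause concrete: each line of a source file becomes one row, with column values separated by literal tab characters. A hypothetical line for SLS_PRODUCT_LINE_LOOKUP (the values here are invented for illustration, with \t standing in for a tab) might look like this:
991\tCamping Equipment\tCampingausrüstung\tMatériel de camping\t...
Big SQL splits the line on the tabs and maps the pieces, in order, to product_line_code, product_line_en, product_line_de, product_line_fr, and so on.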
  • 58. IBM Software Page 58 Explore Hadoop and BigInsights __3. Load data into each of these tables using sample data provided in files. One at a time, issue each of the following LOAD statements and verify that each completed successfully. Remember to change the file path shown (if needed) to the appropriate path for your environment. The statements will return a warning message providing details on the number of rows loaded, etc.
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE GO_REGION_DIM overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_ORDER_METHOD_DIM overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_DIM overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LOOKUP overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_SALES_FACT overwrite;
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE MRK_PROMOTION_FACT overwrite;
  • 59. Hands On Lab Page 59 Let’s briefly explore the LOAD syntax shown in these examples. The first line of each example loads data into your table using a file URL specification and then specifies the full path to the data source file on your local file system. Note that the path is local to the Big SQL server (not your Eclipse client). The WITH SOURCE PROPERTIES clause specifies that fields in the source data are delimited by tabs ('\t'). The INTO TABLE clause identifies the target table for the LOAD operation. The OVERWRITE keyword indicates that any existing data in the table will be replaced by data contained in the source file. (If you wanted to simply add rows to the table’s content, you could specify APPEND instead.) Note that loading data from a local file is only one of several available options. You can also load data using FTP or SFTP. This is particularly handy for loading data from remote file systems, although you can practice using it against your local file system, too. For example, the following statement for loading data into the GOSALESDW.GO_REGION_DIM table using SFTP is equivalent to the syntax shown earlier for loading data into this table from a local file:
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
Big SQL supports other LOAD options, including loading data directly from a remote relational DBMS via a JDBC connection. See the product documentation for details. __4. Query the tables to verify that the expected number of rows was loaded into each table. Execute each query that follows individually and compare the results with the number of rows specified in the comment line preceding each query.
-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;
-- total rows in SLS_ORDER_METHOD_DIM = 7
select count(*) from sls_order_method_dim;
-- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
select count(*) from SLS_PRODUCT_BRAND_LOOKUP;
-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from SLS_PRODUCT_DIM;
-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from SLS_PRODUCT_LINE_LOOKUP;
-- total rows in SLS_PRODUCT_LOOKUP = 6302
select count(*) from SLS_PRODUCT_LOOKUP;
-- total rows in SLS_SALES_FACT = 446023
select count(*) from SLS_SALES_FACT;
-- total rows in MRK_PROMOTION_FACT = 11034
select count(*) from MRK_PROMOTION_FACT;
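Row counts confirm that each LOAD completed, but you may also want to spot-check actual contents. As a minimal sketch (not part of the lab's prescribed steps), the smallest table, SLS_PRODUCT_LINE_LOOKUP with 5 rows, is easy to inspect in full:
select product_line_code, product_line_en
from SLS_PRODUCT_LINE_LOOKUP;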
  • 60. IBM Software Page 60 Explore Hadoop and BigInsights 5.3. Querying tables with joins, aggregations and more Now you're ready to query your tables. Based on earlier exercises, you've already seen that you can perform basic SQL operations, including projections (to extract specific columns from your tables) and restrictions (to extract specific rows meeting certain conditions you specified). Let's explore a few examples that are a bit more sophisticated. In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations. Note that the queries included in this section are based on queries shipped with BigInsights as samples. Some of these queries return hundreds of thousands of rows; however, the Eclipse SQL Results page limits output to only 500 rows. Although you can change that value in the Data Management preferences section, retain the default setting for this lab. __1. Join data from multiple tables to return the product name, quantity and order method of goods that have been sold. To do so, execute the following query.
-- Fetch the product name, quantity, and order method
-- of products sold.
-- Query 1
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,
sls_product_lookup pnumb, sls_order_method_dim meth
WHERE pnumb.product_language = 'EN'
AND sales.product_key = prod.product_key
AND prod.product_number = pnumb.product_number
AND meth.order_method_key = sales.order_method_key;
Let’s review a few aspects of this query briefly: • Data from four tables will be used to drive the results of this query (see the tables referenced in the FROM clause). Relationships between these tables are resolved through three join predicates specified as part of the WHERE clause. The query relies on three equi-joins to filter data from the referenced tables. (Predicates such as prod.product_number = pnumb.product_number help to narrow the results to product numbers that match in two tables.) • For improved readability, this query uses aliases in the SELECT and FROM clauses when referencing tables. For example, pnumb.product_name refers to “pnumb,” which is the alias for the sls_product_lookup table. Once defined in the FROM clause, an alias can be used in the WHERE clause so that you do not need to repeat the complete table name. • The predicate pnumb.product_language = 'EN' helps to further narrow the result to only English output. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.
  • 61. Hands On Lab Page 61 __2. Modify the query to restrict the order method to one type – those involving a Sales visit. To do so, add the following query predicate just before the semi-colon: AND order_method_en='Sales visit' __3. Inspect the results, a subset of which is shown below:
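For reference, the complete modified query looks like this; it is simply Query 1 with the new predicate appended (the -- Query 2 label is added here to follow the numbering convention of the sample scripts):
-- Query 2
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,
sls_product_lookup pnumb, sls_order_method_dim meth
WHERE pnumb.product_language = 'EN'
AND sales.product_key = prod.product_key
AND prod.product_number = pnumb.product_number
AND meth.order_method_key = sales.order_method_key
AND order_method_en = 'Sales visit';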
  • 62. IBM Software Page 62 Explore Hadoop and BigInsights __4. To find out which sales method has the greatest quantity of orders, add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). In addition, invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. Finally, this query cleans up the output a bit by using aliases (e.g., as Product) to substitute a more readable column header.
-- Query 3
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;
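Query 3 totals the orders for each product line and order method, but it does not rank them. To surface the method with the greatest quantity at the top of the output, you could append an ORDER BY clause, as in this illustrative sketch (the next step inspects the unsorted output of Query 3 as written):
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM sls_order_method_dim AS md, sls_product_dim AS pd,
sls_product_line_lookup AS pll, sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en
ORDER BY total DESC;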
  • 63. Hands On Lab Page 63 __5. Inspect the results, which should contain 35 rows. A portion is shown below. 5.4. Optional: Using SerDes for non-traditional data While data structured in CSV and TSV columns is often stored in BigInsights and loaded into Big SQL tables, you may also need to work with other types of data – data that might require the use of a serializer / deserializer (SerDe). SerDes are common in the Hadoop environment. You’ll find a number of SerDes available in the public domain, or you can write your own following typical Hadoop practices. Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need, just add its JAR file to the appropriate BigInsights subdirectories. Then stop and restart the Big SQL service, and specify the SerDe class name when you create your table. In this lab exercise, you will use a SerDe to define a table for JSON-based blog data. The sample blog file for this exercise is the same blog file you used as input to BigSheets in a prior lab. __1. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system, such as /home/biadmin/sampleData. (As of this writing, the full URL for this SerDe is https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar) __2. Register the SerDe with BigInsights. __a. Stop the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/stop.sh bigsql
__b. Copy the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib directories.
  • 64. IBM Software Page 64 Explore Hadoop and BigInsights __c. Restart the Big SQL server. From a terminal window, issue this command: $BIGINSIGHTS_HOME/bin/start.sh bigsql Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that relies on the SerDe you just registered. For simplicity, this will be an externally managed table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory will contain the table's data in files. As part of this exercise, you will upload the sample blogs-data.txt file into the target DFS directory. Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you created the table and consider all files in that directory to be the table’s contents. This is consistent with the Hive concept of an externally managed table. Once the table is created, you'll query that table. In doing so, you'll note that the presence of a SerDe is transparent to your queries. __3. If necessary, download the .zip file containing the sample data from the bottom half of the article referenced in the introduction. Unzip the file into a directory on your local file system, such as /home/biadmin. You will be working with the blogs-data.txt file. From the Files tab of the Web console, navigate to the /user/biadmin/sampleData directory of your distributed file system. Use the create directory button to create a subdirectory named SerDe-Test. __4. Upload the blogs-data.txt file into /user/biadmin/sampleData/SerDe-Test.
  • 65. Hands On Lab Page 65 __5. Return to the Big SQL execution environment of your choice (JSqsh or Eclipse). __6. Execute the following statement, which creates a TESTBLOGS table that includes a LOCATION clause that specifies the DFS directory containing your sample blogs-data.txt file:
create hadoop table if not exists testblogs
( Country String,
Crawled String,
FeedInfo String,
Inserted String,
IsAdult int,
Language String,
Postsize int,
Published String,
SubjectHtml String,
Tags String,
Type String,
Url String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/user/biadmin/sampleData/SerDe-Test';
5.5. Optional: Developing a JDBC client application with Big SQL You can write a JDBC client application that uses Big SQL to open a database connection, execute queries, and process the results. In this optional exercise, you'll see how writing a client JDBC application for Big SQL is like writing a client application for any relational DBMS that supports JDBC access. __1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File > New > Project. From the New Project window, select Java Project. Click Next.
  • 66. IBM Software Page 66 Explore Hadoop and BigInsights __2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next. __3. Open the Libraries tab and click Add External Jars. Add the DB2 JDBC driver for BigInsights, located at /opt/ibm/biginsights/database/db2/java/db2jcc4.jar. __4. Click Finish. Click Yes when you are asked if you want to open the Java perspective. __5. Right-click the MyJavaProject project, and click New > Package. In the New Java Package window, in the Name field, type a name for the package, such as aJavaPackage4me. Click Finish.
  • 67. Hands On Lab Page 67 __6. Right-click the aJavaPackage4me package, and click New > Class. __7. In the New Java Class window, in the Name field, type SampApp. Select the public static void main(String[] args) check box. Click Finish. __8. Replace the default code for this class by copying or typing the following code into the SampApp.java file (you'll find the file in /opt/ibm/biginsights/bigsql/samples/data/SampApp.java):
package aJavaPackage4me;

// a. Import required package(s)
import java.sql.*;

public class SampApp {
  • 68. IBM Software Page 68 Explore Hadoop and BigInsights
    /**
     * @param args
     */
    // b. Set JDBC & database info.
    // Change these as needed for your environment.
    static final String db = "jdbc:db2://YOUR_HOST_NAME:51000/bigsql";
    static final String user = "YOUR_USER_ID";
    static final String pwd = "YOUR_PASSWORD";

    public static void main(String[] args) {
        Connection conn = null;
        Statement stmt = null;
        System.out.println("Started sample JDBC application.");

        try {
            // c. Register JDBC driver -- not needed for a DB2 JDBC type 4 connection
            // Class.forName("com.ibm.db2.jcc.DB2Driver");

            // d. Get a connection
            conn = DriverManager.getConnection(db, user, pwd);
            System.out.println("Connected to the database.");

            // e. Execute a query
            stmt = conn.createStatement();
            System.out.println("Created a statement.");
            String sql;
            sql = "select product_color_code, product_number from sls_product_dim " +
                  "where product_key=30001";
            ResultSet rs = stmt.executeQuery(sql);
            System.out.println("Executed a query.");

            // f. Obtain results
            System.out.println("Result set: ");
            while (rs.next()) {
                // Retrieve by column name
                int product_color = rs.getInt("PRODUCT_COLOR_CODE");
                int product_number = rs.getInt("PRODUCT_NUMBER");
                // Display values
                System.out.print("* Product Color: " + product_color + "\n");
                System.out.print("* Product Number: " + product_number + "\n");
            }

            // g. Close open resources
            rs.close();
            stmt.close();
            conn.close();
        } catch (SQLException sqlE) {
            // Process SQL errors
            sqlE.printStackTrace();
        } catch (Exception e) {
            // Process other errors
            e.printStackTrace();
        } finally {
  • 69. Hands On Lab Page 69
            // Ensure resources are closed before exiting
            try {
                if (stmt != null) stmt.close();
            } catch (SQLException sqle2) {
                // nothing we can do
            }
            try {
                if (conn != null) conn.close();
            } catch (SQLException sqlE) {
                sqlE.printStackTrace();
            }
        } // end try block
        System.out.println("Application complete");
    }
}
__a. After the package declaration, ensure that you include the packages that contain the JDBC classes that are needed for database programming (import java.sql.*;). __b. Set up the database information so that you can refer to it. Be sure to change the user ID, password, and connection information as needed for your environment. __c. Optionally, register the JDBC driver. The class name is provided here for your reference. When using the DB2 Type 4.0 JDBC driver, it’s not necessary to specify the class name. __d. Open the connection. __e. Run a query by submitting an SQL statement to the database. __f. Extract data from the result set. __g. Clean up the environment by closing all of the database resources. __9. Save the file, then right-click the Java file and click Run As > Java Application. __10. The results show in the Console view of Eclipse:
Started sample JDBC application.
Connected to the database.
Created a statement.
Executed a query.
Result set:
* Product Color: 908
* Product Number: 1110
Application complete
  • 70. IBM Software Page 70 Explore Hadoop and BigInsights Lab 6 Summary In this lab, you gained hands-on experience using many popular capabilities of InfoSphere BigInsights, IBM's Hadoop-based platform for analyzing big data. You explored your BigInsights cluster using a Web-based console and manipulated social media data using a spreadsheet-style interface. You also created Big SQL tables for your data and executed several complex queries over this data. To expand your skills even further, visit the HadoopDev web site (https://developer.ibm.com/hadoop/) for links to free online courses, tutorials, and more. Now you're ready to get started using BigInsights for your own projects. What will you do with big data?
  • 74. © Copyright IBM Corporation 2014. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.