Using LittleFe for “Big” Data Education: An Experience
Report with GHCN
David Monismith (monismi, a)
Bala Venkata Paneendra Abburi (S519288, b)
Achyuth Chaitanya Chitumalla (S519322, b)
Santhosh Reddy Damasani (S519323, b)
Satyanarayana Juttiga (S519343, b)
Snehitha Reddy Padakanti (S519400, b)
Tejaswi Potu (S519411, b)
Sandeep Raghavareddy (S519414, b)
Susmith Reddy Siddipeta (S519421, b)
Prashanthi Kanneboina (S519464, b)
Spandana Sama (S519474, b)
Northwest Missouri State University, 800 University Dr., CH 2050, Maryville, MO 64468, +1 (660) 562-1802
(a) @nwmissouri.edu | (b) @mail.nwmissouri.edu
ABSTRACT
This paper provides an experience report of a graduate directed
project that began in Fall 2014 at Northwest Missouri State
University as a two-semester experiment to determine if a "Big"
Data project could be accomplished with a LittleFe cluster
computer. Described herein are the details of the project, the
hardware, and the software stack. The project itself is based upon
existing projects that make use of the GHCN data set, an
approximately 20GB raw text data set consisting of daily climatic
data spanning over 100 years. The students on this project
attempted to bulk load the data onto a modified LittleFe v4d
cluster computer using Hive, HBase, and Hadoop/HDFS. The end
result of the project was to be an interactive website that would
allow for display and animation of the climatic data using Apache
Tomcat, Java Server Pages, the Google Maps API, and D3.js.
After describing the project and the approaches to solving this
problem, lessons learned and future applications for this and
similar projects are described.
Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information
Science Education – Computer science education, curriculum,
self-assessment.
General Terms
Algorithms, Design, Experimentation.
Keywords
Computer Science Education, Big Data, LittleFe.
1. INTRODUCTION
In the fall of 2014, several students began their graduate directed
project at Northwest Missouri State University. At the faculty
mentor/client's request, this project involved development of a
web application and an HBase/Hadoop backend that would allow
for query, display, and visualization of Global Historical
Climatology Network (GHCN) weather station data as available at
the National Climatic Data Center (NCDC). The overall goals for
this project were multi-faceted: 1) to investigate the viability of
the LittleFe platform for use with Hadoop and HBase, 2) to teach
students about Linux, distributed computing, and processing large
data sets, and 3) to investigate new and interesting big data tools.
The student authors worked with a faculty client (David
Monismith) and were provided with boilerplate code and
significant technical assistance from faculty for this project.
Students worked with Linux virtual machines for testing and with
a modified LittleFe v4d as a deployment system. The GHCN_All
data set was used to represent "big" data for this project. This
data set contains daily weather data from nearly 90,000 weather
stations for as many as 100 years per location, and is available at
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. The size of this data
set is 2.3GB compressed and over 20GB raw plain text. The bulk
of the work on this project included installation of a Hadoop
ecosystem on a LittleFe system, discovery of tools to effectively
work with the data set, development of queries, and development
of a web interface to interact with the data.
Since the Global Historical Climatology Network consists of a
relatively large amount of data as compared to the system
specifications of a LittleFe cluster, Hadoop was chosen as a
backend processing tool and HBase was chosen as a database.
Using Hadoop and HBase, the data is queried and retrieved in a
straightforward manner; however, for data loading and retrieval,
SQL-like commands are preferred because of their expressive power.
Therefore, students chose to use SQL middleware to allow for
query access into the database. Initially, students chose to use
Hive to bulk load the GHCN data into HBase with SQL-like
queries. After some research, Apache Phoenix was chosen as a
replacement for Hive for two reasons – 1) a JDBC connection to
Phoenix is available for programmatic access to HBase via SQL-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
XSEDE'15, July 26–30, 2015, St. Louis, MO, USA.
Copyright 2015 held by Owner/Author. Publication Rights Licensed to
ACM 1-58113-000-0/00/0010 …$15.00.
like queries and 2) Phoenix reportedly has better performance
than Hive. Thus data sets retrieved with Phoenix via JDBC may
be sent directly to the front end for display.
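As a rough illustration of this retrieval path, a date-range query against Phoenix might be assembled as below. This is a sketch only: the table and column names (WEATHER, STATION_ID, OBS_DATE, TMAX) are hypothetical placeholders rather than the project's actual schema, and the JDBC calls are shown in comments because they require a running Phoenix/HBase cluster.

```java
// Sketch: building a Phoenix SQL query for a station/date-range lookup.
// WEATHER, STATION_ID, OBS_DATE, and TMAX are illustrative names only.
public class PhoenixQuerySketch {
    static String buildRangeQuery(String table, String startDate, String endDate) {
        // Phoenix accepts standard SQL; in production, dates would be bound
        // through a PreparedStatement rather than concatenated as strings.
        return "SELECT STATION_ID, OBS_DATE, TMAX FROM " + table
             + " WHERE OBS_DATE BETWEEN '" + startDate + "' AND '" + endDate + "'";
    }

    public static void main(String[] args) {
        String sql = buildRangeQuery("WEATHER", "2014-01-01", "2014-01-31");
        System.out.println(sql);
        // Against a live cluster this would be executed roughly as:
        //   Connection conn = DriverManager.getConnection("jdbc:phoenix:node000");
        //   ResultSet rs = conn.createStatement().executeQuery(sql);
    }
}
```

Because Phoenix speaks JDBC, the same result set can then be serialized (e.g., through a JSP) straight to the front end without an intermediate layer such as Thrift.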
The front end of the application displays temperature data to the
user with a heat map using visualization tools including Google
Maps API, D3.js, Java Server Pages (JSP), Java, and Apache
Tomcat. The front end provides the user with the ability to select
a date range and a desired attribute. These tools allow for the
display of latitude and longitude values for all stations present in
the dataset and for the display of user-selected attributes via heat
maps, showing a different color tone for each range of values and
including a legend.
legend. Weather station data for the front end is provided from
HBase via dynamic JDBC Phoenix queries provided through Java
Server Pages (JSP) and related Java code.
Students and the faculty mentor/client spent a significant amount
of time learning about the tools and dataset used within this
project. After gaining such domain knowledge, the authors
developed a web application to interface with the tools and
database presented herein. Finally, the faculty mentor performed
a self-assessment of the approach taken. This paper is organized
as follows: the project background is described in the second
section; a detailed project description follows in the third section;
learning outcomes of the project are described in the fourth
section; and finally, lessons learned, including difficulties
encountered, and conclusions are presented in the final sections of
this paper.
2. BACKGROUND
In this section, background information on the data, tools and
scope of the project is provided. This first includes a description
of the graduate directed projects course to provide the reader with
an understanding of the duration and scope of the project.
Thereafter, a description of the GHCN_ALL dataset is provided,
and descriptions for the various components of the
software/hardware stack used in this project are also provided.
Components used in the software/hardware stack include the
Google Maps API, D3.js, Hadoop/HDFS, HBase, Phoenix, and
Tomcat.
2.1 Graduate Directed Project
Graduate Directed Projects at Northwest Missouri State
University are a two-semester course sequence that serves in place
of a thesis project for Master of Science in Applied Computer
Science students. Students in this course work in teams of
approximately ten students with a faculty mentor and a client to
complete a significant project that encompasses most or all of the
software development lifecycle ranging from requirements
gathering and design to testing and maintenance. Standard
graduate projects at Northwest often involve gathering project
requirements, performing user interface and database design,
developing a website or mobile application that interacts with a
database or another complex tool, performing unit and integration
testing on that application, deploying it to a test server, and
performing usability testing.
2.2 Data Set
The GHCN_All dataset includes data representing different
stations, wherein each station includes various daily attributes
such as temperature, precipitation, snowfall, etc. A table
describing the format of the data used within this project from the
GHCN_All dataset follows below.
Table 1. GHCN_All Data File Format
Variable          Columns   Type                 Description
Id                1-11      Character            Station identification code
Year              12-15     Integer              Record year
Month             16-17     Integer              Record month
Element           18-21     Character            Type of weather observation
Value1            22-26     Integer              Value on the first day of the month
Mflag1            27        Character            Measurement flag
Qflag1            28        Character            Quality flag
Sflag1            29        Character            Source flag
...               ...       ...                  ...
Value31-Sflag31   262-269   Integer & Character  Value & flags on the 31st day of the month
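Because the layout in Table 1 is fixed-width, a record can be decoded with plain substring calls at the listed column offsets. The sketch below illustrates this for the first few fields; the sample line in it is synthetic, constructed only to match the layout, not taken from the data set.

```java
// Sketch: decoding one GHCN .dly record using the offsets in Table 1.
// Offsets are 1-indexed inclusive in the table, so column range a-b
// maps to substring(a-1, b) in Java.
public class DlyParserSketch {
    static String[] parse(String line) {
        String id      = line.substring(0, 11).trim();   // cols 1-11
        String year    = line.substring(11, 15);         // cols 12-15
        String month   = line.substring(15, 17);         // cols 16-17
        String element = line.substring(17, 21);         // cols 18-21
        String value1  = line.substring(21, 26).trim();  // cols 22-26
        return new String[] { id, year, month, element, value1 };
    }

    public static void main(String[] args) {
        // Synthetic sample: station id, year, month, element, day-1 value.
        String sample = "USC00011084" + "1956" + "07" + "TMAX" + "  289";
        System.out.println(String.join(",", parse(sample)));
        // prints: USC00011084,1956,07,TMAX,289
    }
}
```

A full parser would continue through the Value/Mflag/Qflag/Sflag groups out to column 269, one group per day of the month.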
As previously mentioned, the GHCN_All dataset contains climate
data (daily measurements) for nearly 90,000 weather stations
spanning as far back as 100 years for some stations. Data
measured at each station is stored in a “.dly” file with the station
identifier (“id” in Table 1) as the filename prefix. Each file
contains plain text data with each line in the file containing the
information as shown in Table 1 with one line of data per month.
Daily weather observations include elements such as precipitation
in tenths of millimeters (PRCP), snowfall in millimeters (SNOW),
snow depth (SNWD), minimum temperature (TMIN), and
maximum temperature (TMAX) in tenths of degrees Celsius. As
mentioned above, the data was bulk loaded into HBase and is
retrieved from the database as necessary for visualization.
Initially, this data was retrieved on the front end with Apache
Thrift, however, the use of Phoenix in conjunction with JDBC,
Apache Tomcat (a servlet container), and JavaScript, has replaced
the need for Thrift. By using Java Server Pages and JavaScript in
the front end, visualization with line plots and box plots was
achieved with D3.js. In a similar fashion, heat map visualization
was achieved via the Google Maps API. Heat map results are
displayed by selecting a date range and desired attribute. At the
click of a button, weather conditions may be displayed over the
given date range.
2.3 Deployment Platform
In this project, a system called LittleFe3 was used as the
deployment system. This system was used to provide a low-cost,
parallel, distributed computing environment. The cost of this
system was approximately $3000, and it was built using
commercial, off-the-shelf hardware and a custom chassis provided
by Earlham College. The LittleFe3 system makes use of the
BCCD operating system, a Debian Linux variant. This system is
a modified version of Earlham's LittleFe v4d system and includes
six nodes: one head node (node000) and five child nodes (node011
through node015). The block diagram below describes the
LittleFe3 system, which has 8GB RAM per node, one 512GB SSD
on the head node, five 256GB SSDs (one per child node), and six
quad-core Celeron J1900 processors (one per node).
Figure 1: LittleFe3 block diagram
Provided that proper load balancing may be achieved and that
code or data operations can be parallelized, a parallel platform
such as LittleFe3 may provide both speedup and efficiency.
Additionally, on a distributed system such as LittleFe3, both
shared memory and distributed memory parallelism may be
achieved through the use of multiple multi-core systems.
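For reference, these are the textbook definitions rather than measurements from this project: with T1 the runtime on one core and Tp the runtime on p cores, speedup and efficiency are

```latex
S = \frac{T_1}{T_p}, \qquad E = \frac{S}{p}
```

so ideal (linear) scaling corresponds to S = p and E = 1.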
2.4 Software Stack
On the LittleFe3 system, the authors of this paper installed a
Hadoop software stack that included Hadoop, the HDFS file
system, the HBase NoSQL database, Hive, Phoenix, Tomcat,
D3.js, and the Google Maps API. The primary software stack is
shown in the image below.
Figure 2: Software Stack
2.4.1 Hadoop
Hadoop consists of algorithms that may be used to manage
distributed storage and to perform big data processing. Hadoop
was developed by the Apache Software Foundation, and was
written in Java in a manner such that it will operate on many
different hardware platforms. There are a number of different
operation modes for Hadoop including a local/standalone mode, a
pseudo-distributed mode, and a fully distributed mode. Included
within this framework are modules such as the Hadoop
Distributed File System (HDFS) and Hadoop MapReduce. In this
project, Hadoop is used because it provides a parallel computation
environment and a distributed file system. As the GHCN dataset
being used is quite large, Hadoop has proven effective in
processing this data. Currently, installing Hadoop on a LittleFe
system is somewhat complex and requires a significant amount of
effort; in particular, this includes modifying the capacity-scheduler,
core-site.xml, mapred-site.xml, yarn-site.xml, masters, and slaves
files. These files allow for initialization
of system dependent scheduling variables, the HDFS location, the
HDFS replication factor, identification of master and slave nodes,
and memory allocations for various Hadoop components. Values
for these system dependent variables are provided in the
Appendix.
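To illustrate the kind of site-specific values involved, a minimal pair of configuration fragments might look as follows. The hostname matches the node naming used here, but the port, path, and replication factor are illustrative assumptions rather than the exact values from the Appendix.

```xml
<!-- core-site.xml: points all nodes at the head node's HDFS namenode. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node000:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a small replication factor suits a six-node cluster. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```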
2.4.2 HBase, Hive, and Phoenix
HBase provides random read/write access to HDFS in the form of
a NoSQL column store database. This type of database provides
data access in a form similar to that of a spreadsheet tool wherein
data is stored using rows, columns, and column families (similar
to sheets). Simple operations like get and put allow for direct
access to each cell within the column store, given the row,
column, and/or column family names. Additionally, operations
such as scan and list allow for full display of the contents of a
table within the database and of all the tables stored therein,
respectively. Importantly, HBase allows such data to be
distributed across multiple systems when used in conjunction with
Hadoop. Therefore, HBase is well suited to storing large
distributed datasets, especially datasets that are read and
processed many times relative to the number of database writes.
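To make these operations concrete, a short HBase shell session might look as follows. The table name and column family match the stations_hbase schema shown later in Table 3, but the row key and value are illustrative only, and the commands require a running HBase instance.

```
# Illustrative HBase shell commands (not runnable outside HBase).
create 'stations_hbase', 'cf'                  # table with one column family
put 'stations_hbase', 'US1MOND0001', 'cf:latitude', '40.35'
get 'stations_hbase', 'US1MOND0001'            # fetch one row
scan 'stations_hbase'                          # dump the whole table
list                                           # show all tables
```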
Commands such as get and put are quite primitive when compared
to SQL, so many developers prefer to use a middleware tool that
provides a SQL layer over HBase for ease of use. Tools such as
Apache Hive and Phoenix provide a relational database layer on
top of HBase. First, Hive provides data warehouse access to the
distributed storage layer in HDFS. It provides the capability to
bulk load such stored data into HBase. While Hive is quite useful
for bulk loading and data warehousing, Phoenix has proven to be
a more useful tool for this project. Phoenix also provides an SQL
layer over HBase; however, it additionally provides low-latency
JDBC functionality because it uses the HBase API directly, which
increases query performance. Empirical results in our project
have shown that Phoenix is faster than Hive: where Hive may
take seconds to query small numbers of rows, Phoenix may take
just seconds to query ten million rows. As our project may
require thousands or hundreds of thousands of rows to be
retrieved from HBase, Phoenix is preferred because it takes less
time to display results. Performance reports published by Apache
indicate Phoenix may be 50-70 times faster than Hive.
2.4.3 Google Maps API, D3.js, and Tomcat
For front-end work, three web APIs were used in this project –
Google Maps API, D3.js, and Apache Tomcat. The Google Maps
API was used for map and weather data display. D3.js was used
for data visualization of single weather station data over time and
for comparison of such data between several weather stations.
Finally, Apache Tomcat was used to provide a servlet container
for the JDBC-enabled data access layer between the front end and
back end.
The Google Maps API was used both to provide a map display
layer for the web view and to provide heatmap functionality for
the application described in the next section. The map display layer
functionality provides an important role – it allows for acquisition
of the minimum and maximum longitude and latitude coordinates.
These two values are important in allowing for display of the
appropriate temperature data because they allow for selection of
the appropriate weather stations from the database.
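The station-selection step those bounds enable amounts to a simple range test, sketched below in plain Java. In the project itself this filtering happens in an SQL WHERE clause against HBase/Phoenix; the method and class names here are illustrative, not the project's actual code.

```java
// Sketch: selecting stations that fall inside the map viewport's
// bounding box, defined by min/max latitude and longitude.
public class BoundsFilterSketch {
    static boolean inBounds(double lat, double lon,
                            double minLat, double maxLat,
                            double minLon, double maxLon) {
        return lat >= minLat && lat <= maxLat
            && lon >= minLon && lon <= maxLon;
    }

    public static void main(String[] args) {
        // Maryville, MO (~40.35, -94.87) against a Midwest-sized viewport.
        System.out.println(inBounds(40.35, -94.87, 38.0, 42.0, -96.0, -92.0));
    }
}
```

The same two coordinate pairs translate directly into BETWEEN clauses in the Phoenix query, so only stations visible on screen are fetched.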
D3.js is a JavaScript library for producing data visualizations, that
is, mapping datasets to images or animations. Using the D3
library, it is possible to produce both static and dynamic
visualizations. These may be interactive and can be produced in
real time using a standard web technology – JavaScript. Within
this project, D3.js was used to create both Box Plots and Line
Plots. Examples of both box plots and line plots as generated in
this project are provided below.
Figure 3: D3.js Box Plot
Notice that the box plot allows for a graphical display of
numerical data via quartiles. Included in the diagram above are
the minimum, lower quartile, median, upper quartile, and maximum.
Using such an approach, the user is able to view and analyze the
differences in datasets, namely temperatures, quickly. This
approach was used in this project to allow for graphical display of
temperatures on different days from different weather stations.
Additionally, line plot graphs were generated using D3.js as
shown in the example below.
Such graphs were used to display the variation of factors like
temperature, precipitation, and snowfall for the selected date
range. Notice that the y-axis of the graph is displayed in tenths of
degrees Celsius.
3. PROJECT DESCRIPTION
3.1 Requirements
Students were to develop a web front end (and possibly a mobile
front end) and an HBase/Hadoop backend that would allow for
query, display, and visualization of Global Historical Climatology
Network weather station data as available at the National
Climatic Data Center.
Boilerplate code allowing for direct HBase connectivity and data
structures to represent GHCN data was made available to students
for this project. Students were provided with a copy of the data
set for this project and were made aware of the location of
additional documentation. On beginning the project, students were
made aware that they would need to 1) work with a
Hadoop/HBase ecosystem, 2) discover how to bulk load the data
into the database, and 3) develop a web application to display data
retrieved from HBase.
3.2 System Preparation
Prior to the start of the project, the faculty mentor identified two
different means of allowing for Hadoop/HBase connectivity.
These included 1) using Cloudera Virtual Machines on university
lab computers and 2) installing Hadoop from scratch on a LittleFe
system. Students found the Cloudera VMs straightforward to use
via the HUE interface. In preparation for bulk loading, however,
the students and faculty mentor discovered that, when run on
un-clustered desktop systems, such VMs lacked sufficient
compute power to complete bulk loading of the data. Students
even tried dividing the data between several VMs on different
desktop computers, but were unsuccessful at processing the data
because they were not able to obtain sole access to lab computers
for batch processing.
The faculty mentor suggested students use the newly built
LittleFe3 machine and install Hadoop and HBase from scratch.
Students agreed to accomplish this task and were provided with
the following resources to install Hadoop 2.6.0 and HBase 0.98.9
on the cluster computer.
Table 2. Hadoop/HBase files provided to students.
Filename                       Description
startHadoop                    Homemade Bash shell script to start Hadoop on all LittleFe nodes
startNodeManagersAndDataNodes  Homemade Bash shell script to start Hadoop manager servers on all LittleFe nodes; called by startHadoop
capacity-scheduler.xml         Site-specific memory/scheduling settings
core-site.xml                  Contains HDFS URL definition
mapred-site.xml                Map-Reduce settings for Hadoop
yarn-site.xml                  YARN resource negotiation settings for Hadoop
hdfs-site.xml                  HDFS settings including replication
masters                        List of Hadoop master nodes
slaves                         List of Hadoop worker nodes
hbase-site.xml                 Contains HBase HDFS locations and list of Zookeeper nodes
regionservers                  List of HBase worker nodes
Students were provided with instruction on how to use a Linux
system, how to write shell scripts, and how to perform basic
system administration. They were then asked to install and start
Hadoop and HBase on LittleFe3 and were successful in doing so.
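The startHadoop script itself is not reproduced here; a dry-run sketch of its general shape follows. The node names come from the block diagram in Section 2.3, but the daemon command is an illustrative assumption, and this version only prints the per-node commands rather than executing them over ssh.

```shell
#!/bin/sh
# Dry-run sketch of a startHadoop-style helper: collects the per-node
# start commands into a string and prints them instead of running ssh.
NODES="node000 node011 node012 node013 node014 node015"
CMDS=""
for n in $NODES; do
  CMDS="$CMDS ssh $n \$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode;"
done
printf '%s\n' "$CMDS"
```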
3.3 Bulk Loading
Students worked to bulk load (i.e., load the entirety of) the data
set into HBase through the use of Hive scripts. This loading took
place in several stages. First, data was parsed and converted into a
comma-separated format using Java. Next, the data was uploaded
into the HDFS metadata store using Hive. Finally, the data was
transferred from Hive into HBase.
Bulk Loading Process:
1. Parse and convert data into a programmer-friendly format.
2. Upload data into the HDFS metadata store via Hive.
3. Transfer data from Hive to HBase.
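A hedged sketch of the Hive stages follows. The target table and column family names follow Table 3 where possible, but the staging column list, CSV path, and exact hbase.columns.mapping are illustrative assumptions, not the project's actual scripts.

```sql
-- Sketch: stage CSV data in Hive, then expose it as an HBase table.
-- Paths and column mappings are illustrative.
CREATE TABLE stations_staging (rowid STRING, latitude DOUBLE, longitude DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hadoop/stations.csv' INTO TABLE stations_staging;

CREATE TABLE stations_hbase (rowid STRING, latitude DOUBLE, longitude DOUBLE)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:latitude,cf:longitude');

INSERT OVERWRITE TABLE stations_hbase SELECT * FROM stations_staging;
```

The HBaseStorageHandler is what lets the final INSERT write Hive rows directly into HBase cells, which is the "transfer" step above.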
Two data sets were bulk loaded into HBase. The first was a list of
all stations with the longitude and latitude coordinates for each
station, along with some additional data that may be useful in
future development. A database diagram showing the
components of this table is provided below.
Table 3. stations_hbase table.
stations_hbase
rowId
cf:latitude
cf:longitude
cf:elevation
cf:state
cf:name
cf:gsnflag
cf:hcnflag
cf:wmoid
The second table loaded into HBase included the minimum and
maximum temperatures as values and the station name, year, and
month as row identifiers.
[allstations_hbase table placeholder]
3.4 Web Application Development
Development of the web application to display heatmap data was
a multi-step process. This involved writing JavaScript code for
the display, acquiring data via a Phoenix JDBC connection,
obtaining the maximum/minimum longitude and latitude through
the Google Maps API, and providing a limited data display.
4. LEARNING OUTCOMES
Standard learning outcomes:
- Learn to work in a team
- Gain real-world programming experience in a safe setting (no risk of being fired)
- Improve communication skills
- Improve programming and systems integration skills

Additional learning outcomes:
- Command-line interface usage
- Shell scripting
- System administration
- SQL queries
- Understanding of various APIs
5. LESSONS LEARNED
Initial attempts:
- Virtual machine implementation is adequate for small data set testing
- Hive

Time spent teaching students about related topics:
- Shell scripting
- Hadoop ecosystem
6. ACKNOWLEDGMENTS
The authors thank Dr. Scott Bell for his support as mentor while
Dr. Monismith assumed the role of a client in the first semester of
the project. The authors also thank the LittleFe team for their
work on the LittleFe system and chassis design.
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
X18134599 mansi chowkkar
X18134599 mansi chowkkarX18134599 mansi chowkkar
X18134599 mansi chowkkar
 
IRJET - Weather Log Analysis based on Hadoop Technology
IRJET - Weather Log Analysis based on Hadoop TechnologyIRJET - Weather Log Analysis based on Hadoop Technology
IRJET - Weather Log Analysis based on Hadoop Technology
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data
2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data
2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data
 
Seeds Poster2
Seeds Poster2Seeds Poster2
Seeds Poster2
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
 
The PRP and Its Applications
The PRP and Its ApplicationsThe PRP and Its Applications
The PRP and Its Applications
 
banian
banianbanian
banian
 
B017320612
B017320612B017320612
B017320612
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics
 
A Query Model for Ad Hoc Queries using a Scanning Architecture
A Query Model for Ad Hoc Queries using a Scanning ArchitectureA Query Model for Ad Hoc Queries using a Scanning Architecture
A Query Model for Ad Hoc Queries using a Scanning Architecture
 
Presentation
PresentationPresentation
Presentation
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Spark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
Spark-MPI: Approaching the Fifth Paradigm with Nikolay MalitskySpark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
Spark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 

GHCNPaper3

At the faculty mentor/client's request, this project involved development of a web application and an HBase/Hadoop backend that would allow for query, display, and visualization of Global Historical Climatology Network (GHCN) weather station data as available at the National Climatic Data Center (NCDC). The overall goals for this project were multi-faceted: 1) to investigate the viability of the LittleFe platform for use with Hadoop and HBase, 2) to teach students about Linux, distributed computing, and processing large data sets, and 3) to investigate new and interesting big data tools. The student authors worked with a faculty client (David Monismith) and were provided with boilerplate code and significant technical assistance from faculty. Students worked with Linux virtual machines for testing and with a modified LittleFe v4d as a deployment system.

The GHCN_All data set was used as the "big" data set for this project. This data set contains daily weather data from nearly 90,000 weather stations, with as many as 100 years of records per location, and is available at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. The data set is 2.3GB compressed and over 20GB as raw plain text.

The bulk of the work on this project included installation of a Hadoop ecosystem on a LittleFe system, discovery of tools to work effectively with the data set, development of queries, and development of a web interface to interact with the data. Since the GHCN data set is relatively large compared to the system specifications of a LittleFe cluster, Hadoop was chosen as the backend processing tool and HBase as the database. Using Hadoop and HBase, data can be queried and retrieved in a straightforward manner; however, SQL-like commands are preferred for data loading and retrieval because of their power. Therefore, students chose to use SQL middleware to provide query access to the database.
Initially, students chose to use Hive to bulk load the GHCN data into HBase with SQL-like queries. After some research, Apache Phoenix was chosen as a replacement for Hive for two reasons: 1) a JDBC connection to Phoenix is available for programmatic access to HBase via SQL-like queries, and 2) Phoenix reportedly has better performance than Hive. Thus, data sets retrieved with Phoenix via JDBC may be sent directly to the front end for display.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. XSEDE'15, July 26–30, 2015, St. Louis, MO, USA. Copyright 2015 held by Owner/Author. Publication Rights Licensed to ACM 1-58113-000-0/00/0010 ...$15.00.

The front end of the application displays temperature data to the user as a heat map, using visualization tools including the Google Maps API, D3.js, Java Server Pages (JSP), Java, and Apache Tomcat. It provides the user with the ability to select a date range and an attribute to display. These tools allow for the display of latitude and longitude values for all stations present in the data set and for the display of user-selected attributes via heat maps, showing a different color tone for each range of values and including a legend. Weather station data for the front end is provided from HBase via dynamic JDBC Phoenix queries issued through Java Server Pages and related Java code.

Students and the faculty mentor/client spent a significant amount of time learning about the tools and data set used within this project. After gaining such domain knowledge, the authors developed a web application to interface with the tools and database presented herein. Finally, the faculty mentor performed a self-assessment of the approach taken. This paper is therefore organized as follows: the project background is described in the second section, a detailed project description is given in the third section, learning outcomes of the project are described in the fourth section, and finally, lessons learned, including difficulties encountered, and conclusions are presented in the final sections of this paper.

2. BACKGROUND
In this section, background information on the data, tools, and scope of the project is provided. This begins with a description of the graduate directed projects course to give the reader an understanding of the duration and scope of the project.
Thereafter, a description of the GHCN_All data set is provided, along with descriptions of the various components of the software/hardware stack used in this project. Components of the software/hardware stack include the Google Maps API, D3.js, Hadoop/HDFS, HBase, Phoenix, and Tomcat.

2.1 Graduate Directed Project
Graduate Directed Projects at Northwest Missouri State University are a two-semester course sequence that serves in place of a thesis for Master of Science in Applied Computer Science students. Students in this course work in teams of approximately ten with a faculty mentor and a client to complete a significant project that encompasses most or all of the software development lifecycle, from requirements gathering and design to testing and maintenance. Standard graduate projects at Northwest often involve gathering project requirements, performing user interface and database design, developing a website or mobile application that interacts with a database or another complex tool, performing unit and integration testing on that application, deploying it to a test server, and performing usability testing on the application.

2.2 Data Set
The GHCN_All data set includes data from many stations, where each station records various daily attributes such as temperature, precipitation, and snowfall. A table describing the format of the GHCN_All data used within this project follows below.

Table 1. GHCN_All Data File Format

Variable             Columns   Type                 Description
Id                   1-11      Character            Station identification code
Year                 12-15     Integer              Record year
Month                16-17     Integer              Record month
Element              18-21     Character            Type of weather observation
Value1               22-26     Integer              Value on the first day of the month
Mflag1               27        Character            Measurement flag
Qflag1               28        Character            Quality flag
Sflag1               29        Character            Source flag
...                  ...       ...                  ...
Value31 to Sflag31   262-269   Integer & Character  Value & flags on the 31st day of the month

As previously mentioned, the GHCN_All data set contains climate data (daily measurements) for nearly 90,000 weather stations, spanning as far back as 100 years for some stations. Data measured at each station is stored in a ".dly" file with the station identifier ("Id" in Table 1) as the filename prefix. Each file contains plain text with one line per month, formatted as shown in Table 1. Daily weather observations include elements such as precipitation in tenths of millimeters (PRCP), snowfall in millimeters (SNOW), snow depth (SNWD), and minimum temperature (TMIN) and maximum temperature (TMAX) in tenths of degrees Celsius.

As mentioned above, the data was bulk loaded into HBase and is retrieved from the database as necessary for visualization. Initially, this data was retrieved on the front end with Apache Thrift; however, the use of Phoenix in conjunction with JDBC, Apache Tomcat (a servlet container), and JavaScript has replaced the need for Thrift. Using Java Server Pages and JavaScript in the front end, visualization with line plots and box plots was achieved with D3.js. In a similar fashion, heat map visualization was achieved via the Google Maps API. Heat map results are displayed by selecting a date range and desired attribute; at the click of a button, weather conditions may be displayed over the given date range.

2.3 Deployment Platform
In this project, a system called LittleFe3 was used as the deployment system.
This system provides a low-cost, parallel, distributed computing environment. It cost approximately $3000 and was built using commercial, off-the-shelf hardware and a custom chassis provided by Earlham College. LittleFe3 runs the BCCD operating system, a Debian Linux variant. The system is a modified version of Earlham's LittleFe v4d and includes six nodes: one head node (node000) and five child nodes (node011 through node015). The block diagram below describes the LittleFe3 system, which has 8GB RAM per node, one 512GB SSD on the head node, five 256GB SSDs (one per child node), and six quad-core Celeron J1900 processors (one per node).
Figure 1: LittleFe3 block diagram

Provided that proper load balancing can be achieved and that code or data operations can be parallelized, a parallel platform such as LittleFe3 may provide both speedup and efficiency. Additionally, on a distributed system such as LittleFe3, both shared-memory and distributed-memory parallelism may be achieved through the use of multiple multi-core systems.

2.4 Software Stack
On the LittleFe3 system, the authors installed a Hadoop software stack that included Hadoop, the HDFS file system, the HBase NoSQL database, Hive, Phoenix, Tomcat, D3.js, and the Google Maps API. The primary software stack is shown in the image below.

Figure 2: Software Stack

2.4.1 Hadoop
Hadoop is a framework for managing distributed storage and performing big data processing. It was developed by the Apache Software Foundation and written in Java so that it operates on many different hardware platforms. Hadoop has several operation modes, including a local/standalone mode, a pseudo-distributed mode, and a fully distributed mode. Included within the framework are modules such as the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.

In this project, Hadoop is used because it provides a parallel computation environment and a distributed file system. As the GHCN data set is quite large, Hadoop proved effective in processing this data. Currently, installation of Hadoop on a LittleFe system is somewhat complex and requires a significant amount of effort. In particular, this includes modifying the capacity-scheduler, core-site.xml, mapred-site.xml, yarn-site.xml, masters, and slaves files. These files allow for initialization of system-dependent scheduling variables, the HDFS location, the HDFS replication factor, identification of master and slave nodes, and memory allocations for various Hadoop components.
Values for these system-dependent variables are provided in the Appendix.

2.4.2 HBase, Hive, and Phoenix
HBase provides random read/write access to data in HDFS in the form of a NoSQL column-store database. This type of database provides data access in a form similar to that of a spreadsheet tool, wherein data is stored using rows, columns, and column families (similar to sheets). Simple operations like get and put allow for direct access to each cell within the column store given the row, column, and/or column family names. Additionally, operations such as scan and list allow for full display of the contents of a table within the database and of all the tables stored therein, respectively. Notably, HBase distributes such data across multiple systems when used in conjunction with Hadoop. HBase is therefore well suited to storing large distributed data sets, especially those that are read and processed many times relative to the number of database writes.

Commands such as get and put are quite primitive when compared to SQL, so many developers prefer to use a middleware tool that provides a SQL layer over HBase for ease of use. Tools such as Apache Hive and Phoenix provide a relational database layer on top of HBase. Hive provides data warehouse access to the distributed storage layer in HDFS, along with the capability to bulk load such stored data into HBase. While Hive is quite useful for bulk loading and data warehousing, Phoenix has proven to be the more useful tool for this project. Phoenix also provides a SQL layer over HBase; however, it additionally provides low-latency JDBC functionality because it uses the HBase API directly, which improves query performance. Empirical results in our project have shown that Phoenix is faster than Hive: where Hive may take many seconds to query even small numbers of rows, Phoenix may take just seconds to query ten million rows.
As our project may require thousands or hundreds of thousands of rows to be retrieved from HBase, Phoenix is preferred because it takes less time to display results. Performance reports published by Apache indicate that Phoenix may be 50-70 times faster than Hive.
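The "dynamic JDBC Phoenix queries" mentioned above amount to composing SQL from the user's selections. The sketch below shows that composition step only; the table and column names (WEATHER, STATION_ID, LAT, LON, OBS_DATE, TMAX) are illustrative assumptions, not the project's actual schema, and in production the values would be bound through a JDBC PreparedStatement rather than concatenated.

```java
// Builds an illustrative Phoenix/SQL query for a map bounding box and a
// date range. Table and column names are hypothetical placeholders.
public class QueryBuilder {
    public static String boundingBoxQuery(double minLat, double maxLat,
                                          double minLon, double maxLon,
                                          String fromDate, String toDate) {
        return "SELECT STATION_ID, LAT, LON, TMAX FROM WEATHER"
             + " WHERE LAT BETWEEN " + minLat + " AND " + maxLat
             + " AND LON BETWEEN " + minLon + " AND " + maxLon
             + " AND OBS_DATE BETWEEN '" + fromDate + "' AND '" + toDate + "'";
    }
}
```

A JSP handler could pass the viewport corners obtained from the Google Maps API into such a builder and forward the result set to the heat-map layer.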
2.4.3 Google Maps API, D3.js, and Tomcat
For front-end work, three web technologies were used in this project: the Google Maps API, D3.js, and Apache Tomcat. The Google Maps API was used for map and weather data display. D3.js was used for visualization of single weather station data over time and for comparison of such data between several weather stations. Finally, Apache Tomcat was used to provide a container for the JDBC-enabled data access layer between the front end and back end.

The Google Maps API provides both the map display layer for the web view and the heat map functionality for the application described in the next section. The map display layer serves an important role: it allows for acquisition of the minimum and maximum longitude and latitude coordinates of the current view. These values are important for displaying the appropriate temperature data because they allow for selection of the appropriate weather stations from the database.

D3.js is a JavaScript library for producing data visualizations, that is, for mapping data sets to images or animations. Using the D3 library, it is possible to produce both static and dynamic visualizations. These may be interactive and can be produced in real time using a standard web technology, JavaScript. Within this project, D3.js was used to create both box plots and line plots. Examples of box plots and line plots as generated in this project are provided below.

Figure 3: D3.js Box Plot

Notice that the box plot allows for graphical display of numerical data via quartiles. Included in the diagram above are the minimum, lower quartile, median, upper quartile, and maximum. Using such an approach, the user is able to quickly view and analyze differences in data sets, namely temperatures. This approach was used in this project to allow for graphical display of temperatures on different days from different weather stations.
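The five-number summary behind each box plot can be computed before handing the data to D3.js. A minimal sketch follows; the quartile convention used here (linear interpolation between ranks) is one of several common choices and may differ from the convention the project's D3 code actually used.

```java
import java.util.Arrays;

// Computes the five-number summary (min, Q1, median, Q3, max) that a
// box plot displays, with quartiles taken by linear interpolation.
public class BoxPlotStats {
    public static double[] fiveNumberSummary(double[] data) {
        double[] s = data.clone();
        Arrays.sort(s);
        return new double[] {
            s[0], quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75),
            s[s.length - 1]
        };
    }

    private static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);   // fractional rank
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }
}
```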
Additionally, line plot graphs were generated using D3.js, as shown in the example below. Such graphs were used to display the variation of factors like temperature, precipitation, and snowfall over the selected date range. Notice that the y-axis of the graph is displayed in tenths of degrees Celsius.

3. PROJECT DESCRIPTION
3.1 Requirements
Students were to develop a web front end (and possibly a mobile front end) and an HBase/Hadoop backend that would allow for query, display, and visualization of Global Historical Climatology Network weather station data as available at the National Climatic Data Center. Boilerplate code allowing for direct HBase connectivity, along with data structures to represent GHCN data, was made available to students for this project. Students were provided with a copy of the data set and were made aware of the location of additional documentation. On beginning the project, students were made aware that they would need to 1) work with a Hadoop/HBase ecosystem, 2) discover how to bulk load the data into the database, and 3) develop a web application to display data retrieved from HBase.

3.2 System Preparation
Prior to the start of the project, the faculty mentor identified two different means of providing Hadoop/HBase connectivity: 1) using Cloudera Virtual Machines on university lab computers and 2) installing Hadoop from scratch on a LittleFe system. Students found the Cloudera VMs straightforward to use through the HUE interface. In preparation for bulk loading, however, the students and faculty mentor discovered that, when used on un-clustered desktop systems, such VMs lacked sufficient compute power to complete bulk loading of the data. Students even tried dividing the data between several VMs on different desktop computers, but were unsuccessful at processing the data because they were not able to obtain sole access to lab computers for batch processing.
The faculty mentor suggested students use the newly built LittleFe3 machine and install Hadoop and HBase from scratch. Students agreed to accomplish this task and were provided with the following resources to install Hadoop 2.6.0 and HBase 0.98.9 on the cluster computer.

Table 2. Hadoop/HBase files provided to students

Filename                       Description
startHadoop                    Homemade Bash shell script to start Hadoop on all LittleFe nodes
startNodeManagersAndDataNodes  Homemade Bash shell script to start Hadoop manager servers on all LittleFe nodes; called by startHadoop
capacity-scheduler.xml         Site-specific memory/scheduling settings
core-site.xml                  Contains the HDFS URL definition
mapred-site.xml                MapReduce settings for Hadoop
yarn-site.xml                  YARN resource negotiation settings for Hadoop
hdfs-site.xml                  HDFS settings, including replication
masters                        List of Hadoop master nodes
slaves                         List of Hadoop worker nodes
hbase-site.xml                 Contains HBase HDFS locations and the list of Zookeeper nodes
regionservers                  List of HBase worker nodes
Students were provided with instruction on how to use a Linux system, how to write shell scripts, and how to perform basic system administration. They were then asked to install and start Hadoop and HBase on LittleFe3 and were successful in doing so.

3.3 Bulk Loading
Students worked to bulk load the data set (i.e., load the entirety of a large data set) into HBase through the use of Hive scripts. This loading took place in several stages:

1. Parse and convert the data into a programmer-friendly, comma-separated format using Java.
2. Upload the data into the HDFS metadata store via Hive.
3. Transfer the data from Hive into HBase.

Two data sets were bulk loaded into HBase. The first was a list of all stations with the longitude and latitude coordinates of each station, along with some additional data that may be useful in future development. A database diagram showing the components of this table is provided below.

Table 3. stations_hbase table

stations_hbase: rowId, cf:latitude, cf:longitude, cf:elevation, cf:state, cf:name, cf:gsnflag, cf:hcnflag, cf:wmoid

The second table loaded into HBase included the minimum and maximum temperatures as values and the station name, year, and month as row identifiers. (allstations_hbase table goes here)

3.4 Web Application Development
Development of the web application to display heat map data was a multi-step process. This involved writing JavaScript code for the display and the following steps:

1. Acquiring data via a Phoenix JDBC connection.
2. Obtaining the maximum/minimum longitude and latitude via the Google Maps API.
3. Limited data display.
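The first stage of the bulk-loading process in Section 3.3, parsing the fixed-width .dly records of Table 1 and re-emitting them as comma-separated rows, can be sketched as follows. The output layout (id, year, month, day, element, value) is an illustrative assumption, not necessarily the exact format the project used.

```java
import java.util.ArrayList;
import java.util.List;

// Converts one fixed-width GHCN ".dly" record (Table 1 layout: 11-char id,
// 4-char year, 2-char month, 4-char element, then 31 groups of a 5-char
// value and 3 flag characters) into per-day CSV rows of the assumed form
// "id,year,month,day,element,value", skipping missing (-9999) days.
public class DlyToCsv {
    public static List<String> toCsv(String line) {
        String id      = line.substring(0, 11).trim();
        String year    = line.substring(11, 15).trim();
        String month   = line.substring(15, 17).trim();
        String element = line.substring(17, 21).trim();
        List<String> rows = new ArrayList<>();
        for (int day = 0; day < 31; day++) {
            int start = 21 + 8 * day;   // each day occupies 8 columns
            int value = Integer.parseInt(line.substring(start, start + 5).trim());
            if (value == -9999) continue;   // missing observation
            rows.add(String.join(",", id, year, month,
                                 String.valueOf(day + 1), element,
                                 String.valueOf(value)));
        }
        return rows;
    }
}
```

Note that temperature elements (TMAX, TMIN) remain in tenths of degrees Celsius here, so a value of 250 represents 25.0 degrees.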
4. LEARNING OUTCOMES
Standard learning outcomes:
- Learn to work in a team.
- Gain real-world programming experience in a safe setting (no risk of being fired).
- Improve communication skills.
- Improve programming and systems integration skills.

Additional learning outcomes:
- Command line interface use.
- Shell scripting.
- System administration.
- SQL queries.
- Understanding of various APIs.

5. LESSONS LEARNED
Initial attempts:
- Virtual machine implementation is adequate for small data set testing.
- Hive.

Time was spent teaching students about related topics:
- Shell scripting.
- The Hadoop ecosystem.

6. ACKNOWLEDGMENTS
The authors thank Dr. Scott Bell for his support as mentor while Dr. Monismith assumed the role of a client in the first semester of the project. The authors also thank the LittleFe team for their work on the LittleFe system and chassis design.

7. REFERENCES
[1] LittleFe. http://littlefe.net/
[2] LittleFe buildout. http://littlefe.net/buildout
[3] Apache Hadoop. http://en.wikipedia.org/wiki/Apache_Hadoop
[4] Hadoop definition. http://searchcloudcomputing.techtarget.com/definition/Hadoop
[5] Hadoop. http://www.sas.com/en_us/insights/big-data/hadoop.html
[6] D3.js gallery. https://github.com/mbostock/d3/wiki/Gallery
[7] D3 book. http://chimera.labs.oreilly.com/books/1230000000345/ch01.html
[8] Apache Hadoop. http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
[9] Apache HBase. http://hbase.apache.org/
[10] D3.js box plot example. http://bl.ocks.org/mbostock/4061502
[11] Google Maps API. https://developers.google.com/maps/documentation/javascript/tutorial
[12] Apache Thrift. https://thrift.apache.org/
[13] GHCN. http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn