Using LittleFe for “Big” Data Education: An Experience
Report with GHCN
David Monismith (monismi, a)
Bala Venkata Paneendra Abburi (S519288, b)
Achyuth Chaitanya Chitumalla (S519322, b)
Santhosh Reddy Damasani (S519323, b)
Satyanarayana Juttiga (S519343, b)
Snehitha Reddy Padakanti (S519400, b)
Tejaswi Potu (S519411, b)
Sandeep Raghavareddy (S519414, b)
Susmith Reddy Siddipeta (S519421, b)
Prashanthi Kanneboina (S519464, b)
Spandana Sama (S519474, b)
Northwest Missouri State University, 800 University Dr., CH 2050, Maryville, MO 64468, +1 (660) 562-1802
(a) @nwmissouri.edu | (b) @mail.nwmissouri.edu
ABSTRACT
This paper provides an experience report of a graduate directed
project that began in Fall 2014 at Northwest Missouri State
University as a two-semester experiment to determine if a "Big"
Data project could be accomplished with a LittleFe cluster
computer. Described herein are the details of the project, the
hardware, and the software stack. The project itself is based upon
existing projects that make use of the GHCN data set, an
approximately 20GB raw text data set consisting of daily climatic
data spanning over 100 years. The students on this project
attempted to bulk load the data onto a modified LittleFe v4d
cluster computer using Hive, HBase, and Hadoop/HDFS. The end
result of the project was to be an interactive website that would
allow for display and animation of the climatic data using Apache
Tomcat, Java Server Pages, the Google Maps API, and D3.js.
After describing the project and the approaches to solving this
problem, lessons learned and future applications for this and
similar projects are described.
Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information
Science Education – Computer science education, curriculum,
self-assessment.
General Terms
Algorithms, Design, Experimentation.
Keywords
Computer Science Education, Big Data, LittleFe.
1. INTRODUCTION
In the fall of 2014, several students began their graduate directed
project at Northwest Missouri State University. At the faculty
mentor/client's request, this project involved development of a
web application and an HBase/Hadoop backend that would allow
for query, display, and visualization of Global Historical
Climatology Network (GHCN) weather station data as available at
the National Climatic Data Center (NCDC). The overall goals for
this project were multi-faceted: 1) to investigate the viability of
the LittleFe platform for use with Hadoop and HBase, 2) to teach
students about Linux, distributed computing, and processing large
data sets, and 3) to investigate new and interesting big data tools.
The student authors worked with a faculty client (David
Monismith) and were provided with boilerplate code and
significant technical assistance from faculty for this project.
Students worked with Linux virtual machines for testing and with
a modified LittleFe v4d as a deployment system. The GHCN_All
data set was used to represent "big" data for this project. This
data set contains daily weather data from nearly 90,000 weather
stations for as many as 100 years per location, and is available at
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. The size of this data
set is 2.3GB compressed and over 20GB raw plain text. The bulk
of the work on this project included installation of a Hadoop
ecosystem on a LittleFe system, discovery of tools to effectively
work with the data set, development of queries, and development
of a web interface to interact with the data.
Since the Global Historical Climatology Network consists of a
relatively large amount of data as compared to the system
specifications of a LittleFe cluster, Hadoop was chosen as a
backend processing tool and HBase was chosen as a database.
Using Hadoop and HBase, the data is queried and retrieved in a
straightforward manner; however, for data loading and retrieval,
SQL-like commands are preferred because of their expressive power.
Therefore, students chose to use SQL middleware to allow for
query access into the database. Initially, students chose to use
Hive to bulk load the GHCN data into HBase with SQL-like
queries. After some research, Apache Phoenix was chosen as a
replacement for Hive for two reasons – 1) a JDBC connection to
Phoenix is available for programmatic access to HBase via SQL-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
XSEDE'15, July 26–30, 2015, St. Louis, MO, USA.
Copyright 2015 held by Owner/Author. Publication Rights Licensed to
ACM 1-58113-000-0/00/0010 …$15.00.
like queries and 2) Phoenix reportedly has better performance
than Hive. Thus data sets retrieved with Phoenix via JDBC may
be sent directly to the front end for display.
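As a rough illustration of this retrieval path, a date-range query against Phoenix might be assembled as below. This is a sketch only: the table and column names (WEATHER, STATION_ID, OBS_DATE, TMAX) are hypothetical placeholders rather than the project's actual schema, and the JDBC calls are shown in comments because they require a running Phoenix/HBase cluster.

```java
// Sketch: building a Phoenix SQL query for a station/date-range lookup.
// WEATHER, STATION_ID, OBS_DATE, and TMAX are illustrative names only.
public class PhoenixQuerySketch {
    static String buildRangeQuery(String table, String startDate, String endDate) {
        // Phoenix accepts standard SQL; in production, dates would be bound
        // through a PreparedStatement rather than concatenated as strings.
        return "SELECT STATION_ID, OBS_DATE, TMAX FROM " + table
             + " WHERE OBS_DATE BETWEEN '" + startDate + "' AND '" + endDate + "'";
    }

    public static void main(String[] args) {
        String sql = buildRangeQuery("WEATHER", "2014-01-01", "2014-01-31");
        System.out.println(sql);
        // Against a live cluster this would be executed roughly as:
        //   Connection conn = DriverManager.getConnection("jdbc:phoenix:node000");
        //   ResultSet rs = conn.createStatement().executeQuery(sql);
    }
}
```

Because Phoenix speaks JDBC, the same result set can then be serialized (e.g., through a JSP) straight to the front end without an intermediate layer such as Thrift.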
The front end of the application displays temperature data to the
user with a heat map using visualization tools including Google
Maps API, D3.js, Java Server Pages (JSP), Java, and Apache
Tomcat. The front end provides the user with the ability to select
a date range and a desired attribute. These tools allow for the
display of latitude and longitude values for all stations present in
the dataset and for the display of user-selected attributes via heat
maps, showing a different color tone for each range of values and
including a legend.
legend. Weather station data for the front end is provided from
HBase via dynamic JDBC Phoenix queries provided through Java
Server Pages (JSP) and related Java code.
Students and the faculty mentor/client spent a significant amount
of time learning about the tools and dataset used within this
project. After gaining such domain knowledge, the authors
developed a web application to interface with the tools and
database presented herein. Finally, the faculty mentor performed
a self-assessment of the approach taken. This paper is organized
as follows: the project background is described in the second
section; a detailed project description follows in the third section;
learning outcomes of the project are described in the fourth
section; and finally, lessons learned, including difficulties
encountered, and conclusions are presented in the final sections of
this paper.
2. BACKGROUND
In this section, background information on the data, tools and
scope of the project is provided. This first includes a description
of the graduate directed projects course to provide the reader with
an understanding of the duration and scope of the project.
Thereafter, a description of the GHCN_ALL dataset is provided,
and descriptions for the various components of the
software/hardware stack used in this project are also provided.
Components used in the software/hardware stack include the
Google Maps API, D3.js, Hadoop/HDFS, HBase, Phoenix, and
Tomcat.
2.1 Graduate Directed Project
Graduate Directed Projects at Northwest Missouri State
University are a two-semester course sequence that serves in place
of a thesis project for Master of Science in Applied Computer
Science students. Students in this course work in teams of
approximately ten students with a faculty mentor and a client to
complete a significant project that encompasses most or all of the
software development lifecycle ranging from requirements
gathering and design to testing and maintenance. Standard
graduate projects at Northwest often involve gathering project
requirements, performing user interface and database design,
developing a website or mobile application that interacts with a
database or another complex tool, performing unit and integration
testing on that application, deploying it to a test server, and
performing usability testing.
2.2 Data Set
The GHCN_All dataset includes data representing different
stations, wherein each station includes various daily attributes
such as temperature, precipitation, snowfall, etc. A table
describing the format of the data used within this project from the
GHCN_All dataset follows below.
Table 1. GHCN_All Data File Format
Variable          Columns   Type                 Description
Id                1-11      Character            Station identification code
Year              12-15     Integer              Record year
Month             16-17     Integer              Record month
Element           18-21     Character            Type of weather observation
Value1            22-26     Integer              Value on the first day of the month
Mflag1            27        Character            Measurement flag
Qflag1            28        Character            Quality flag
Sflag1            29        Character            Source flag
...               ...       ...                  ...
Value31-Sflag31   262-269   Integer & Character  Value & flags on the 31st day of the month
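Because the layout in Table 1 is fixed-width, a record can be decoded with plain substring calls at the listed column offsets. The sketch below illustrates this for the first few fields; the sample line in it is synthetic, constructed only to match the layout, not taken from the data set.

```java
// Sketch: decoding one GHCN .dly record using the offsets in Table 1.
// Offsets are 1-indexed inclusive in the table, so column range a-b
// maps to substring(a-1, b) in Java.
public class DlyParserSketch {
    static String[] parse(String line) {
        String id      = line.substring(0, 11).trim();   // cols 1-11
        String year    = line.substring(11, 15);         // cols 12-15
        String month   = line.substring(15, 17);         // cols 16-17
        String element = line.substring(17, 21);         // cols 18-21
        String value1  = line.substring(21, 26).trim();  // cols 22-26
        return new String[] { id, year, month, element, value1 };
    }

    public static void main(String[] args) {
        // Synthetic sample: station id, year, month, element, day-1 value.
        String sample = "USC00011084" + "1956" + "07" + "TMAX" + "  289";
        System.out.println(String.join(",", parse(sample)));
        // prints: USC00011084,1956,07,TMAX,289
    }
}
```

A full parser would continue through the Value/Mflag/Qflag/Sflag groups out to column 269, one group per day of the month.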
As previously mentioned, the GHCN_All dataset contains climate
data (daily measurements) for nearly 90,000 weather stations
spanning as far back as 100 years for some stations. Data
measured at each station is stored in a “.dly” file with the station
identifier (“id” in Table 1) as the filename prefix. Each file
contains plain text data with each line in the file containing the
information as shown in Table 1 with one line of data per month.
Daily weather observations include elements such as precipitation
in tenths of millimeters (PRCP), snowfall in millimeters (SNOW),
snow depth (SNWD), minimum temperature (TMIN), and
maximum temperature (TMAX) in tenths of degrees Celsius. As
mentioned above, the data was bulk loaded into HBase and is
retrieved from the database as necessary for visualization.
Initially, this data was retrieved on the front end with Apache
Thrift, however, the use of Phoenix in conjunction with JDBC,
Apache Tomcat (a servlet container), and JavaScript, has replaced
the need for Thrift. By using Java Server Pages and JavaScript in
the front end, visualization with line plots and box plots was
achieved with D3.js. In a similar fashion, heat map visualization
was achieved via the Google Maps API. Heat map results are
displayed by selecting a date range and desired attribute. At the
click of a button, weather conditions may be displayed over the
given date range.
2.3 Deployment Platform
In this project, a system called LittleFe3 was used as the
deployment system. This system was used to provide a low-cost,
parallel, distributed computing environment. The cost of this
system was approximately $3000, and it was built using
commercial, off-the-shelf hardware and a custom chassis provided
by Earlham College. The LittleFe3 system makes use of the
BCCD operating system, a Debian Linux variant. This system is
a modified version of Earlham's LittleFe v4d system and includes
six nodes: one head node (node000) and five child nodes (node011
through node015). The block diagram below describes the
LittleFe3 system, which has 8GB RAM per node, one 512GB SSD
on the head node, five 256GB SSDs (one per child node), and six
quad-core Celeron J1900 processors (one per node).
Figure 1: LittleFe3 block diagram
Provided that proper load balancing may be achieved and that
code or data operations can be parallelized, a parallel platform
such as LittleFe3 may provide both speedup and efficiency.
Additionally, on a distributed system such as LittleFe3, both
shared memory and distributed memory parallelism may be
achieved through the use of multiple multi-core systems.
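For reference, these are the textbook definitions rather than measurements from this project: with T1 the runtime on one core and Tp the runtime on p cores, speedup and efficiency are

```latex
S = \frac{T_1}{T_p}, \qquad E = \frac{S}{p}
```

so ideal (linear) scaling corresponds to S = p and E = 1.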
2.4 Software Stack
On the LittleFe3 system, the authors of this paper installed a
Hadoop software stack that included Hadoop, the HDFS file
system, the HBase NoSQL database, Hive, Phoenix, Tomcat,
D3.js, and the Google Maps API. The primary software stack is
shown in the image below.
Figure 2: Software Stack
2.4.1 Hadoop
Hadoop consists of algorithms that may be used to manage
distributed storage and to perform big data processing. Hadoop
was developed by the Apache Software Foundation, and was
written in Java in a manner such that it will operate on many
different hardware platforms. There are a number of different
operation modes for Hadoop including a local/standalone mode, a
pseudo-distributed mode, and a fully distributed mode. Included
within this framework are modules such as the Hadoop
Distributed File System (HDFS) and Hadoop MapReduce. In this
project, Hadoop is used because it provides a parallel computation
environment and a distributed file system. As the GHCN dataset
being used is quite large, Hadoop has proven effective in
processing this data. Currently, installing Hadoop on a LittleFe
system is somewhat complex and requires a significant amount of
effort; in particular, this includes modifying the capacity-scheduler,
core-site.xml, mapred-site.xml, yarn-site.xml, masters, and slaves
files. These files allow for initialization
of system dependent scheduling variables, the HDFS location, the
HDFS replication factor, identification of master and slave nodes,
and memory allocations for various Hadoop components. Values
for these system dependent variables are provided in the
Appendix.
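To illustrate the kind of site-specific values involved, a minimal pair of configuration fragments might look as follows. The hostname matches the node naming used here, but the port, path, and replication factor are illustrative assumptions rather than the exact values from the Appendix.

```xml
<!-- core-site.xml: points all nodes at the head node's HDFS namenode. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node000:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a small replication factor suits a six-node cluster. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```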
2.4.2 HBase, Hive, and Phoenix
HBase provides random read/write access to HDFS in the form of
a NoSQL column store database. This type of database provides
data access in a form similar to that of a spreadsheet tool wherein
data is stored using rows, columns, and column families (similar
to sheets). Simple operations like get and put allow for direct
access to each cell within the column store, given the row,
column, and/or column family names. Additionally, operations
such as scan and list allow for full display of the contents of a
table within the database and of all the tables stored therein,
respectively. Importantly, HBase allows such data to be
distributed across multiple systems when used in conjunction with
Hadoop. Therefore, HBase is well suited to storing large
distributed datasets, especially datasets that are read and
processed many times relative to the number of database writes.
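To make these operations concrete, a short HBase shell session might look as follows. The table name and column family match the stations_hbase schema shown later in Table 3, but the row key and value are illustrative only, and the commands require a running HBase instance.

```
# Illustrative HBase shell commands (not runnable outside HBase).
create 'stations_hbase', 'cf'                  # table with one column family
put 'stations_hbase', 'US1MOND0001', 'cf:latitude', '40.35'
get 'stations_hbase', 'US1MOND0001'            # fetch one row
scan 'stations_hbase'                          # dump the whole table
list                                           # show all tables
```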
Commands such as get and put are quite primitive when compared
to SQL, so many developers prefer to use a middleware tool that
provides a SQL layer over HBase for ease of use. Tools such as
Apache Hive and Phoenix provide a relational database layer on
top of HBase. First, Hive provides data warehouse access to the
distributed storage layer in HDFS. It provides the capability to
bulk load such stored data into HBase. While Hive is quite useful
for bulk loading and data warehousing, Phoenix has proven to be
a more useful tool for this project. Phoenix also provides an SQL
layer over HBase; however, it additionally provides low-latency
JDBC functionality because it uses the HBase API directly, which
increases query performance. Empirical results in our project
have shown that Phoenix is faster than Hive: where Hive may
take seconds to query small numbers of rows, Phoenix may take
just seconds to query ten million rows. As our project may
require thousands or hundreds of thousands of rows to be
retrieved from HBase, Phoenix is preferred because it takes less
time to display results. Performance reports published by Apache
indicate Phoenix may be 50-70 times faster than Hive.
2.4.3 Google Maps API, D3.js, and Tomcat
For front-end work, three web APIs were used in this project –
Google Maps API, D3.js, and Apache Tomcat. The Google Maps
API was used for map and weather data display. D3.js was used
for data visualization of single weather station data over time and
for comparison of such data between several weather stations.
Finally, Apache Tomcat was used to provide a servlet container
for the JDBC-enabled data access layer between the front end and
back end.
The Google Maps API was used both to provide a map display
layer for the web view and to provide heatmap functionality for
the application described in the next section. The map display layer
functionality provides an important role – it allows for acquisition
of the minimum and maximum longitude and latitude coordinates.
These two values are important in allowing for display of the
appropriate temperature data because they allow for selection of
the appropriate weather stations from the database.
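The station-selection step those bounds enable amounts to a simple range test, sketched below in plain Java. In the project itself this filtering happens in an SQL WHERE clause against HBase/Phoenix; the method and class names here are illustrative, not the project's actual code.

```java
// Sketch: selecting stations that fall inside the map viewport's
// bounding box, defined by min/max latitude and longitude.
public class BoundsFilterSketch {
    static boolean inBounds(double lat, double lon,
                            double minLat, double maxLat,
                            double minLon, double maxLon) {
        return lat >= minLat && lat <= maxLat
            && lon >= minLon && lon <= maxLon;
    }

    public static void main(String[] args) {
        // Maryville, MO (~40.35, -94.87) against a Midwest-sized viewport.
        System.out.println(inBounds(40.35, -94.87, 38.0, 42.0, -96.0, -92.0));
    }
}
```

The same two coordinate pairs translate directly into BETWEEN clauses in the Phoenix query, so only stations visible on screen are fetched.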
D3.js is a JavaScript library for producing data visualizations, that
is, mapping datasets to images or animations. Using the D3
library, it is possible to produce both static and dynamic
visualizations. These may be interactive and can be produced in
real time using a standard web technology – JavaScript. Within
this project, D3.js was used to create both Box Plots and Line
Plots. Examples of both box plots and line plots as generated in
this project are provided below.
Figure 3: D3.js Box Plot
Notice that the box plot allows for a graphical display of
numerical data via quartiles. Included in the diagram above are
the minimum, lower quartile, median, upper quartile, and maximum.
Using such an approach, the user is able to view and analyze the
differences in datasets, namely temperatures, quickly. This
approach was used in this project to allow for graphical display of
temperatures on different days from different weather stations.
Additionally, line plot graphs were generated using D3.js as
shown in the example below.
Such graphs were used to display the variation of factors like
temperature, precipitation, and snowfall for the selected date
range. Notice that the y-axis of the graph is displayed in tenths of
degrees Celsius.
3. PROJECT DESCRIPTION
3.1 Requirements
Students were to develop a web front end (and possibly a mobile
front end) and an HBase/Hadoop backend that would allow for
query, display, and visualization of Global Historical Climatology
Network weather station data as available at the National
Climatic Data Center.
Boilerplate code allowing for direct HBase connectivity and data
structures to represent GHCN data was made available to students
for this project. Students were provided with a copy of the data
set for this project and were made aware of the location of
additional documentation. On beginning the project, students were
made aware that they would need to 1) work with a
Hadoop/HBase ecosystem, 2) discover how to bulk load the data
into the database, and 3) develop a web application to display data
retrieved from HBase.
3.2 System Preparation
Prior to the start of the project, the faculty mentor identified two
different means of allowing for Hadoop/HBase connectivity.
These included 1) using Cloudera Virtual Machines on university
lab computers and 2) installing Hadoop from scratch on a LittleFe
system. Students found the Cloudera VMs straightforward to use
via the HUE interface. In preparation for bulk loading, however,
the students and faculty mentor discovered that, when run on
un-clustered desktop systems, such VMs lacked sufficient
compute power to complete bulk loading of the data. Students
even tried dividing the data between several VMs on different
desktop computers, but were unsuccessful at processing the data
because they were not able to obtain sole access to lab computers
for batch processing.
The faculty mentor suggested students use the newly built
LittleFe3 machine and install Hadoop and HBase from scratch.
Students agreed to accomplish this task and were provided with
the following resources to install Hadoop 2.6.0 and HBase 0.98.9
on the cluster computer.
Table 2. Hadoop/HBase files provided to students.
Filename                       Description
startHadoop                    Homemade Bash shell script to start Hadoop on all LittleFe nodes
startNodeManagersAndDataNodes  Homemade Bash shell script to start Hadoop manager servers on all LittleFe nodes; called by startHadoop
capacity-scheduler.xml         Site-specific memory/scheduling settings
core-site.xml                  Contains HDFS URL definition
mapred-site.xml                Map-Reduce settings for Hadoop
yarn-site.xml                  YARN resource negotiation settings for Hadoop
hdfs-site.xml                  HDFS settings including replication
masters                        List of Hadoop master nodes
slaves                         List of Hadoop worker nodes
hbase-site.xml                 Contains HBase HDFS locations and list of Zookeeper nodes
regionservers                  List of HBase worker nodes
Students were provided with instruction on how to use a Linux
system, how to write shell scripts, and how to perform basic
system administration. They were then asked to install and start
Hadoop and HBase on LittleFe3 and were successful in doing so.
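The startHadoop script itself is not reproduced here; a dry-run sketch of its general shape follows. The node names come from the block diagram in Section 2.3, but the daemon command is an illustrative assumption, and this version only prints the per-node commands rather than executing them over ssh.

```shell
#!/bin/sh
# Dry-run sketch of a startHadoop-style helper: collects the per-node
# start commands into a string and prints them instead of running ssh.
NODES="node000 node011 node012 node013 node014 node015"
CMDS=""
for n in $NODES; do
  CMDS="$CMDS ssh $n \$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode;"
done
printf '%s\n' "$CMDS"
```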
3.3 Bulk Loading
Students worked to bulk load (i.e., load the entirety of) the data
set into HBase through the use of Hive scripts. This loading took
place in several stages. First, data was parsed and converted into a
comma-separated format using Java. Next, the data was uploaded
into the HDFS metadata store using Hive. Finally, the data was
transferred from Hive into HBase.
Bulk Loading Process:
1. Parse and convert data into a programmer-friendly format.
2. Upload data into the HDFS metadata store via Hive.
3. Transfer data from Hive to HBase.
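A hedged sketch of the Hive stages follows. The target table and column family names follow Table 3 where possible, but the staging column list, CSV path, and exact hbase.columns.mapping are illustrative assumptions, not the project's actual scripts.

```sql
-- Sketch: stage CSV data in Hive, then expose it as an HBase table.
-- Paths and column mappings are illustrative.
CREATE TABLE stations_staging (rowid STRING, latitude DOUBLE, longitude DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hadoop/stations.csv' INTO TABLE stations_staging;

CREATE TABLE stations_hbase (rowid STRING, latitude DOUBLE, longitude DOUBLE)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:latitude,cf:longitude');

INSERT OVERWRITE TABLE stations_hbase SELECT * FROM stations_staging;
```

The HBaseStorageHandler is what lets the final INSERT write Hive rows directly into HBase cells, which is the "transfer" step above.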
Two data sets were bulk loaded into HBase. The first was a list of
all stations with the longitude and latitude coordinates for each
station, along with some additional data that may be useful in
future development. A database diagram showing the
components of this table is provided below.
Table 3. stations_hbase table.
stations_hbase
rowId
cf:latitude
cf:longitude
cf:elevation
cf:state
cf:name
cf:gsnflag
cf:hcnflag
cf:wmoid
The second table loaded into HBase included the minimum and
maximum temperatures as values and the station name, year, and
month as row identifiers.
[allstations_hbase table placeholder]
3.4 Web Application Development
Development of the web application to display heatmap data was
a multi-step process. This involved writing JavaScript code for
the display, acquiring data via a Phoenix JDBC connection,
obtaining the maximum/minimum longitude and latitude through
the Google Maps API, and providing a limited data display.
4. LEARNING OUTCOMES
Standard learning outcomes:
- Learn to work in a team
- Gain real-world programming experience in a safe setting (no risk of being fired)
- Improve communication skills
- Improve programming and systems integration skills

Additional learning outcomes:
- Command-line interface usage
- Shell scripting
- System administration
- SQL queries
- Understanding of various APIs
5. LESSONS LEARNED
Initial attempts:
- Virtual machine implementation is adequate for small data set testing
- Hive

Time spent teaching students about related topics:
- Shell scripting
- Hadoop ecosystem
6. ACKNOWLEDGMENTS
The authors thank Dr. Scott Bell for his support as mentor while
Dr. Monismith assumed the role of a client in the first semester of
the project. The authors also thank the LittleFe team for their
work on the LittleFe system and chassis design.
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
X18134599 mansi chowkkar
X18134599 mansi chowkkarX18134599 mansi chowkkar
X18134599 mansi chowkkar
 
IRJET - Weather Log Analysis based on Hadoop Technology
IRJET - Weather Log Analysis based on Hadoop TechnologyIRJET - Weather Log Analysis based on Hadoop Technology
IRJET - Weather Log Analysis based on Hadoop Technology
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data
2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data
2004-10-15 SHAirED: Services for Helping the Air-quality Community use ESE Data
 
Seeds Poster2
Seeds Poster2Seeds Poster2
Seeds Poster2
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
 
The PRP and Its Applications
The PRP and Its ApplicationsThe PRP and Its Applications
The PRP and Its Applications
 
banian
banianbanian
banian
 
B017320612
B017320612B017320612
B017320612
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics
 
A Query Model for Ad Hoc Queries using a Scanning Architecture
A Query Model for Ad Hoc Queries using a Scanning ArchitectureA Query Model for Ad Hoc Queries using a Scanning Architecture
A Query Model for Ad Hoc Queries using a Scanning Architecture
 
Presentation
PresentationPresentation
Presentation
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Spark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
Spark-MPI: Approaching the Fifth Paradigm with Nikolay MalitskySpark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
Spark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 

GHCNPaper3

At the faculty mentor/client's request, this project involved development of a web application and an HBase/Hadoop backend that would allow for query, display, and visualization of Global Historical Climatology Network (GHCN) weather station data as available at the National Climatic Data Center (NCDC). The overall goals for this project were multi-faceted: 1) to investigate the viability of the LittleFe platform for use with Hadoop and HBase, 2) to teach students about Linux, distributed computing, and processing large data sets, and 3) to investigate new and interesting big data tools. The student authors worked with a faculty client (David Monismith) and were provided with boilerplate code and significant technical assistance from faculty. Students worked with Linux virtual machines for testing and with a modified LittleFe v4d as a deployment system.

The GHCN_All data set was used as the "big" data set for this project. This data set contains daily weather data from nearly 90,000 weather stations, with as many as 100 years of records per location, and is available at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. The data set is 2.3GB compressed and over 20GB as raw plain text.

The bulk of the work on this project included installation of a Hadoop ecosystem on a LittleFe system, discovery of tools to work effectively with the data set, development of queries, and development of a web interface to interact with the data. Since the GHCN data set is relatively large compared to the system specifications of a LittleFe cluster, Hadoop was chosen as the backend processing tool and HBase as the database. Using Hadoop and HBase, data can be queried and retrieved in a straightforward manner; however, SQL-like commands are preferred for data loading and retrieval because of their power. Therefore, students chose to use SQL middleware to provide query access to the database.
Initially, students chose to use Hive to bulk load the GHCN data into HBase with SQL-like queries. After some research, Apache Phoenix was chosen as a replacement for Hive for two reasons: 1) a JDBC connection to Phoenix is available for programmatic access to HBase via SQL-like queries, and 2) Phoenix reportedly has better performance than Hive. Thus, data sets retrieved with Phoenix via JDBC may be sent directly to the front end for display.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. XSEDE'15, July 26–30, 2015, St. Louis, MO, USA. Copyright 2015 held by Owner/Author. Publication Rights Licensed to ACM 1-58113-000-0/00/0010 ...$15.00.

The front end of the application displays temperature data to the user as a heat map, using visualization tools including the Google Maps API, D3.js, Java Server Pages (JSP), Java, and Apache Tomcat. It provides the user with the ability to select a date range and an attribute to display. These tools allow for the display of latitude and longitude values for all stations present in the data set and for the display of user-selected attributes via heat maps, showing a different color tone for each range of values and including a legend. Weather station data for the front end is provided from HBase via dynamic JDBC Phoenix queries issued through Java Server Pages and related Java code.

Students and the faculty mentor/client spent a significant amount of time learning about the tools and data set used within this project. After gaining such domain knowledge, the authors developed a web application to interface with the tools and database presented herein. Finally, the faculty mentor performed a self-assessment of the approach taken. This paper is therefore organized as follows: the project background is described in the second section, a detailed project description is given in the third section, learning outcomes of the project are described in the fourth section, and finally, lessons learned, including difficulties encountered, and conclusions are presented in the final sections of this paper.

2. BACKGROUND
In this section, background information on the data, tools, and scope of the project is provided. This begins with a description of the graduate directed projects course to give the reader an understanding of the duration and scope of the project.
Thereafter, a description of the GHCN_All data set is provided, along with descriptions of the various components of the software/hardware stack used in this project. Components of the software/hardware stack include the Google Maps API, D3.js, Hadoop/HDFS, HBase, Phoenix, and Tomcat.

2.1 Graduate Directed Project
Graduate Directed Projects at Northwest Missouri State University are a two-semester course sequence that serves in place of a thesis for Master of Science in Applied Computer Science students. Students in this course work in teams of approximately ten with a faculty mentor and a client to complete a significant project that encompasses most or all of the software development lifecycle, from requirements gathering and design to testing and maintenance. Standard graduate projects at Northwest often involve gathering project requirements, performing user interface and database design, developing a website or mobile application that interacts with a database or another complex tool, performing unit and integration testing on that application, deploying it to a test server, and performing usability testing on the application.

2.2 Data Set
The GHCN_All data set includes data from many stations, where each station records various daily attributes such as temperature, precipitation, and snowfall. A table describing the format of the GHCN_All data used within this project follows below.

Table 1. GHCN_All Data File Format

Variable             Columns   Type                 Description
Id                   1-11      Character            Station identification code
Year                 12-15     Integer              Record year
Month                16-17     Integer              Record month
Element              18-21     Character            Type of weather observation
Value1               22-26     Integer              Value on the first day of the month
Mflag1               27        Character            Measurement flag
Qflag1               28        Character            Quality flag
Sflag1               29        Character            Source flag
...                  ...       ...                  ...
Value31 to Sflag31   262-269   Integer & Character  Value & flags on the 31st day of the month

As previously mentioned, the GHCN_All data set contains climate data (daily measurements) for nearly 90,000 weather stations, spanning as far back as 100 years for some stations. Data measured at each station is stored in a ".dly" file with the station identifier ("Id" in Table 1) as the filename prefix. Each file contains plain text with one line per month, formatted as shown in Table 1. Daily weather observations include elements such as precipitation in tenths of millimeters (PRCP), snowfall in millimeters (SNOW), snow depth (SNWD), and minimum temperature (TMIN) and maximum temperature (TMAX) in tenths of degrees Celsius.

As mentioned above, the data was bulk loaded into HBase and is retrieved from the database as necessary for visualization. Initially, this data was retrieved on the front end with Apache Thrift; however, the use of Phoenix in conjunction with JDBC, Apache Tomcat (a servlet container), and JavaScript has replaced the need for Thrift. Using Java Server Pages and JavaScript in the front end, visualization with line plots and box plots was achieved with D3.js. In a similar fashion, heat map visualization was achieved via the Google Maps API. Heat map results are displayed by selecting a date range and desired attribute; at the click of a button, weather conditions may be displayed over the given date range.

2.3 Deployment Platform
In this project, a system called LittleFe3 was used as the deployment system.
This system provides a low-cost, parallel, distributed computing environment. It cost approximately $3000 and was built using commercial, off-the-shelf hardware and a custom chassis provided by Earlham College. LittleFe3 runs the BCCD operating system, a Debian Linux variant. The system is a modified version of Earlham's LittleFe v4d and includes six nodes: one head node (node000) and five child nodes (node011 through node015). The block diagram below describes the LittleFe3 system, which has 8GB RAM per node, one 512GB SSD on the head node, five 256GB SSDs (one per child node), and six quad-core Celeron J1900 processors (one per node).
Figure 1: LittleFe3 block diagram

Provided that proper load balancing can be achieved and that code or data operations can be parallelized, a parallel platform such as LittleFe3 may provide both speedup and efficiency. Additionally, on a distributed system such as LittleFe3, both shared-memory and distributed-memory parallelism may be achieved through the use of multiple multi-core systems.

2.4 Software Stack
On the LittleFe3 system, the authors installed a Hadoop software stack that included Hadoop, the HDFS file system, the HBase NoSQL database, Hive, Phoenix, Tomcat, D3.js, and the Google Maps API. The primary software stack is shown in the image below.

Figure 2: Software Stack

2.4.1 Hadoop
Hadoop is a framework for managing distributed storage and performing big data processing. It was developed by the Apache Software Foundation and written in Java so that it operates on many different hardware platforms. Hadoop has several operation modes, including a local/standalone mode, a pseudo-distributed mode, and a fully distributed mode. Included within the framework are modules such as the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.

In this project, Hadoop is used because it provides a parallel computation environment and a distributed file system. As the GHCN data set is quite large, Hadoop proved effective in processing this data. Currently, installation of Hadoop on a LittleFe system is somewhat complex and requires a significant amount of effort. In particular, this includes modifying the capacity-scheduler, core-site.xml, mapred-site.xml, yarn-site.xml, masters, and slaves files. These files allow for initialization of system-dependent scheduling variables, the HDFS location, the HDFS replication factor, identification of master and slave nodes, and memory allocations for various Hadoop components.
Values for these system-dependent variables are provided in the Appendix.

2.4.2 HBase, Hive, and Phoenix
HBase provides random read/write access to data in HDFS in the form of a NoSQL column-store database. This type of database provides data access in a form similar to that of a spreadsheet tool, wherein data is stored using rows, columns, and column families (similar to sheets). Simple operations like get and put allow for direct access to each cell within the column store given the row, column, and/or column family names. Additionally, operations such as scan and list allow for full display of the contents of a table within the database and of all the tables stored therein, respectively. Notably, HBase distributes such data across multiple systems when used in conjunction with Hadoop. HBase is therefore well suited to storing large distributed data sets, especially those that are read and processed many times relative to the number of database writes.

Commands such as get and put are quite primitive when compared to SQL, so many developers prefer to use a middleware tool that provides a SQL layer over HBase for ease of use. Tools such as Apache Hive and Phoenix provide a relational database layer on top of HBase. Hive provides data warehouse access to the distributed storage layer in HDFS, along with the capability to bulk load such stored data into HBase. While Hive is quite useful for bulk loading and data warehousing, Phoenix has proven to be the more useful tool for this project. Phoenix also provides a SQL layer over HBase; however, it additionally provides low-latency JDBC functionality because it uses the HBase API directly, which improves query performance. Empirical results in our project have shown that Phoenix is faster than Hive: where Hive may take many seconds to query even small numbers of rows, Phoenix may take just seconds to query ten million rows.
As our project may require thousands or hundreds of thousands of rows to be retrieved from HBase, Phoenix is preferred because it takes less time to display results. Performance reports published by Apache indicate that Phoenix may be 50-70 times faster than Hive.
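The "dynamic JDBC Phoenix queries" mentioned above amount to composing SQL from the user's selections. The sketch below shows that composition step only; the table and column names (WEATHER, STATION_ID, LAT, LON, OBS_DATE, TMAX) are illustrative assumptions, not the project's actual schema, and in production the values would be bound through a JDBC PreparedStatement rather than concatenated.

```java
// Builds an illustrative Phoenix/SQL query for a map bounding box and a
// date range. Table and column names are hypothetical placeholders.
public class QueryBuilder {
    public static String boundingBoxQuery(double minLat, double maxLat,
                                          double minLon, double maxLon,
                                          String fromDate, String toDate) {
        return "SELECT STATION_ID, LAT, LON, TMAX FROM WEATHER"
             + " WHERE LAT BETWEEN " + minLat + " AND " + maxLat
             + " AND LON BETWEEN " + minLon + " AND " + maxLon
             + " AND OBS_DATE BETWEEN '" + fromDate + "' AND '" + toDate + "'";
    }
}
```

A JSP handler could pass the viewport corners obtained from the Google Maps API into such a builder and forward the result set to the heat-map layer.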
2.4.3 Google Maps API, D3.js, and Tomcat
For front-end work, three web technologies were used in this project: the Google Maps API, D3.js, and Apache Tomcat. The Google Maps API was used for map and weather data display. D3.js was used for visualization of single weather station data over time and for comparison of such data between several weather stations. Finally, Apache Tomcat was used to provide a container for the JDBC-enabled data access layer between the front end and back end.

The Google Maps API provides both the map display layer for the web view and the heat map functionality for the application described in the next section. The map display layer serves an important role: it allows for acquisition of the minimum and maximum longitude and latitude coordinates of the current view. These values are important for displaying the appropriate temperature data because they allow for selection of the appropriate weather stations from the database.

D3.js is a JavaScript library for producing data visualizations, that is, for mapping data sets to images or animations. Using the D3 library, it is possible to produce both static and dynamic visualizations. These may be interactive and can be produced in real time using a standard web technology, JavaScript. Within this project, D3.js was used to create both box plots and line plots. Examples of box plots and line plots as generated in this project are provided below.

Figure 3: D3.js Box Plot

Notice that the box plot allows for graphical display of numerical data via quartiles. Included in the diagram above are the minimum, lower quartile, median, upper quartile, and maximum. Using such an approach, the user is able to quickly view and analyze differences in data sets, namely temperatures. This approach was used in this project to allow for graphical display of temperatures on different days from different weather stations.
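The five-number summary behind each box plot can be computed before handing the data to D3.js. A minimal sketch follows; the quartile convention used here (linear interpolation between ranks) is one of several common choices and may differ from the convention the project's D3 code actually used.

```java
import java.util.Arrays;

// Computes the five-number summary (min, Q1, median, Q3, max) that a
// box plot displays, with quartiles taken by linear interpolation.
public class BoxPlotStats {
    public static double[] fiveNumberSummary(double[] data) {
        double[] s = data.clone();
        Arrays.sort(s);
        return new double[] {
            s[0], quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75),
            s[s.length - 1]
        };
    }

    private static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);   // fractional rank
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }
}
```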
Additionally, line plot graphs were generated using D3.js, as shown in the example below. Such graphs were used to display the variation of factors like temperature, precipitation, and snowfall over the selected date range. Notice that the y-axis of the graph is displayed in tenths of degrees Celsius.

3. PROJECT DESCRIPTION
3.1 Requirements
Students were to develop a web front end (and possibly a mobile front end) and an HBase/Hadoop backend that would allow for query, display, and visualization of Global Historical Climatology Network weather station data as available at the National Climatic Data Center. Boilerplate code allowing for direct HBase connectivity, along with data structures to represent GHCN data, was made available to students for this project. Students were provided with a copy of the data set and were made aware of the location of additional documentation. On beginning the project, students were made aware that they would need to 1) work with a Hadoop/HBase ecosystem, 2) discover how to bulk load the data into the database, and 3) develop a web application to display data retrieved from HBase.

3.2 System Preparation
Prior to the start of the project, the faculty mentor identified two different means of providing Hadoop/HBase connectivity: 1) using Cloudera Virtual Machines on university lab computers and 2) installing Hadoop from scratch on a LittleFe system. Students found the Cloudera VMs straightforward to use through the HUE interface. In preparation for bulk loading, however, the students and faculty mentor discovered that, when used on un-clustered desktop systems, such VMs lacked sufficient compute power to complete bulk loading of the data. Students even tried dividing the data between several VMs on different desktop computers, but were unsuccessful at processing the data because they were not able to obtain sole access to lab computers for batch processing.
The faculty mentor suggested students use the newly built LittleFe3 machine and install Hadoop and HBase from scratch. Students agreed to accomplish this task and were provided with the following resources to install Hadoop 2.6.0 and HBase 0.98.9 on the cluster computer.

Table 2. Hadoop/HBase files provided to students

Filename                       Description
startHadoop                    Homemade Bash shell script to start Hadoop on all LittleFe nodes
startNodeManagersAndDataNodes  Homemade Bash shell script to start Hadoop manager servers on all LittleFe nodes; called by startHadoop
capacity-scheduler.xml         Site-specific memory/scheduling settings
core-site.xml                  Contains the HDFS URL definition
mapred-site.xml                MapReduce settings for Hadoop
yarn-site.xml                  YARN resource negotiation settings for Hadoop
hdfs-site.xml                  HDFS settings, including replication
masters                        List of Hadoop master nodes
slaves                         List of Hadoop worker nodes
hbase-site.xml                 Contains HBase HDFS locations and the list of Zookeeper nodes
regionservers                  List of HBase worker nodes
Students were provided with instruction on how to use a Linux system, how to write shell scripts, and how to perform basic system administration. They were then asked to install and start Hadoop and HBase on LittleFe3 and were successful in doing so.

3.3 Bulk Loading
Students worked to bulk load the data set (i.e., load the entirety of a large data set) into HBase through the use of Hive scripts. This loading took place in several stages:

1. Parse and convert the data into a programmer-friendly, comma-separated format using Java.
2. Upload the data into the HDFS metadata store via Hive.
3. Transfer the data from Hive into HBase.

Two data sets were bulk loaded into HBase. The first was a list of all stations with the longitude and latitude coordinates of each station, along with some additional data that may be useful in future development. A database diagram showing the components of this table is provided below.

Table 3. stations_hbase table

stations_hbase: rowId, cf:latitude, cf:longitude, cf:elevation, cf:state, cf:name, cf:gsnflag, cf:hcnflag, cf:wmoid

The second table loaded into HBase included the minimum and maximum temperatures as values and the station name, year, and month as row identifiers. (allstations_hbase table goes here)

3.4 Web Application Development
Development of the web application to display heat map data was a multi-step process. This involved writing JavaScript code for the display and the following steps:

1. Acquiring data via a Phoenix JDBC connection.
2. Obtaining the maximum/minimum longitude and latitude via the Google Maps API.
3. Limited data display.
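The first stage of the bulk-loading process in Section 3.3, parsing the fixed-width .dly records of Table 1 and re-emitting them as comma-separated rows, can be sketched as follows. The output layout (id, year, month, day, element, value) is an illustrative assumption, not necessarily the exact format the project used.

```java
import java.util.ArrayList;
import java.util.List;

// Converts one fixed-width GHCN ".dly" record (Table 1 layout: 11-char id,
// 4-char year, 2-char month, 4-char element, then 31 groups of a 5-char
// value and 3 flag characters) into per-day CSV rows of the assumed form
// "id,year,month,day,element,value", skipping missing (-9999) days.
public class DlyToCsv {
    public static List<String> toCsv(String line) {
        String id      = line.substring(0, 11).trim();
        String year    = line.substring(11, 15).trim();
        String month   = line.substring(15, 17).trim();
        String element = line.substring(17, 21).trim();
        List<String> rows = new ArrayList<>();
        for (int day = 0; day < 31; day++) {
            int start = 21 + 8 * day;   // each day occupies 8 columns
            int value = Integer.parseInt(line.substring(start, start + 5).trim());
            if (value == -9999) continue;   // missing observation
            rows.add(String.join(",", id, year, month,
                                 String.valueOf(day + 1), element,
                                 String.valueOf(value)));
        }
        return rows;
    }
}
```

Note that temperature elements (TMAX, TMIN) remain in tenths of degrees Celsius here, so a value of 250 represents 25.0 degrees.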
4. LEARNING OUTCOMES
Standard learning outcomes:
- Learn to work in a team.
- Gain real-world programming experience in a safe setting (no risk of being fired).
- Improve communication skills.
- Improve programming and systems integration skills.

Additional learning outcomes:
- Command line interface use.
- Shell scripting.
- System administration.
- SQL queries.
- Understanding of various APIs.

5. LESSONS LEARNED
Initial attempts:
- Virtual machine implementation is adequate for small data set testing.
- Hive.

Time was spent teaching students about related topics:
- Shell scripting.
- The Hadoop ecosystem.

6. ACKNOWLEDGMENTS
The authors thank Dr. Scott Bell for his support as mentor while Dr. Monismith assumed the role of a client in the first semester of the project. The authors also thank the LittleFe team for their work on the LittleFe system and chassis design.

7. REFERENCES
[1] LittleFe. http://littlefe.net/
[2] LittleFe buildout. http://littlefe.net/buildout
[3] Apache Hadoop. http://en.wikipedia.org/wiki/Apache_Hadoop
[4] Hadoop definition. http://searchcloudcomputing.techtarget.com/definition/Hadoop
[5] Hadoop. http://www.sas.com/en_us/insights/big-data/hadoop.html
[6] D3.js gallery. https://github.com/mbostock/d3/wiki/Gallery
[7] D3 book. http://chimera.labs.oreilly.com/books/1230000000345/ch01.html
[8] Apache Hadoop. http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
[9] Apache HBase. http://hbase.apache.org/
[10] D3.js box plot example. http://bl.ocks.org/mbostock/4061502
[11] Google Maps API. https://developers.google.com/maps/documentation/javascript/tutorial
[12] Apache Thrift. https://thrift.apache.org/
[13] GHCN. http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn