Using LittleFe for “Big” Data Education: An Experience Report with GHCN
David Monismith (monismia)
Bala Venkata Paneendra Abburi (S519288b)
Achyuth Chaitanya Chitumalla (S519322b)
Santhosh Reddy Damasani (S519323b)
Satyanarayana Juttiga (S519343b)
Snehitha Reddy Padakanti (S519400b)
Tejaswi Potu (S519411b)
Sandeep Raghavareddy (S519414b)
Susmith Reddy Siddipeta (S519421b)
Prashanthi Kanneboina (S519464b)
Spandana Sama (S519474b)
Northwest Missouri State University, 800 University Dr., CH 2050, Maryville, MO 64468, +1 (660) 562-1802
a@nwmissouri.edu | b@mail.nwmissouri.edu
ABSTRACT
This paper provides an experience report of a graduate directed
project that began in Fall 2014 at Northwest Missouri State
University as a two-semester experiment to determine if a “Big”
Data project could be accomplished with a LittleFe cluster
computer. Described herein are the details of the project, the
hardware, and the software stack. The project itself is based upon
existing projects that make use of the GHCN data set – an ~20GB
raw text data set consisting of daily climatic data spanning over
100 years. The students on this project attempted to bulk load the
data on to a modified LittleFe v4d cluster computer using Hive,
HBase and Hadoop/HDFS. The end result of the project was to
be an interactive website that would allow for display and
animation of the climactic data using Apache Tomcat, Java Server
Pages, Google Maps API, and D3.js. After describing the project
and the approaches to solving this problem, lessons learned and
future applications for this and similar projects are described.
Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information
Science Education – Computer science education, curriculum,
self-assessment.
General Terms
Algorithms, Design, Experimentation.
Keywords
Computer Science Education, Big Data, LittleFe.
1. INTRODUCTION
In the fall of 2014, several students began their graduate directed
project at Northwest Missouri State University. At the faculty
mentor/client’s request, this project involved development of a
web application and an HBase/Hadoop backend that would allow
for query, display, and visualization of Global Historical Climatology
Network (GHCN) weather station data as available at the National
Climatic Data Center (NCDC). The overall goals for this project
were multi-faceted – 1) to investigate the viability of the LittleFe
platform for use with Hadoop and HBase, 2) to teach students
about Linux, distributed computing, and processing large data
sets, and 3) to investigate new and interesting big data tools.
The student authors worked with a faculty client (David
Monismith) and were provided with boilerplate code and
significant technical assistance from faculty for this project.
Students worked with Linux virtual machines for testing and with
a modified LittleFe v4d as a deployment system. The GHCN_All
data set was used to represent a “big” data set for this project. This
data set contains daily weather data from nearly 90,000 weather
stations for as many as 100 years per location, and is available at
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. The size of this data
set is 2.3GB compressed and over 20GB raw plain text. The bulk
of the work on this project included installation of a Hadoop
ecosystem on a LittleFe system, discovery of tools to effectively
work with the data set, development of queries, and development
of a web interface to interact with the data.
Since the Global Historical Climatology Network consists of a
relatively large amount of data as compared to the system
specifications of a LittleFe cluster, Hadoop was chosen as a
backend processing tool and HBase was chosen as a database.
Using Hadoop and HBase, the data is queried and retrieved in a
straightforward manner; however, for data loading and retrieval,
SQL-like commands are preferred because of their power.
Therefore, students chose to use SQL middleware to allow for
query access into the database. Initially, students chose to use
Hive to bulk load the GHCN data into HBase with SQL-like
queries. After some research, Apache Phoenix was chosen as a
replacement for Hive for two reasons – 1) a JDBC connection to
Phoenix is available for programmatic access to HBase via SQL-
like queries and 2) Phoenix reportedly has better performance
than Hive. Thus data sets retrieved with Phoenix via JDBC may
be sent directly to the front end for display.
The front end of the application displays temperature data to the
user with a heat map using visualization tools including Google
Maps API, D3.js, Java Server Pages (JSP), Java, and Apache
Tomcat. The front end provides the user with the ability to select
a date range and a desired attribute. These tools allow for the display
of latitude and longitude values for all stations present in the dataset
and for the display of user-selected attributes via heat maps, showing a
different color tone for different ranges of values and including a
legend. Weather station data for the front end is provided from
HBase via dynamic JDBC Phoenix queries provided through Java
Server Pages (JSP) and related Java code.
Students and the faculty mentor/client spent a significant amount
of time learning about the tools and dataset used within this
project. After gaining such domain knowledge, the authors
developed a web application to interface with the tools and
database presented herein. Finally, the faculty mentor performed
a self-assessment of the approach taken herein. Therefore, this
paper is organized such that the project background is described in
the second section. Following sections include a detailed project
description in the third section, a description of learning outcomes
of the project in the fourth section, and finally, lessons learned,
including difficulties encountered and conclusions are presented
in the final sections of this paper.
2. BACKGROUND
In this section, background information on the data, tools and
scope of the project is provided. This first includes a description
of the graduate directed projects course to provide the reader with
an understanding of the duration and scope of the project.
Thereafter, a description of the GHCN_ALL dataset is provided,
and descriptions for the various components of the
software/hardware stack used in this project are also provided.
Components used in the software/hardware stack include the
Google Maps API, D3.js, Hadoop/HDFS, HBase, Phoenix, and
Tomcat.
2.1 Graduate Directed Project
Graduate Directed Projects at Northwest Missouri State
University are a two-semester course sequence that serves in place
of a thesis project for Master of Science in Applied Computer
Science students. Students in this course work in teams of
approximately ten students with a faculty mentor and a client to
complete a significant project that encompasses most or all of the
software development lifecycle ranging from requirements
gathering and design to testing and maintenance. Standard
graduate projects at Northwest often involve gathering project
requirements, performing user interface and database design,
developing a website or mobile application that interacts with a
database or another complex tool, performing unit and integration
testing of that application, deploying it to a test server, and
performing usability testing on the application.
2.2 Data Set
The GHCN_All dataset includes data representing different
stations, wherein each station includes various daily attributes
such as temperature, precipitation, snowfall, etc. A table
describing the format of the data used within this project from the
GHCN_All dataset follows below.
Table 1. GHCN_All Data File Format

Variable             Columns   Type                  Description
Id                   1-11      Character             Station identification code
Year                 12-15     Integer               Record year
Month                16-17     Integer               Record month
Element              18-21     Character             Type of weather observation
Value1               22-26     Integer               Value on the first day of the month
Mflag1               27        Character             Measurement flag
Qflag1               28        Character             Quality flag
Sflag1               29        Character             Source flag
…                    …         …                     …
Value31 to Sflag31   262-269   Integer & Character   Value & flags on the 31st day of the month
As previously mentioned, the GHCN_All dataset contains climate
data (daily measurements) for nearly 90,000 weather stations
spanning as far back as 100 years for some stations. Data
measured at each station is stored in a “.dly” file with the station
identifier (“id” in Table 1) as the filename prefix. Each file
contains plain text data with each line in the file containing the
information as shown in Table 1 with one line of data per month.
Daily weather observations include elements such as precipitation
in tenths of millimeters (PRCP), snowfall in millimeters (SNOW),
snow depth (SNWD), minimum temperature (TMIN), and
maximum temperature (TMAX) in tenths of degrees Celsius. As
mentioned above the data was bulk loaded into HBase and is
retrieved from the database as necessary for visualization.
Initially, this data was retrieved on the front end with Apache
Thrift; however, the use of Phoenix in conjunction with JDBC,
Apache Tomcat (a servlet container), and JavaScript, has replaced
the need for Thrift. By using Java Server Pages and JavaScript in
the front end, visualization with line plots and box plots was
achieved with D3.js. In a similar fashion, heat map visualization
was achieved via the Google Maps API. Heat map results are
displayed by selecting a date range and desired attribute. At the
click of a button, weather conditions may be displayed over the
given date range.
2.3 Deployment Platform
In this project, a system called LittleFe3 was used as the
deployment system, providing a low-cost, parallel, distributed
computing environment. The cost of this
system was approximately $3000, and it was built using
commercial, off-the-shelf hardware and a custom chassis provided
by Earlham College. The LittleFe3 system makes use of the
BCCD operating system – a Debian Linux variant. This system is
a modified version of Earlham’s LittleFe v4d system, and includes
6 nodes, one of which is the head node (node000), and 5 are
child nodes (node011 through node015). The block diagram
below describes the LittleFe3 system, which has 8GB RAM per
node, 1x512GB SSD on the head node, 5x256GB SSDs (one per
child node), and 6x quad core Celeron J1900 Processors (one per
node).
Figure 1: LittleFe3 block diagram
Provided that proper load balancing may be achieved and that
code or data operations can be parallelized, a parallel platform
such as LittleFe3 may provide both speedup and efficiency.
Additionally, on a distributed system such as LittleFe3, both
shared memory and distributed memory parallelism may be
achieved through the use of multiple multi-core systems.
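For reference, the standard definitions implied here (not results reported in the paper) are the speedup on p processors and the corresponding parallel efficiency:

\[ S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \]

where \(T_1\) is the runtime on a single core and \(T_p\) the runtime on p cores; a workload scales well on a small cluster like LittleFe3 when \(E(p)\) stays close to 1 as nodes are added.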
2.4 Software Stack
On the LittleFe3 system, the authors of this paper installed a
Hadoop software stack that included Hadoop, an HDFS file
system, the HBase NoSQL database, Hive, Phoenix, Tomcat,
D3.js, and the Google Maps API. The primary software stack is shown in
the image below.
Figure 2: Software Stack
2.4.1 Hadoop
Hadoop is a framework that may be used to manage
distributed storage and to perform big data processing. Hadoop
was developed by the Apache Software Foundation, and was
written in Java in a manner such that it will operate on many
different hardware platforms. There are a number of different
operation modes for Hadoop including a local/standalone mode, a
pseudo-distributed mode, and a fully distributed mode. Included
within this framework are modules including the Hadoop
Distributed File System (HDFS) and Hadoop MapReduce. In this
project, Hadoop is used because it provides a parallel computation
environment and distributed file system. As the dataset being
used, GHCN, is quite large, Hadoop has proven effective in
processing this data. Currently, installation of Hadoop on a
LittleFe system is somewhat complex and requires a significant
amount of effort. In particular, this includes modifying
capacity-scheduler.xml, core-site.xml, mapred-site.xml, yarn-
site.xml, masters, and slaves. These files allow for initialization
of system dependent scheduling variables, the HDFS location, the
HDFS replication factor, identification of master and slave nodes,
and memory allocations for various Hadoop components. Values
for these system dependent variables are provided in the
Appendix.
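As an illustration of the kind of settings these files carry, a minimal core-site.xml and hdfs-site.xml might look like the sketch below. The host name, port, and replication factor shown are hypothetical placeholders, not the site-specific values from the Appendix.

```xml
<!-- core-site.xml: points Hadoop clients and daemons at the HDFS namenode
     (host and port here are hypothetical placeholders) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node000:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: HDFS block replication factor (hypothetical value) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```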
2.4.2 HBase, Hive, and Phoenix
HBase provides random read/write access to HDFS in the form of
a NoSQL column store database. This type of database provides
data access in a form similar to that of a spreadsheet tool wherein
data is stored using rows, columns, and column families (similar
to sheets). Simple operations like get and put allow for direct
access to each cell within the column store provided the row,
column, and/or column family names. Additionally, operations
such as scan and list allow for full display of the contents of a
table within the database and of all the tables stored therein,
respectively. Interestingly, HBase provides for such data to be
distributed across multiple systems when used in conjunction with
Hadoop. Therefore, HBase is well suited to store large distributed
datasets, especially those datasets where such data may be read
and processed relatively many times when compared to the
number of database writes.
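To make the get/put access pattern concrete, the following is a minimal sketch using the HBase 0.98-era Java client API (the HBase version installed in this project). The row key and cell value are hypothetical; the column family and qualifier follow the stations_hbase layout described in Section 3.3.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml (ZooKeeper quorum, HDFS root) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        HConnection connection = HConnectionManager.createConnection(conf);
        HTableInterface table = connection.getTable("stations_hbase");

        // put: store the latitude of one (hypothetical) station under column family "cf".
        Put put = new Put(Bytes.toBytes("USC00231037"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("latitude"), Bytes.toBytes("40.35"));
        table.put(put);

        // get: read the same cell back given the row key, column family, and qualifier.
        Get get = new Get(Bytes.toBytes("USC00231037"));
        Result result = table.get(get);
        byte[] latitude = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("latitude"));
        System.out.println("latitude = " + Bytes.toString(latitude));

        table.close();
        connection.close();
    }
}
```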
Commands such as get and put are quite primitive when compared
to SQL, so many developers prefer to use a middleware tool that
provides a SQL layer over HBase for ease of use. Tools such as
Apache Hive and Phoenix provide a relational database layer on
top of HBase. First, Hive provides data warehouse access to the
distributed storage layer in HDFS. It provides the capability to
bulk load such stored data into HBase. While Hive is quite useful
for bulk loading and data warehousing, Phoenix has proven to be
a more useful tool for this project. Phoenix also provides an SQL
layer over HBase; however, it also provides low-latency JDBC
functionality because it uses the HBase API directly. This
increases query performance for the project. Empirical results in our
project have shown that Phoenix is faster than Hive: where
Hive may take seconds to query small numbers of rows,
Phoenix may take just seconds to query ten million rows. As
our project may require thousands or hundreds of thousands of
rows to be retrieved from HBase, Phoenix is preferred because it
takes less time to display results. Theoretical performance reports
from Apache indicate Phoenix may be 50-70 times faster than
Hive.
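As a concrete illustration of this access path, the sketch below issues a parameterized query through Phoenix's JDBC driver. The table and column names (ALLSTATIONS, STATION, OBS_YEAR, OBS_MONTH, TMAX) and the station identifier are hypothetical stand-ins for the schema actually loaded in this project; only the connection URL form (jdbc:phoenix:<ZooKeeper quorum>) and the JDBC calls themselves are standard.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // Explicitly register the Phoenix JDBC driver (also auto-registered when the
        // Phoenix client jar is on the classpath).
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

        // "node000" is assumed to be the ZooKeeper quorum host on the LittleFe3 head node.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:node000")) {
            // Hypothetical query: fetch maximum temperatures for one station in one year.
            String sql = "SELECT station, obs_year, obs_month, tmax "
                       + "FROM allstations WHERE station = ? AND obs_year = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "USC00231037");
                ps.setInt(2, 2014);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s %d-%02d TMAX=%d%n",
                                rs.getString(1), rs.getInt(2), rs.getInt(3), rs.getInt(4));
                    }
                }
            }
        }
    }
}
```

A result set produced this way can be handed directly to the JSP/servlet layer for display, which is the property that made Phoenix attractive over Hive for interactive use.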
2.4.3 Google Maps API, D3.js, and Tomcat
For front-end work, three web APIs were used in this project –
Google Maps API, D3.js, and Apache Tomcat. The Google Maps
API was used for map and weather data display. D3.js was used
for data visualization of single weather station data over time and
for comparison of such data between several weather stations.
Finally, Apache Tomcat was used to provide a container for the
JDBC-enabled data access layer between the front end and back
end.
Google Maps API was used to provide both a map display layer
for web view and to provide heatmap functionality for the
application described in the next section. The map display layer
functionality plays an important role – it allows for acquisition
of the minimum and maximum longitude and latitude coordinates.
These two values are important in allowing for display of the
appropriate temperature data because they allow for selection of
the appropriate weather stations from the database.
D3.js is a JavaScript library for producing data visualizations, that
is, mapping datasets to images or animations. Using the D3
library, it is possible to produce both static and dynamic
visualizations. These may be interactive and can be produced in
real time using a standard web technology – JavaScript. Within
this project, D3.js was used to create both Box Plots and Line
Plots. Examples of both box plots and line plots as generated in
this project are provided below.
Figure 3: D3.js Box Plot
Notice that the box plot allows for a graphical display of numerical
data via quartiles. Included in the diagram above are the
minimum, lower quartile, median, upper quartile, and maximum.
Using such an approach, the user is able to view and analyze the
differences in datasets, namely temperatures, quickly. This
approach was used in this project to allow for graphical display of
temperatures on different days from different weather stations.
Additionally, line plot graphs were generated using D3.js as
shown in the example below.
Such graphs were used to display the variation of factors like
temperature, precipitation, and snowfall for the selected date
range. Notice that the y-axis of the graph is displayed in tenths of
degrees Celsius.
3. PROJECT DESCRIPTION
3.1 Requirements
Students were to develop a web front end (and possibly a mobile front end)
and an HBase/Hadoop backend that would allow for query, display,
and visualization of global historical climatic weather station
data as available at the National Climatic Data Center.
Boilerplate code allowing for direct HBase connectivity and data
structures to represent GHCN data was made available to students
for this project. Students were provided with a copy of the data
set for this project and were made aware of the location of
additional documentation. On beginning the project students were
made aware that they would need to 1) work with a
Hadoop/HBase ecosystem, 2) discover how to bulk load the data
into the database, and 3) develop a web application to display data
retrieved from HBase.
3.2 System Preparation
Prior to the start of the project, the faculty mentor identified two
different means of allowing for Hadoop/HBase connectivity.
These included 1) using Cloudera Virtual Machines on university
lab computers and 2) installing Hadoop from scratch on a LittleFe
system. Students found the Cloudera VMs straightforward to use
through the use of the HUE interface. In preparation for bulk
loading, the students and faculty mentor discovered that when
used on un-clustered desktop systems, such VMs lacked sufficient
compute power to complete bulk loading of data. Students even
tried dividing this data between several VMs on different desktop
computers, but were unsuccessful at processing the data because
they were not able to obtain sole access to lab computers for batch
processing.
The faculty mentor suggested students use the newly built
LittleFe3 machine and install Hadoop and HBase from scratch.
Students agreed to accomplish this task and were provided with
the following resources to install Hadoop 2.6.0 and HBase 0.98.9
on the cluster computer.
Table 2. Hadoop/HBase files provided to students.

Filename                        Description
startHadoop                     Homemade Bash shell script to start Hadoop on all LittleFe nodes
startNodeManagersAndDataNodes   Homemade Bash shell script to start Hadoop manager servers on all LittleFe nodes – called by startHadoop
capacity-scheduler.xml          Site-specific memory/scheduling settings
core-site.xml                   Contains HDFS URL definition
mapred-site.xml                 Map-Reduce settings for Hadoop
yarn-site.xml                   YARN resource negotiation settings for Hadoop
hdfs-site.xml                   HDFS settings including replication
masters                         List of Hadoop master nodes
slaves                          List of Hadoop worker nodes
hbase-site.xml                  Contains HBase HDFS locations and list of ZooKeeper nodes
regionservers                   List of HBase worker nodes
Students were provided with instruction on how to use a Linux
system, how to write shell scripts, and how to perform basic
system administration. They were then asked to install and start
Hadoop and HBase on LittleFe3 and were successful in doing so.
3.3 Bulk Loading
Students worked to bulk load the data set (i.e., load the entirety of the
large data set) into HBase through the use of Hive scripts. This loading took
place in several stages. First, data was parsed and converted into a
comma-separated format using Java. Next, the data was uploaded
into the HDFS Metadata store, using Hive. Finally, such data was
transferred from Hive into HBase.
Bulk loading process:
1. Parse and convert data into a programmer-friendly format (a sketch of this step is given below).
2. Upload data into the HDFS metadata store via Hive.
3. Transfer data from Hive to HBase.
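The paper does not include the students' converter, but the first step can be sketched as a small Java program that walks the fixed-width layout of Table 1 and emits one CSV record per observation. The output layout and the handling of flags below are simplifications chosen here for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

// Converts one GHCN .dly file from the fixed-width layout of Table 1 into CSV rows of
// (station, year, month, day, element, value). Illustrative sketch only; the project's
// actual converter also carried the measurement/quality/source flags.
public class DlyToCsv {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             BufferedWriter out = new BufferedWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.substring(0, 11);        // columns 1-11
                String year = line.substring(11, 15);     // columns 12-15
                String month = line.substring(15, 17);    // columns 16-17
                String element = line.substring(17, 21);  // columns 18-21
                for (int day = 0; day < 31; day++) {
                    int start = 21 + day * 8;             // each day: 5-char value + 3 flag columns
                    String value = line.substring(start, start + 5).trim();
                    if (value.equals("-9999")) continue;  // -9999 marks a missing observation
                    out.write(String.join(",", id, year, month,
                            String.valueOf(day + 1), element, value));
                    out.newLine();
                }
            }
        }
    }
}
```

The resulting CSV can then be loaded into a Hive table and transferred into HBase, as in steps 2 and 3 above.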
Two data sets were bulk loaded into HBase. These included a list
of all stations with the longitude and latitude coordinates for each
station along with some additional data that may be useful in
future development. A database diagram showing the
components of this table is provided below.
Table 3. stations_hbase table.
stations_hbase: rowId | cf:latitude | cf:longitude | cf:elevation | cf:state | cf:name | cf:gsnflag | cf:hcnflag | cf:wmoid
The second table loaded into HBase included the minimum and
maximum temperatures as values and the station name, year, and
month as row identifiers.
****allstations_hbase table goes here
3.4 Web Application Development
Development of the web application to display heatmap data was
a multi-step process. This involved writing JavaScript code for the
display along with the following steps:
- Acquiring data via a Phoenix JDBC connection
- Obtaining the max/min longitude and latitude via the Google Maps API
- Limited data display
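The data-acquisition step can be illustrated with a hedged sketch of the kind of Tomcat data-access code implied above: a servlet that receives the visible map bounds (as produced by the Google Maps API), queries station coordinates through Phoenix over JDBC, and returns JSON for the heat map layer. The servlet name, table name, and column names are hypothetical; only the JDBC and Servlet API calls are standard.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical data-access servlet; it would be mapped in web.xml or with @WebServlet.
public class StationsInBoundsServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Map bounds passed by the page from the Google Maps API.
        double minLat = Double.parseDouble(req.getParameter("minLat"));
        double maxLat = Double.parseDouble(req.getParameter("maxLat"));
        double minLng = Double.parseDouble(req.getParameter("minLng"));
        double maxLng = Double.parseDouble(req.getParameter("maxLng"));

        resp.setContentType("application/json");
        PrintWriter out = resp.getWriter();
        StringBuilder json = new StringBuilder("[");

        // "stations" is a hypothetical Phoenix table/view over the station data in HBase.
        String sql = "SELECT id, latitude, longitude FROM stations "
                   + "WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?";
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:node000");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setDouble(1, minLat);
            ps.setDouble(2, maxLat);
            ps.setDouble(3, minLng);
            ps.setDouble(4, maxLng);
            try (ResultSet rs = ps.executeQuery()) {
                boolean first = true;
                while (rs.next()) {
                    if (!first) json.append(",");
                    json.append(String.format("{\"id\":\"%s\",\"lat\":%f,\"lng\":%f}",
                            rs.getString(1), rs.getDouble(2), rs.getDouble(3)));
                    first = false;
                }
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
        json.append("]");
        out.print(json);
    }
}
```

A page script would then call this endpoint with the current map bounds and hand the returned points to the heat map layer.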
4. LEARNING OUTCOMES
Standard learning outcomes:
- Learn to work in a team
- Gain real-world programming experience in a safe setting (no risk of being fired)
- Improve communication skills
- Improve programming and systems integration skills

Additional learning outcomes:
- Command line interface usage
- Shell scripting
- System administration
- SQL queries
- Understanding of various APIs
5. LESSONS LEARNED
Initial attempts:
- Virtual machine implementation is OK for small data set testing
- Hive

Time spent teaching students about related topics:
- Shell scripting
- Hadoop ecosystem
6. ACKNOWLEDGMENTS
The authors thank Dr. Scott Bell for his support as mentor while
Dr. Monismith assumed the role of a client in the first semester of
the project. The authors also thank the LittleFe team for their
work on the LittleFe system and chassis design.