Mansi chowkkar programming_in_data_analytics

Global Warming Analysis using Big Data Techniques
Mansi Chowkkar
x18134599
MSc Data Analytics
PDA
National College of Ireland
Abstract— As data grows rapidly in every field, new
technologies for handling and processing such big data are
also evolving. Hadoop is one of the most popular of these
technologies because it is a distributed, scalable, and
open-source framework. It is based on MapReduce, which
divides a task into smaller chunks and works on them in
parallel. Pig and Hive are built on Hadoop and process data
faster. All of these tools and technologies are freely
available and open source. In this project, Hadoop, HDFS,
MapReduce, HBase, Hive, and Pig are used to analyze global
warming and CO2 emission data. The data is stored and
accessed using the Hadoop distributed system. This research
confirms that the countries with the highest CO2 emissions
and the countries with the highest global warming
temperatures are related. The analysis performed with the
above-mentioned technologies also shows that southeast
countries experience a major impact of global warming on
their average temperature values.
Keywords: Hadoop, Pig, Hive, HDFS, HBase
I. INTRODUCTION
As computation and technology advance, many other problems
are also growing. For example, increasing global warming and
pollution are major concerns that need attention. Developing
countries have increased pollution levels and hence increased
average temperatures. Earlier, temperature prediction carried
little uncertainty, but because of the global warming effect
the uncertainty in weather prediction has increased [1].
To identify the temperature pattern of each country over the
last 20 to 30 years, this study uses global warming data from
all previous years for each country and city, including the
average temperature value and the temperature uncertainty
value. To find whether CO2 emission has an adverse effect on
global warming, CO2 emission data for all countries over the
last 10 years is also analyzed.
Since the data is very large, this project uses Big Data
analytics technologies for efficient and accurate analysis.
Hadoop is a widely used Big Data framework that relies mainly
on the MapReduce programming model, which is popular for data
processing. MapReduce on distributed Hadoop is used to analyze
pollution and global warming issues without hampering
speed [2].
Because Hadoop is widely used and open source, technologies
such as Hive, Pig, and Spark are built on top of it. Pig and
Hive process data queries faster than MapReduce. Hive provides
an SQL-like query language and transforms queries into
MapReduce tasks [3]. Pig runs complex join operations in data
queries. Pig Latin comes with a novel debugging environment,
which is useful when dealing with huge data sets. Pig and Hive
are used for complex query execution. In this project, Pig and
Hive are used to obtain the yearly temperature increase for
each country.
A. Business Queries:
TABLE I: Business query
Query | Language | Framework
Max value of CO2 emission in each country | Java | MapReduce, Hadoop, HDFS, HBase
In every year max value of CO2 | Java | MapReduce, Hadoop, HDFS, HBase
Max country count and its corresponding avg temperature | Pig | Hadoop, HDFS, HBase, Sqoop
Average of temperature of all year data for each country | Pig | Hadoop, HDFS, HBase, Sqoop
List of the countries having temperature greater than 29.5 | Hive, HQL | Hive, Hadoop, HDFS, HBase, Sqoop
List of the countries with average value of the temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
List of the countries with average value of the temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
Top 5 countries with maximum temperature | HQL, Hive | SQL, Hadoop, HDFS, HBase, Sqoop
II. RESEARCH QUESTION
Analyzing global warming and pollution across the world
using Hadoop technology
III. MOTIVATION
Increasing population and pollution lead to global warming,
and this issue needs to be solved. To find a solution, a deep
study should be performed on the available big data across
different parameters. An effective analysis based on Big Data
technology should therefore consider the various situations
and variables in the data [4]. Within Big Data, Hadoop is
widely used and open source, so many technologies that can be
used for data analysis are built on it. MapReduce is an
example of a parallel technology that is based on Hadoop and
provides better performance in big data processing. Data can
be queried through the Hadoop, Hive, Pig, and HDFS
systems [?].
IV. LITERATURE REVIEW
A large body of research in this field provides an overview
for further study. In [5], the author carried out an
experiment to investigate the effect of global warming on the
tourist business using automatic RSS feed big data technology.
Temperature change, global warming, and atmospheric change
were studied on data for 5 top-rated tourist attractions to
show how the climate situation relates to tourism in Thailand.
In another study [6], climate analysis was performed using
big data. The authors explain the big data challenges
addressed by the MERRA analytic service, which enables
MapReduce analysis over NASA analysis data. Because big data
cannot easily be moved from one place to another, big data
technology and cloud computing are used in this study for
climate data analysis. MapReduce provides a high-performance
approach to analysis and has been used in many research
studies on climate data. It has proven to be an effective
technique for large text data, complex data, and binary data.
In this technological era, big data is growing and the
services that handle it face many problems. The map and reduce
technique provides an efficient solution, and it can be
improved by a shuffling strategy. The study in [7] explains
the shuffling technique for MapReduce using word counting.
MapReduce is implemented on Hadoop for word counts, and the
shuffling architecture is tested for performance on repeated
words, duplicate word entries, and sentences from a paragraph.
The study in [8] examined the configuration of Apache Pig and
Apache Hive over HDFS and explained the problems faced during
the experiment. Problems with the configuration of different
jar files and versions are discussed. For example, YARN uses a
new method for tracking jobs compared with a MapReduce job.
This study encourages the installation and use of newer
versions of big data sources and tools. It also compared
Pivotal HAWQ and Apache Hive for word count analysis and found
that Pivotal HAWQ is 7 times faster on 10 million rows of
data. The study further reported no difference between Apache
Pig and Apache Hive performance.
V. METHODOLOGY
This section discusses the process flow of the project.
A. Dataset
1. Data Collection: The global warming data is selected from
the open data sets on Kaggle. The data set contains the
average temperature for all countries from the year 1849 to
2013, including city, longitude, and latitude. The CO2
emission data, organized by year and country with country
code, is collected from the Air pollution world data site.
2. Data Extraction: Two data sets are selected for this
project, one with 237,000 rows and one with 2,400 rows. The
CO2 emission data is selected to study the correlation between
increasing global warming temperature and CO2 emission in the
world. The first data set is downloaded from Kaggle¹ and the
second from the OurWorldinData² website in .csv format. The
data contains unwanted values, missing values, and null
values. The CO2 emission data set has years as column names,
so it is in a transposed format.
3. Data Preprocessing:
The data is cleaned using the R programming language. For the
global warming data set, operations such as extracting from
.csv, removing special characters, and replacing or removing
NA values are performed. The CO2 data set is first cleaned and
then transposed, using an API, so that the year becomes a new
column. After transposing, the year strings carry extra
characters, which are removed using a conditional loop. One of
the before and after stages of data processing is shown in
Fig. 1.
4. Data Exploration and Transformation:
Fig. 1: Before and After cleaning
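The transposing step described above can be illustrated with a minimal sketch. The project performs this step in R; the version below is plain Python and is not the author's code. The column layout (country, code, then one column per year), the "X" prefix on the year names, and the sample values are assumptions made only for illustration.

```python
import re

def wide_to_long(header, rows):
    """Turn one-column-per-year rows into (country, code, year, value) records."""
    # Assumed layout: country, country code, then one column per year.
    year_columns = header[2:]
    records = []
    for row in rows:
        country, code = row[0], row[1]
        for raw_year, value in zip(year_columns, row[2:]):
            if value in ("", "NA"):
                continue  # skip missing measurements
            match = re.search(r"\d{4}", raw_year)
            if match is None:
                continue  # column name carries no recognisable year
            # Keep only the four-digit year, dropping any extra characters.
            records.append((country, code, int(match.group()), float(value)))
    return records

# Hypothetical sample input; names and values are placeholders.
header = ["country", "code", "X2005", "X2006"]
rows = [["Country A", "AAA", "1.2", "NA"], ["Country B", "BBB", "1.9", "2.0"]]
long_records = wide_to_long(header, rows)
```

Each output record holds one country-year pair, which is the shape needed for per-year and per-country queries.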
B. Project Flow
After cleaning, the data is loaded into a MySQL database.
From MySQL it is loaded into HDFS for further query
processing on the data set.
1. MapReduce Process and HDFS:
Using Apache Sqoop, data is loaded from MySQL to HDFS. Pig,
Hive, and Java access the stored data for query purposes.
2. MapReduce Process and Apache Hive:
Hive is used to implement business queries; data is loaded
into Hive using the load command. After loading the data,
queries are written and saved in a .hql file. The output file
is generated through the hive command and stored in HBase.
Hive queries answer three of our research objectives.
3. MapReduce process and Apache Pig:
To answer some of our business objectives, Pig queries are
run on the global warming data set. Pig is a fast big data
technique, so the data set of more than 237,000 rows is used
for the Pig queries with HDFS storage. The load command is
used to load data from HDFS into Pig, and the queries are then
run by executing a .pig file. The output is stored in HBase.
Fig. 2: Flow Diagram
Fig. 3: Data loading to HDFS
¹ https://www.kaggle.com/newyork167/exploring-global-warming/data
² https://ourworldindata.org/co2-and-other-greenhouse-gas-emis
4. Java query using MapReduce Design:
The MapReduce design pattern is used to write three classes:
the mapper maps data types to keys, the reducer implements
the query logic, and the driver executes the main class and
drives the process, for example file reading and writing
tasks. Eclipse is used to execute the MapReduce jobs written
in Java.
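The division of work between mapper, reducer, and driver can be sketched as follows for the "max CO2 emission per country" query. The project implements these classes in Java on Hadoop; this is a plain-Python imitation of the same pattern, not the author's code. The three-field line format (country, year, value) and the sample lines are assumptions.

```python
from collections import defaultdict

def mapper(line):
    """Emit (country, emission) for one CSV line; bad lines emit nothing."""
    fields = line.strip().split(",")
    if len(fields) != 3:  # assumed format: country,year,value
        return []
    country, _year, value = fields
    try:
        return [(country, float(value))]
    except ValueError:
        return []  # header row or non-numeric value

def reducer(country, emissions):
    """Query logic: keep the largest emission value seen for one country."""
    return country, max(emissions)

def driver(lines):
    """Run map, group values by key (the shuffle step), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

# Hypothetical sample input; names and values are placeholders.
sample = ["country,year,co2", "A,2005,1.5", "A,2006,2.5", "B,2005,0.7"]
max_by_country = driver(sample)
```

In Hadoop the grouping step is performed by the framework between the map and reduce phases; here the driver does it in memory so the sketch stays self-contained.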
Fig. 4: Flow Diagram
Fig. 5: Pig query1
C. Technologies and Programming languages used:
The technologies selected for the business queries, and
their suitability for the requirements, are discussed here.
MySQL: MySQL is a database that can handle large amounts of
data. It works in a client-server mode in which data is stored
on the server and can be sent to the client. MySQL is
therefore used in this project to store the data initially and
then transfer it to HDFS.
Hadoop: Hadoop and HDFS are open-source distributed
frameworks for storing and managing data, and they are widely
used for big data. Data processing is done by the MapReduce
framework [9]. A Hadoop cluster comprises many components, for
example MapReduce, YARN, HDFS, and libraries that handle
system failures. Therefore, the Hadoop distributed system is
used in this project to achieve high performance for big data
queries.
HBase: HBase runs on top of the Hadoop cluster. It provides
storage for large tables, which are stored as records. HBase
provides read/write access during data processing, so it is
used for storage in this project.
MapReduce: Hadoop uses MapReduce to process large amounts of
data. Every job is split into map tasks and reduce tasks. The
map and reduce functions and the input and output file
locations are required to complete a MapReduce job [10].
Because MapReduce divides a job into small tasks, it processes
data in less time. In this project, MapReduce is used with
mapper, reducer, and driver classes to process 2 queries.
Sqoop: Sqoop is an interface for transferring bulk data from
one data store to another, for example between a relational
database and Hadoop. It internally uses MapReduce
functionality to run a number of tasks, which provides fast
data transfer and processing. In this project, Sqoop is used
to transfer data from MySQL to HDFS.
Hive: Hive turns Hadoop into a data warehouse, which makes
querying and analyzing data efficient. Hive borrows concepts
such as tables, entity relationships, columns, and primitive
data types from relational database management systems
(RDBMS). Hive has its own declarative language, HiveQL (HQL),
which is very similar to SQL. Hive is unusual in that it
stores the schema inside a database but stores the data itself
on HDFS [11]. Hive is used to process some of the business
queries using HQL.
Pig: Pig, which was developed by Yahoo, uses MapReduce
indirectly. Pig uses Pig Latin, a high-level programming
language similar to SQL. Pig has local and distributed
execution environments. To execute Pig queries, the content
needs to be saved in an input file in a semi-structured
format [12]. In this project, Pig is used to answer some of
the complex queries. The input file is copied to HDFS using
Sqoop, and the Pig command is executed to obtain the query
result from the file. Data is loaded using PigStorage and the
output is stored in HDFS.
R Language: The R programming language is used in RStudio to
clean the data sets. R functions such as gather() and gsub()
are used for cleaning, and the data is then loaded.
Java Language: Hadoop-based MapReduce uses the Java
programming language for its map and reduce functionality. In
this project, the mapper, reducer, and driver are implemented
in Java.
Tableau: The Tableau visualization tool is used to visualize
all query outputs. Different types of graphs are used to
explain the query results.
VI. RESULTS
Query implemented in MapReduce using Java programming
Analysis: Fig. 6 shows that the United Arab Emirates and
Portugal are the countries with the maximum CO2 emission. The
graph also shows that Brazil, Italy, India, and Greenland have
comparatively low CO2 emission values.
Query implemented in MapReduce using Java programming
Fig. 6: Max CO2 emission for each country
Fig. 7: CO2 emission for every year
Analysis: Fig. 7 represents the CO2 emission value for each
year. The graph shows that the rate of CO2 emission has been
decreasing since 2006, so CO2 emission has decreased in recent
years. To resolve the global warming issue, countries are now
actively taking action to decrease pollution.
Fig. 8: Country count with max cities in global warming data
Query implemented in Pig
Analysis: Fig. 8 represents the country count in the global
warming data and its corresponding temperature. Chile, India,
and Bangladesh have the maximum count and hence show a
stronger global warming effect than other countries in the
world.
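The country count behind Fig. 8 amounts to grouping records by country and counting them. The project runs this query in Pig; the sketch below expresses the same idea in plain Python. The (country, temperature) record shape and the sample records are assumptions.

```python
from collections import Counter

def top_country_counts(records, limit=3):
    """Count records per country and return the most frequent ones."""
    counts = Counter(country for country, _temperature in records)
    return counts.most_common(limit)

# Hypothetical sample input; names and values are placeholders.
records = [("A", 20.1), ("A", 21.4), ("A", 19.8),
           ("B", 25.0), ("B", 24.2), ("C", 10.5)]
top_two = top_country_counts(records, limit=2)
```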
Query implemented in Pig
Analysis: Fig. 9 represents the average value of the global
warming temperature in all countries. The result shows that
Sudan, Vietnam, and Somalia have the highest average
temperature, whereas South Korea, Germany, and Ethiopia have
the lowest. Hence, Germany and South Korea do not show major
global warming effects.
Fig. 9: Average value of global warming temperature
Fig. 10: Countries with avg temperature greater than 29.5
Query implemented in Hive
Analysis: Fig. 10 lists all the countries with a global
warming temperature greater than 29.5. Pakistan, Iraq, Iran,
Sudan, and Saudi Arabia are the warmest countries and show a
major global warming effect. In total, 17 countries in the
world have a temperature greater than 29.5, which is a major
concern that needs to be resolved. All the countries listed
are South Eastern countries.
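The logic of this query, averaging per country and then keeping only averages above the threshold, can be sketched as below. The project runs the query in Hive; this is an equivalent in plain Python and not the author's HQL. The record shape and sample values are assumptions; only the 29.5 threshold comes from the text.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 29.5  # threshold used by the query described above

def countries_above(records, threshold=THRESHOLD):
    """Return countries whose average temperature exceeds the threshold."""
    by_country = defaultdict(list)
    for country, temperature in records:
        by_country[country].append(temperature)
    # Average first, then filter, like GROUP BY followed by HAVING in SQL.
    averages = {c: mean(temps) for c, temps in by_country.items()}
    return sorted(c for c, avg in averages.items() if avg > threshold)

# Hypothetical sample input; names and values are placeholders.
records = [("A", 30.0), ("A", 31.0), ("B", 29.0), ("B", 30.0)]
warm = countries_above(records)
```

Filtering on the average rather than on individual readings means that a country with a single hot reading is not listed unless its overall average is also above the threshold.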
Fig. 11: Country count which are listed majorly for global warming
Analysis: Fig. 11 shows that Iraq, Saudi Arabia, Pakistan,
India, China, and Turkey are the countries suffering most
from the global warming problem. These countries also appear
in the CO2 emission country list. Hence CO2 emission is shown
to affect the global warming issue.
Fig. 12: Countries with temperature uncertainty in change in
temperature
Analysis: Fig. 12 shows how stable countries are in terms of
temperature change. India, China, Brazil, and Turkey show
unstable temperatures throughout the year and hence exhibit
uncertainty in temperature change, whereas Egypt, Syria, South
Africa, Kenya, and Japan are more stable countries with the
lowest change in temperature values.
Fig. 13: Top 5 Countries with maximum temperature reached
Analysis: Fig. 13 shows that London, Istanbul, and Kiev have
reached the maximum temperature in past years due to the
global warming effect. London and Berlin do not have a maximum
temperature in all months, but they have reached a peak value
compared with other countries. Istanbul has a maximum
temperature in all months, as the other query results show,
and has also reached the maximum temperature.
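A top-5 query of this kind first finds the peak temperature per place and then selects the largest peaks. The sketch below shows that two-step logic in plain Python; it is not the author's Hive query, and the record shape and sample values are assumptions.

```python
import heapq

def top_n_by_max_temperature(records, n=5):
    """Return the n places with the highest peak temperature."""
    peak = {}
    for place, temperature in records:
        # Track the highest temperature seen so far for each place.
        if place not in peak or temperature > peak[place]:
            peak[place] = temperature
    return heapq.nlargest(n, peak.items(), key=lambda item: item[1])

# Hypothetical sample input; names and values are placeholders.
records = [("A", 10.0), ("A", 25.0), ("B", 30.0), ("C", 5.0)]
top_two = top_n_by_max_temperature(records, n=2)
```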
The above results show that the countries with the maximum
CO2 emission have the major global warming effect. Hence CO2
is the major contributing parameter for global warming
temperature. South-East countries are also suffering most from
global warming issues. Change in temperature is another
variable that is directly proportional to the global warming
problem.
VII. CHALLENGES AND LIMITATION
1. The automation process, in particular storing output into
HBase and automating the shell script.
2. Delays when running queries on big data using MapReduce.
3. Formation of complex queries in Hive.
4. Virtual machine speed when running all technologies.
5. All UI interfaces on OpenStack.
6. Interfacing machine learning techniques with VirtualBox.
VIII. CONCLUSION AND FUTURE WORK
Structured global warming and CO2 data are used to analyze
the countries affected by global warming and the impact of
CO2 emission. CO2 emission is found to have an effect on the
global warming data. In the future, more related data, for
example ozone layer values and other pollutant attributes, can
be collected for further analysis. The Big Data techniques
used here are Hadoop, HBase, Hive, Pig, and MapReduce. Pig
performed better in terms of execution time for big queries
and query complexity. MapReduce performed well in query
processing tasks. In the future, more techniques can be
explored, for example Spark and Impala.
REFERENCES
[1] V. Chang, “Towards data analysis for weather cloud computing,”
Knowledge-Based Systems, vol. 127, pp. 29–45, 2017.
[2] K. A. Ismail, M. Abdul Majid, J. Mohamed Zain, and N. A. Abu
Bakar, “Big Data prediction framework for weather Temperature based
on MapReduce algorithm,” ICOS 2016 - 2016 IEEE Conference on
Open Systems, pp. 13–17, 2017.
[3] “Processing performance on apache pig, apache hive and mysql
cluster,” Proceedings of International Conference on Information,
Communication Technology and System (ICTS) 2014, p. 297, 2014.
[4] S. Navadia, P. Yadav, and J. Thomas, “Measuring and Analyzing
Weather Data,” pp. 414–417, 2017.
[5] C. Yaiprasert, “Climate situation in 5 top-rated tourist attractions in
Thailand investigated by using big data RSS feed and programming,”
Walailak Journal of Science and Technology, vol. 15, no. 5,
pp. 371–385, 2018.
[6] J. L. Schnase, D. Q. Duffy, G. S. Tamkin, D. Nadeau, J. H.
Thompson, C. M. Grieg, M. A. McInerney, and W. P. Webster, “Merra
analytic services: Meeting the big data challenges of climate science
through cloud-enabled climate analytics-as-a-service,” Computers,
Environment and Urban Systems, vol. 61, no. Part B, pp. 198 – 211,
2017.
[7] B. Mandal, S. Sethi, and R. K. Sahoo, “Architecture of efficient word
processing using hadoop mapreduce for big data applications,” in 2015
International Conference on Man and Machine Interfacing (MAMI),
pp. 1–6, Dec 2015.
[8] X. Chen, L. Hu, L. Liu, J. Chang, and D. L. Bone, “Breaking down
hadoop distributed file systems data analytics tools: Apache hive
vs. apache pig vs. pivotal hwaq,” in 2017 IEEE 10th International
Conference on Cloud Computing (CLOUD), pp. 794–797, June 2017.
[9] W.-j. Lu, S. Kawasaki, and J. Sakuma, “Using Fully Homomorphic
Encryption for Statistical Analysis of Categorical, Ordinal and
Numerical Data,” 2017.
[10] Y. Takata, T. Hosaka, and H. Ohnuma, “Boosting Approach To Early
Bankruptcy Prediction From Multiple-Year Financial Statements,”
Asia Pacific Journal of Advanced Business and Social Studies, vol. 3,
no. 2, 2017.
[11] E. L. Lydia and M. B. Swarup, “Big data analysis using hadoop
components like flume, mapreduce, pig and hive,” International
Journal of Computer Science Engineering Technology, vol. 5, no. 11,
p. 390, 2015.
[12] “Comparison of data processing tools in hadoop,” 2016
International Conference on Electrical, Electronics, Communication,
Computer and Optimization Techniques (ICEECCOT), p. 238, 2016.
