Global Warming Analysis using Big Data Techniques
Mansi Chowkkar
x18134599
MSc Data Analytics
PDA
National College of Ireland
Abstract— As data grows rapidly in every field, new
technologies for storing and processing such large volumes
are evolving as well. Hadoop is one of the most popular of
these because it is a distributed, scalable, open-source
framework. It is based on MapReduce, which divides a job
into smaller tasks that are processed in parallel. Pig and
Hive are built on Hadoop and provide higher-level query
languages on top of it. All of these tools are freely
available and open source. In this project, Hadoop, HDFS,
MapReduce, HBase, Hive, and Pig are used to analyze global
warming and CO2 emission data. The data is stored in and
accessed through the Hadoop distributed file system. The
analysis indicates that the countries with the highest CO2
emission and the countries with the highest temperatures
are related. It also indicates that south-eastern countries
are the most affected by global warming in terms of average
temperature.
Keywords: Hadoop, Pig, Hive, HDFS, HBase
I. INTRODUCTION
As computation and technology advance, other problems are
growing as well. For example, rising global warming and
pollution are major concerns that need attention. Many
developing countries have increased pollution levels and,
with them, increased average temperatures. Earlier,
temperature prediction involved little uncertainty, but
global warming has increased the uncertainty of weather
prediction [1].
To identify the temperature pattern of each country over the
last 20 to 30 years, this study uses historical global
warming data that records, for each country and city, the
average temperature and the average temperature uncertainty.
To find whether CO2 emission has an adverse effect on global
warming, CO2 emission data for all countries over the last
10 years is also analyzed.
Since the data is large, this project uses Big Data
analytics tools for efficient and accurate analysis. Hadoop
is a widely used Big Data framework whose main processing
model is MapReduce. MapReduce on a distributed Hadoop
cluster is used to analyze pollution and global warming
without sacrificing speed [2].
Because Hadoop is widely used and open source, technologies
such as Hive, Pig, and Spark have been built to work on top
of it. Pig and Hive let queries be written far more quickly
than hand-coded MapReduce. Hive provides an SQL-like query
language and translates each query into MapReduce tasks [3].
Pig runs complex join operations on large data sets. Pig
Latin comes with a novel debugging environment, which is
useful when dealing with huge data sets. Pig and Hive are
therefore used here for complex query execution: the
temperature increase in each country for every year is
obtained with Pig and Hive.
A. Business Queries:
TABLE I: Business queries
Query | Language | Framework
Max value of CO2 emission in each country | Java | MapReduce, Hadoop, HDFS, HBase
Max value of CO2 in every year | Java | MapReduce, Hadoop, HDFS, HBase
Max country count and its corresponding avg temperature | Pig | Hadoop, HDFS, HBase, Sqoop
Average temperature over all years for each country | Pig | Hadoop, HDFS, HBase, Sqoop
List of the countries having temperature greater than 29.5 | Hive, HQL | Hive, Hadoop, HDFS, HBase, Sqoop
List of the countries with average value of the temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
List of the countries with average value of the temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
Top 5 countries with maximum temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
II. RESEARCH QUESTION
Analyzing global warming and pollution across the world
using Hadoop technology
III. MOTIVATION
Increasing population and pollution contribute to global
warming, and this problem needs to be addressed. Finding a
solution requires a deep study of the available big data
across different parameters. Big Data technology should
therefore be used to analyze the data effectively, taking
the various situations and variables in the data into
account [4]. Hadoop is widely used for Big Data and is open
source, so many technologies that can be used for data
analysis are built on it. MapReduce is a parallel processing
model implemented in Hadoop that provides good performance
for big data processing. The data can be queried through
Hadoop, Hive, Pig, and HDFS [?].
IV. LITERATURE REVIEW
A considerable amount of research has been done in this
field and provides an overview for further study. In [5],
the author investigated the climate situation in the five
top-rated tourist attractions in Thailand using big data
collected automatically from RSS feeds. Temperature change,
global warming, and atmospheric change are studied for these
five destinations to show how climate relates to tourism in
Thailand.
In another study [6], climate analysis is performed using
big data. The authors describe the big data challenges
addressed by MERRA Analytic Services, which enable MapReduce
analysis over NASA's MERRA data. Big data cannot easily be
moved from one place to another, so big data technology and
cloud computing are used in that study for climate data
analysis. MapReduce provides a high-performance approach to
analysis and has been used in many research studies on
climate data. It has proven to be an effective technique for
large text data, complex data, and binary data.
In this technological era, the volume of big data is
increasing and the services that handle it face many
problems. The map-and-reduce technique offers an efficient
solution, and it can be improved by a shuffling strategy.
The study in [7] explains a shuffling technique for
MapReduce applied to word counting, with MapReduce
implemented on Hadoop. The shuffling architecture is tested
for performance on repeated words, duplicate word entries,
and sentences from a paragraph.
The study in [8] examined the configuration of Apache Pig
and Apache Hive over HDFS and explained the problems faced
during the experiment, including problems with configuring
different jar files and versions. For example, YARN uses a
new method for tracking jobs compared with a classic
MapReduce job. The study motivates installing and using
newer versions of big data tools. It also compared Pivotal
HAWQ and Apache Hive for word count analysis and found
Pivotal HAWQ to be 7 times faster on 10 million rows of
data. It further reported no difference in performance
between Apache Pig and Apache Hive.
V. METHODOLOGY
This section discusses the process flow of the project.
A. Dataset
1. Data Collection: The global warming data is taken from
Kaggle, an open data source. The data set contains the
average temperature for all countries from 1849 to 2013,
together with city, longitude, and latitude. A second data
set, with CO2 emission by year and country (with country
code), is collected from a world air pollution data site.
2. Data Extraction: Two data sets are selected for this
project, one with 237,000 rows and one with 2,400 rows. The
CO2 emission data is selected to study the correlation
between rising global temperature and CO2 emission in the
world. The first data set is downloaded from Kaggle¹ and
the second from the OurWorldinData² website in .csv format.
The data contains unwanted values, missing values, and null
values. The second data set, for CO2 emission, has the years
as column names, i.e. it is in transposed (wide) format.
3. Data Preprocessing:
The data is cleaned using the R programming language. For
the global warming data set, the operations include reading
the .csv file, removing special characters, and replacing or
removing NA values. The CO2 data set is first cleaned and
then transposed so that the years become values of a new
column, using a library function. After transposing, the
year strings carry extra characters, which are removed using
a conditional loop. One example of the data before and after
processing is shown in Fig. 1.
Fig. 1: Before and After cleaning
4. Data Exploration and Transformation:
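The transposition of the CO2 data set described in the
preprocessing step can be illustrated with a short sketch.
The project performs this step in R; the Java version below
is only an illustration of the same idea. The column layout
(country, code, then one column per year), the "X" prefix on
the year columns, and the sample values are all assumptions
made for the example and are not taken from the actual file.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a wide-to-long reshape for the CO2 data set.
// Assumed (hypothetical) layout: country,code,<year1>,<year2>,...
// Fields are split on plain commas, so quoted commas are not handled.
class Co2Reshape {

    // Turns each wide row into one "country,code,year,value" row per year.
    static List<String> wideToLong(List<String> csvLines) {
        List<String> out = new ArrayList<>();
        String[] header = csvLines.get(0).split(",", -1);
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] cells = line.split(",", -1);
            for (int col = 2; col < header.length && col < cells.length; col++) {
                String value = cells[col].trim();
                if (value.isEmpty() || value.equalsIgnoreCase("NA")) {
                    continue; // skip missing values
                }
                // Remove any extra non-digit characters attached to the year.
                String year = header[col].replaceAll("[^0-9]", "");
                out.add(cells[0] + "," + cells[1] + "," + year + "," + value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not real emission values.
        List<String> wide = List.of(
                "country,code,X2005,X2006",
                "CountryA,AAA,1.1,1.2",
                "CountryB,BBB,NA,1.9");
        wideToLong(wide).forEach(System.out::println);
    }
}
```

With the sample input, the two year columns become three
long-format rows, because the missing value for CountryB in
2005 is dropped.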
B. Project Flow
After cleaning, the data is loaded into a MySQL database.
From MySQL it is loaded into HDFS so that the queries can be
run on the data set.
1. MapReduce Process and HDFS:
Using Apache Sqoop, the data is loaded from MySQL into
HDFS. Pig, Hive, and Java access the stored data to run the
queries.
2. MapReduce Process and Apache Hive:
Hive is used to implement some of the business queries. The
data is loaded into Hive using the load command. The queries
are then written and saved in an .hql file. Running this
file through the hive command generates an output file,
which is stored in HBase. The Hive queries answer three of
our research objectives.
3. MapReduce process and Apache Pig:
To answer some of our business objectives, Pig queries are
run on the global warming data set. Pig handles large data
sets well, so the full data set of more than 237,000 rows,
stored in HDFS, is queried with Pig. The load command is
used to load the data from HDFS into Pig, and the queries
are then run by executing a .pig file. The output is stored
in HBase.
Fig. 2: Flow Diagram
Fig. 3: Data loading to HDFS
¹ https://www.kaggle.com/newyork167/exploring-global-warming/data
² https://ourworldindata.org/co2-and-other-greenhouse-gas-emis
4. Java query using MapReduce Design:
The MapReduce design pattern is used to write three
classes: the mapper maps each input record to a key and a
typed value, the reducer implements the query logic, and the
driver runs the main class and drives the process, for
example the file reading and writing tasks. Eclipse is used
to develop and run the Java MapReduce code.
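The division of work between mapper, reducer, and driver
can be sketched for the first business query (maximum CO2
emission per country). The sketch below is plain Java and
deliberately does not use the Hadoop API, so it is not the
project's actual code; the record layout "country,year,co2"
and the sample values are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the mapper / reducer / driver split.
// Assumed (hypothetical) record layout: country,year,co2
class MaxCo2Sketch {

    // Mapper: parse one record and emit the pair (country, co2).
    static Map.Entry<String, Double> map(String record) {
        String[] f = record.split(",");
        return new SimpleEntry<>(f[0].trim(), Double.parseDouble(f[2].trim()));
    }

    // Reducer: the query logic, here the maximum of all values of one key.
    static double reduce(List<Double> values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Driver: feeds the input to the mapper, groups the emitted pairs by
    // key (the shuffle that Hadoop does between map and reduce), and
    // calls the reducer once per key.
    static Map<String, Double> run(List<String> records) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String r : records) {
            Map.Entry<String, Double> kv = map(r);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        grouped.forEach((country, values) -> result.put(country, reduce(values)));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not real emission values.
        List<String> sample = List.of(
                "CountryA,2005,1.1", "CountryA,2006,1.2", "CountryB,2005,1.9");
        System.out.println(run(sample)); // {CountryA=1.2, CountryB=1.9}
    }
}
```

Emitting the year instead of the country as the key in the
mapper would give the second query (maximum CO2 emission in
every year) with the same reducer.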
Fig. 4: Flow Diagram
Fig. 5: Pig query1
C. Technologies and Programming Languages Used:
The technologies selected for the business queries, and the
reasons each one suits its task, are discussed here.
MySQL: MySQL is a relational database that can handle large
amounts of data. It works in client-server mode: the data is
stored on the server and sent to the client on request.
MySQL is therefore used in this project for initially
storing the data before transferring it to HDFS.
Hadoop: Hadoop and HDFS are open-source distributed
frameworks for storing and managing data, and they are
widely used for big data. The data is processed by the
MapReduce framework [9]. A Hadoop cluster comprises many
components, for example MapReduce, YARN, HDFS, and
libraries that handle system failures. Therefore, the Hadoop
distributed system is used in this project to achieve high
performance for big data queries.
HBase: HBase runs on top of the Hadoop cluster. It provides
storage for large tables, whose contents are stored as
records. HBase provides read/write access during data
processing, hence it is used in this project for storage.
MapReduce: Hadoop uses MapReduce for processing large
amounts of data. Every job is split into map tasks and
reduce tasks. A MapReduce job requires the map and reduce
functions and the input and output file locations [10].
Because MapReduce splits a job into small tasks that run in
parallel, it processes data in less time. In this project,
MapReduce is used with mapper, reducer, and driver classes
to process two queries.
Sqoop: Sqoop is a tool for transferring bulk data between
relational databases and Hadoop. It internally runs
MapReduce tasks to parallelize the transfer, which provides
high data transfer throughput. In this project, Sqoop is
used for transferring data from MySQL to HDFS.
Hive: Hive turns Hadoop into a data warehouse, which makes
querying and analyzing data efficient. Hive borrows concepts
such as tables, entity relationships, columns, and primitive
data types from relational database management systems
(RDBMS). It has its own declarative language, HiveQL (HQL),
which is very similar to SQL. Hive is distinctive in that it
stores the schema in a database while storing the data
itself on HDFS [11]. In this project, Hive is used to
process some of the business queries in HQL.
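Two of the Hive business queries, the list of countries
above a temperature threshold and the top-N countries by
temperature, amount to a filter and a sorted limit. The
sketch below expresses them in Java rather than HQL, for
consistency with the MapReduce code of this project. The
in-memory map stands in for a Hive table; the table name,
column names, and sample values in it are hypothetical, and
the HQL in the comments is only an approximation of what
such queries could look like.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Java analogue of two Hive-style queries over a hypothetical
// table t(country, temperature), held here as a Map.
class HiveQuerySketch {

    // Roughly: SELECT country FROM t WHERE temperature > threshold
    // (results are sorted by name only to make the output stable).
    static List<String> warmerThan(Map<String, Double> temps, double threshold) {
        return temps.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    // Roughly: SELECT country FROM t ORDER BY temperature DESC LIMIT n
    static List<String> topN(Map<String, Double> temps, int n) {
        return temps.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical average temperatures, not values from the data set.
        Map<String, Double> temps = Map.of(
                "CountryA", 30.1, "CountryB", 12.0, "CountryC", 29.5, "CountryD", 31.0);
        System.out.println(warmerThan(temps, 29.5)); // [CountryA, CountryD]
        System.out.println(topN(temps, 2));          // [CountryD, CountryA]
    }
}
```

Note that the comparison is strict, so a country at exactly
29.5 is not listed.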
Pig: Pig, which was developed at Yahoo, uses MapReduce
indirectly: Pig scripts are compiled into MapReduce jobs.
Pig uses Pig Latin, a high-level language with SQL-like
operations. Pig has local and distributed execution modes.
To execute Pig queries, the content needs to be saved in an
input file, which may be in semi-structured format [12]. In
this project, Pig is used for some of the more complex
queries. The input file is copied to HDFS using Sqoop, and
the pig command is executed to compute the query result
from the file. Data is loaded using PigStorage and the
output is stored in HDFS.
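The Pig query for the average temperature of each country
follows a load, group, and aggregate pattern. As a rough
Java analogue (not Pig Latin, and not the project's script),
the sketch below groups comma-separated records by country
and averages the temperature. The record layout
"country,temperature" and the sample values are assumptions
made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Java analogue of a Pig-style GROUP BY country followed by AVG.
// Assumed (hypothetical) record layout: country,temperature
class PigAverageSketch {

    static Map<String, Double> averageByCountry(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))
                // Skip records whose temperature field is missing.
                .filter(f -> f.length >= 2 && !f[1].trim().isEmpty())
                .collect(Collectors.groupingBy(
                        f -> f[0].trim(),
                        TreeMap::new,
                        Collectors.averagingDouble(f -> Double.parseDouble(f[1].trim()))));
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not values from the data set.
        List<String> sample = List.of(
                "CountryA,25.0", "CountryA,27.0", "CountryB,9.0", "CountryB,");
        System.out.println(averageByCountry(sample)); // {CountryA=26.0, CountryB=9.0}
    }
}
```

Records with a missing temperature are dropped before
averaging, which mirrors the removal of missing values in
the cleaning step.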
R Language: The R programming language is used in RStudio
for cleaning the data sets. R functions such as gather() and
gsub() are used for cleaning, and the data is then loaded.
Java Language: Hadoop MapReduce uses the Java programming
language for the map and reduce functionality. In this
project, the mapper, reducer, and driver are implemented in
Java.
Tableau: Tableau is used to visualize the outputs of all
queries. Different types of graphs are used to explain the
query results.
VI. RESULTS
Query implemented in MapReduce using Java programming
Fig. 6: Max CO2 emission for each country
Analysis: Fig. 6 shows that the United Arab Emirates and
Portugal are the countries with the maximum CO2 emission.
The graph also shows that Brazil, Italy, India, and
Greenland have comparatively low CO2 emission values.
Query implemented in MapReduce using Java programming
Fig. 7: CO2 emission for every year
Analysis: Fig. 7 shows the CO2 emission value for all the
years. The graph shows that CO2 emission has been decreasing
since 2006. This suggests that countries are now actively
taking action to reduce pollution and address the global
warming issue.
Fig. 8: Country count with max cities in global warming data
Query implemented in Pig
Analysis: Fig. 8 shows the record count per country in the
global warming data and the corresponding temperature.
Chile, India, and Bangladesh have the highest counts, hence
they show a greater global warming effect than other
countries in the world.
Query implemented in Pig
Fig. 9: Average value of global warming temperature
Analysis: Fig. 9 shows the average temperature in all
countries. Sudan, Vietnam, and Somalia show the highest
average temperature, whereas South Korea, Germany, and
Ethiopia have the lowest. Hence, Germany and South Korea do
not show major global warming effects.
Fig. 10: Countries with avg temperature greater than 29.5
Query implemented in Hive
Analysis: Fig. 10 lists all the countries with a temperature
greater than 29.5. Pakistan, Iraq, Iran, Sudan, and Saudi
Arabia are among the warmest countries and show a major
global warming effect. In total, 17 countries have a
temperature greater than 29.5, which is a major concern that
needs to be addressed. All the countries listed are
south-eastern countries.
Fig. 11: Country count which are listed majorly for global warming
Query implemented in Hive
Analysis: Fig. 11 shows that Iraq, Saudi Arabia, Pakistan,
India, China, and Turkey are among the countries most
affected by global warming. These countries also appear in
the CO2 emission country list. This overlap suggests that
CO2 emission affects global warming.
Query implemented in Hive
Fig. 12: Countries with temperature uncertainty in change in temperature
Analysis: Fig. 12 shows how stable each country is in terms
of temperature change. India, China, Brazil, and Turkey show
unstable temperatures throughout the year, so these
countries exhibit high uncertainty in temperature change. In
contrast, Egypt, Syria, South Africa, Kenya, and Japan are
more stable and have the lowest change in temperature
values.
Fig. 13: Top 5 Countries with maximum temperature reached
Query implemented in Hive
Analysis: Fig. 13 shows that London, Istanbul, and Kiev have
reached the maximum temperatures in past years due to the
global warming effect. London and Berlin do not have high
temperatures in all months, but they have reached peak
values compared with other places. Istanbul has high
temperatures in all months, as the other query results show,
and has also reached the maximum temperature.
Taken together, the above results indicate that the
countries with the maximum CO2 emission also show the major
global warming effect. Hence CO2 appears to be a major
contributing parameter for global warming temperature. The
results also indicate that south-eastern countries suffer
most from global warming. The change in temperature is
another variable that rises together with the global warming
problem.
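The relationship between CO2 emission and temperature is
argued above from the overlap between country lists. One way
such a relationship could be quantified, which this study
does not do, is the Pearson correlation coefficient over
per-country pairs of values. The sketch below is a minimal
Java illustration under that assumption; the method name and
the input arrays are hypothetical and no results from the
data set are implied.

```java
// Minimal sketch of the Pearson correlation coefficient.
// x[i] and y[i] are assumed to be paired values for the same country,
// for example a CO2 emission value and an average temperature.
class CorrelationSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;   // sum of products of deviations
        double varX = 0.0;  // sum of squared deviations of x
        double varY = 0.0;  // sum of squared deviations of y
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Result is NaN if either input is constant (zero variance).
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up numbers that only demonstrate the call.
        double[] co2 = {1.0, 2.0, 3.0, 4.0};
        double[] temp = {2.0, 4.0, 6.0, 8.0};
        System.out.println(pearson(co2, temp));
    }
}
```

A value close to 1 would indicate that higher emission goes
together with higher temperature across countries, while a
value close to 0 would indicate no linear relationship.
Correlation alone would not establish that one causes the
other.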
VII. CHALLENGES AND LIMITATIONS
1. Automating the process, in particular storing output
into HBase and automating the shell script.
2. Delays when running queries on big data using MapReduce.
3. Formulating complex queries in Hive.
4. Virtual machine speed when running all the technologies.
5. All UI interfaces on OpenStack.
6. Interfacing machine learning techniques with VirtualBox.
VIII. CONCLUSION AND FUTURE WORK
Structured global warming and CO2 data are used to analyze
which countries are most affected by global warming and the
impact of CO2 emission. It is found that CO2 emission has an
effect on global warming. In the future, more related data,
for example ozone layer values and other pollutant
attributes, can be collected for further analysis. The Big
Data techniques used here are Hadoop, HBase, Hive, Pig, and
MapReduce. Pig was observed to perform better in terms of
execution time for large queries and in handling query
complexity. MapReduce performed well in query processing
tasks. In the future, more techniques, for example Spark and
Impala, can be explored.
REFERENCES
[1] V. Chang, “Towards data analysis for weather cloud computing,”
Knowledge-Based Systems, vol. 127, pp. 29–45, 2017.
[2] K. A. Ismail, M. Abdul Majid, J. Mohamed Zain, and N. A. Abu
Bakar, “Big Data prediction framework for weather Temperature based
on MapReduce algorithm,” ICOS 2016 - 2016 IEEE Conference on
Open Systems, pp. 13–17, 2017.
[3] “Processing performance on apache pig, apache hive and mysql
cluster,” in Proceedings of International Conference on Information,
Communication Technology and System (ICTS) 2014, p. 297, 2014.
[4] S. Navadia, P. Yadav, and J. Thomas, “Measuring and Analyzing
Weather Data,” pp. 414–417, 2017.
[5] C. Yaiprasert, “Climate situation in 5 top-rated tourist attractions in
Thailand investigated by using big data RSS feed and programming,”
Walailak Journal of Science and Technology, vol. 15, no. 5,
pp. 371–385, 2018.
[6] J. L. Schnase, D. Q. Duffy, G. S. Tamkin, D. Nadeau, J. H.
Thompson, C. M. Grieg, M. A. McInerney, and W. P. Webster, “Merra
analytic services: Meeting the big data challenges of climate science
through cloud-enabled climate analytics-as-a-service,” Computers,
Environment and Urban Systems, vol. 61, no. Part B, pp. 198–211,
2017.
[7] B. Mandal, S. Sethi, and R. K. Sahoo, “Architecture of efficient word
processing using hadoop mapreduce for big data applications,” in 2015
International Conference on Man and Machine Interfacing (MAMI),
pp. 1–6, Dec 2015.
[8] X. Chen, L. Hu, L. Liu, J. Chang, and D. L. Bone, “Breaking down
hadoop distributed file systems data analytics tools: Apache hive
vs. apache pig vs. pivotal hwaq,” in 2017 IEEE 10th International
Conference on Cloud Computing (CLOUD), pp. 794–797, June 2017.
[9] W.-j. Lu, S. Kawasaki, and J. Sakuma, “Using Fully Homomorphic
Encryption for Statistical Analysis of Categorical, Ordinal and
Numerical Data,” 2017.
[10] Y. Takata, T. Hosaka, and H. Ohnuma, “Boosting Approach To Early
Bankruptcy Prediction From Multiple-Year Financial Statements,”
Asia Pacific Journal of Advanced Business and Social Studies, vol. 3,
no. 2, 2017.
[11] E. L. Lydia and M. B. Swarup, “Big data analysis using hadoop
components like flume, mapreduce, pig and hive,” International
Journal of Computer Science Engineering Technology, vol. 5, no. 11,
p. 390, 2015.
[12] “Comparison of data processing tools in hadoop,” in 2016
International Conference on Electrical, Electronics, Communication,
Computer and Optimization Techniques (ICEECCOT), p. 238, 2016.

Introduction to Apache Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Enhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce TechniqueEnhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce Technique
 
GHCNPaper3
GHCNPaper3GHCNPaper3
GHCNPaper3
 
Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
IRJET - Weather Log Analysis based on Hadoop Technology
IRJET - Weather Log Analysis based on Hadoop TechnologyIRJET - Weather Log Analysis based on Hadoop Technology
IRJET - Weather Log Analysis based on Hadoop Technology
 
2. Develop a MapReduce program to calculate the frequency of a given word in ...
2. Develop a MapReduce program to calculate the frequency of a given word in ...2. Develop a MapReduce program to calculate the frequency of a given word in ...
2. Develop a MapReduce program to calculate the frequency of a given word in ...
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
 
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTERLOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentation
 
IJET-V3I2P14
IJET-V3I2P14IJET-V3I2P14
IJET-V3I2P14
 

More from MansiChowkkar

M sc research_project_report_x18134599
M sc research_project_report_x18134599M sc research_project_report_x18134599
M sc research_project_report_x18134599
MansiChowkkar
 
X18134599 mansi chowkkar
X18134599 mansi chowkkarX18134599 mansi chowkkar
X18134599 mansi chowkkar
MansiChowkkar
 
Regression project
Regression projectRegression project
Regression project
MansiChowkkar
 
Data visualisation magzine
Data visualisation magzineData visualisation magzine
Data visualisation magzine
MansiChowkkar
 
Safe machinelearning
Safe machinelearningSafe machinelearning
Safe machinelearning
MansiChowkkar
 
Mansi_BreastCancerDetection
Mansi_BreastCancerDetectionMansi_BreastCancerDetection
Mansi_BreastCancerDetection
MansiChowkkar
 

More from MansiChowkkar (6)

M sc research_project_report_x18134599
M sc research_project_report_x18134599M sc research_project_report_x18134599
M sc research_project_report_x18134599
 
X18134599 mansi chowkkar
X18134599 mansi chowkkarX18134599 mansi chowkkar
X18134599 mansi chowkkar
 
Regression project
Regression projectRegression project
Regression project
 
Data visualisation magzine
Data visualisation magzineData visualisation magzine
Data visualisation magzine
 
Safe machinelearning
Safe machinelearningSafe machinelearning
Safe machinelearning
 
Mansi_BreastCancerDetection
Mansi_BreastCancerDetectionMansi_BreastCancerDetection
Mansi_BreastCancerDetection
 

Recently uploaded

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 

Recently uploaded (20)

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 

Global Warming Analysis using Big Data Techniques
Mansi Chowkkar
x18134599
MSc Data Analytics
PDA
National College of Ireland

Abstract— As data grows rapidly in every field, new technologies for storing and processing such enormous volumes are also evolving. Hadoop is one of the most popular of these technologies because it is a distributed, scalable, open-source framework. It is based on MapReduce, which divides a task into smaller chunks and processes them in parallel. Pig and Hive are built on Hadoop and make data processing faster. All these tools and technologies are freely available and open source. In this project Hadoop, HDFS, MapReduce, HBase, Hive, and Pig are used to analyze global warming and CO2 emission data. The data is stored and accessed through the Hadoop distributed system. The research confirms that the countries with the highest CO2 emissions and the countries with the highest global warming temperatures are related. The analysis performed with the above technologies also shows that southeast countries experience a major impact of global warming on their average temperature.

Keywords: Hadoop, Pig, Hive, HDFS, HBase

I. INTRODUCTION

As computation and technology advance, many other problems are growing. For example, increased global warming and pollution are major concerns that need attention. All the developing countries have increased pollution levels and hence increased average temperatures. Earlier, temperature prediction involved little uncertainty, but due to the global warming effect the uncertainty in weather prediction has increased [1].

To predict the pattern of each country over the last 20 to 30 years across the entire world, this study uses global warming data from all previous years for each country and city, with the average temperature value and the temperature uncertainty value.
Also, to find whether CO2 emission has any adverse effect on global warming, CO2 emission data for all countries from the last 10 years is analyzed. Since the data is very large, this project uses big data analytics tools for efficient and accurate analysis. Hadoop is a widely used big data framework whose main processing model is MapReduce. MapReduce on distributed Hadoop is used to analyze pollution and global warming issues without hampering speed [2].

As Hadoop is widely used and open source, technologies such as Hive, Pig, and Spark are built on top of it. Pig and Hive let data queries be written and run more quickly than hand-coded MapReduce jobs. Hive provides an SQL-like query language and transforms each query into MapReduce tasks [3]. Pig runs complex join operations in data queries. Pig Latin comes with a novel debugging environment, which is useful when dealing with huge data sets. Pig and Hive are used here for complex query execution: to predict the temperature increase in each country every year, results are obtained with Pig and Hive.

A. Business Queries:

TABLE I: Business queries
Query | Language | Framework
Max value of CO2 emission in each country | Java | MapReduce, Hadoop, HDFS, HBase
Max value of CO2 in every year | Java | MapReduce, Hadoop, HDFS, HBase
Max country count and its corresponding avg temperature | Pig | Hadoop, HDFS, HBase, Sqoop
Average temperature over all years for each country | Pig | Hadoop, HDFS, HBase, Sqoop
List of the countries with temperature greater than 29.5 | Hive, HQL | Hive, Hadoop, HDFS, HBase, Sqoop
List of the countries with average value of the temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
List of the countries with average value of the temperature | Hive, HQL | SQL, Hadoop, HDFS, HBase, Sqoop
Top 5 countries with maximum temperature | HQL, Hive | SQL, Hadoop, HDFS, HBase, Sqoop

II. RESEARCH QUESTION

Analyzing global warming and pollution across the world using Hadoop technology.

III. MOTIVATION

Increasing population and pollution lead to global warming, and this issue needs to be solved. Finding a solution requires a deep study of the available big data across different parameters. Therefore an effective analysis based on big data technology should be performed, considering various situations and variables from the data [4]. Hadoop is widely used in big data and is open source, so many technologies that can be used for data analysis are built on it. MapReduce is an example of a mature parallel technology that is based on Hadoop and provides better performance in big data processing. Data can be queried through the Hadoop, Hive, Pig, and HDFS systems [?].

IV. LITERATURE REVIEW

A large body of research in this field provides an overview for further study. In [5], the author carried out an experiment to investigate the effect of global warming on the tourist business in the top 5 tourist destinations, using automatic RSS feed big data technology. Temperature change, global warming, and atmospheric change are studied on data from the top 5 tourist destinations to show how climate change affects tourist attraction in Thailand.

In another study [6], climate analysis was done using big data. The authors explained the big data challenges addressed by the MERRA analytic service, which enables MapReduce analysis over NASA reanalysis data. Because big data cannot easily be moved from one place to another, big data technology and cloud computing are used in that study for climate data analysis. MapReduce provides a high-performance approach to analysis and has been used in many research studies on climate data. It has proven to be an effective technique for large text data, complex data, and binary data.

In this technological era big data keeps growing, and the services that handle it face many problems. An efficient solution is the map and reduce technique, which can be improved by a shuffling strategy. The study in [7] explains a shuffling technique for MapReduce applied to word counting. MapReduce is implemented on Hadoop for word counts, and the shuffling architecture is performance tested on repeated words, duplicate word entries, and sentences from a paragraph.

The study in [8] examined the configuration of Apache Pig and Apache Hive over HDFS and explained the problems faced during the experiment.
Problems with the configuration of different jar files and versions are discussed. For example, YARN tracks jobs differently from a classic MapReduce job. This study encourages the installation and use of newer versions of big data sources and tools. It compared Pivotal HAWQ and Apache Hive on word count analysis and found that Pivotal HAWQ is 7 times faster on 10 million rows of data. The study also found no difference between Apache Pig and Apache Hive performance.

V. METHODOLOGY

This section discusses the process flow for the project.

A. Dataset

1. Data Collection: The global warming data is taken from Kaggle, an open data source. The data set gives the average temperature for all countries from the year 1849 to 2013, including city, longitude, and latitude. A second data set, with CO2 emission by year and country together with the country code, is collected from an air pollution world data site.

2. Data Extraction: Two data sets are selected for this project, one with 237,000 rows and one with 2,400 rows. The CO2 emission data is selected to study the correlation between increasing global warming temperature and CO2 emission in the world. The first data set is downloaded from Kaggle¹ and the second from the OurWorldinData² website, both in .csv format. The data contains unwanted values, missing values, and null values. The CO2 emission data set has the years as column names, that is, it is in a transposed format.

3. Data Preprocessing: The data is cleaned using the R programming language. In the global warming data set, operations such as reading from .csv, removing special characters, and replacing or removing NA values are performed. The CO2 data set is first cleaned and then transposed so that the year becomes a new column, using an API. After transposing, the year strings carry extra characters, which are removed using a conditional loop. One before-and-after stage of the data processing is shown in Fig. 1.
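The transposition step described above can be sketched as follows. This is a minimal Python illustration of the idea, not the author's R code: the column names `Country` and `Code`, the `X`-prefixed year headers, and the sample values are all assumptions made for the example.

```python
import csv
import io
import re


def wide_to_long(wide_csv_text, id_columns=("Country", "Code")):
    """Turn one column per year into one row per (country, year)."""
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    rows = []
    for record in reader:
        for column, value in record.items():
            if column in id_columns:
                continue
            # Strip extra non-digit characters from the year label ("X1990" -> "1990").
            year = re.sub(r"\D", "", column)
            # Skip missing values such as "" or "NA".
            if value is None or value.strip() in ("", "NA"):
                continue
            row = {c: record[c] for c in id_columns}
            row["Year"] = int(year)
            row["CO2"] = float(value)
            rows.append(row)
    return rows


# Hypothetical sample in the wide layout (years as column names).
sample = "Country,Code,X1990,X1991\nIreland,IRL,8.9,NA\nIndia,IND,0.7,0.8\n"
long_rows = wide_to_long(sample)
for row in long_rows:
    print(row)
```

Each output row carries one country, one year, and one value, which is the shape that later per-country and per-year queries need.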
4. Data Exploration and Transformation:

Fig. 1: Before and After cleaning

B. Project Flow

After cleaning, the data is loaded into a MySQL database. From MySQL it is loaded to HDFS for further query processing on the data set.

Fig. 2: Flow Diagram
Fig. 3: Data loading to HDFS

1. MapReduce Process and HDFS: Using Apache Sqoop, data is loaded from MySQL to HDFS. Pig, Hive, and Java access the stored data for querying.

2. MapReduce Process and Apache Hive: Hive is used to implement business queries, and data is loaded into Hive using the load command. After loading, the queries are written and saved in an .hql file. The hive command generates the output file, which is stored in HBase. The Hive queries answer three of our research objectives.

3. MapReduce Process and Apache Pig: Pig queries are used on the global warming data set to answer some of our business objectives. Pig is a faster big data technique, hence the data set of more than 237,000 rows is used for the Pig queries with HDFS storage. The load command loads data from HDFS into Pig, and the queries are then run by executing a .pig file. The output is stored in HBase.

4. Java query using MapReduce Design: The MapReduce design pattern is used to write three classes: the mapper maps data types to keys, the reducer implements the query logic, and the driver runs the main class and drives the process, for example the file reading and writing tasks. Eclipse is used to run the MapReduce Java code.

Fig. 4: Flow Diagram
Fig. 5: Pig query1

¹ https://www.kaggle.com/newyork167/exploring-global-warming/data
² https://ourworldindata.org/co2-and-other-greenhouse-gas-emis

C. Technologies and Programming languages used:

This section discusses the technologies selected for the business queries according to requirement and suitability.

MySQL: MySQL is a database that handles large data. It works in client-server mode: data is stored on the server and can be sent to the client location. Hence MySQL is used in this project for initially storing the data and then transferring it to HDFS.

Hadoop: Hadoop and HDFS are open-source distributed frameworks for storing and managing data, and they are widely used for big data. Data processing is done by the MapReduce framework [9]. A Hadoop cluster comprises many components, for example MapReduce, YARN, HDFS, and libraries that handle system failure. Therefore the Hadoop distributed system is used in this project to achieve high performance for big data queries.

HBase: HBase runs on top of the Hadoop cluster. It provides storage for large tables, which are stored as records. HBase provides read/write access while data is being processed, hence it is used in this project for storage purposes.

MapReduce: Hadoop uses MapReduce for processing large amounts of data. Every job is distributed as map tasks and reduce tasks. A MapReduce job needs the map and reduce functions and the input and output file locations [10]. MapReduce splits a job into small tasks, hence it processes data in less time. In this project, MapReduce is used with mapper, reducer, and driver classes to process 2 queries.

Sqoop: Sqoop is an interface for transferring bulk data between relational databases and Hadoop. Internally it uses MapReduce jobs to run the transfer in parallel, which gives high data transfer throughput. In this project, Sqoop is used to transfer data from MySQL to HDFS.

Hive: Hive turns Hadoop into a data warehouse, which makes querying and analyzing data efficient. Hive borrows concepts such as tables, entity relationships, columns, and primitive data types from relational database management systems (RDBMS). Hive has its own declarative language, HiveQL (HQL), which is very similar to SQL. Hive is unusual in that it stores table schemas in a database (the metastore) while the data itself is stored on HDFS [11]. Hive is used to process some of the business queries in HQL.

Pig: Pig, which was developed at Yahoo, uses MapReduce indirectly. Pig uses Pig Latin, a high-level programming language similar to SQL. Pig has local and distributed execution environments. To execute Pig queries, the content needs to be saved in an input file in semi-structured format [12]. In this project, Pig is used to answer some of the complex queries. The input file is copied to HDFS using Sqoop, and a Pig command is executed to compute the query result from the file. Data is loaded using PigStorage and the output is stored in HDFS.
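The mapper, reducer, and driver roles described under MapReduce above can be illustrated with a small single-process simulation. This is a hedged sketch in plain Python rather than the project's Java and Hadoop code: the function names, the `country,year,co2` record layout, and the sample values are assumptions chosen to mirror the "max CO2 emission in each country" query.

```python
from collections import defaultdict


def map_record(line):
    """Mapper: emit a (country, co2) pair for one 'country,year,co2' line."""
    country, _year, co2 = line.split(",")
    yield country, float(co2)


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(country, values):
    """Reducer: keep the largest CO2 value seen for a country."""
    return country, max(values)


def run_job(lines):
    """Driver: wire the map, shuffle, and reduce phases together."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_max(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input records, not taken from the project's data set.
lines = ["India,2005,1.2", "India,2006,1.3", "Brazil,2005,1.9", "Brazil,2006,1.8"]
result = run_job(lines)
print(result)
```

Emitting the year instead of the country as the key in `map_record` would give the per-year maximum, the second query listed for MapReduce in Table I.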
R Language: The R programming language is used in RStudio for cleaning the data sets. Different R functions, for example gather() and gsub(), are used for cleaning, and then the data is loaded.

Java Language: Hadoop MapReduce uses the Java programming language for the map and reduce functionality. In this project the mapper, reducer, and driver are implemented in Java.

Tableau: The Tableau visualization tool is used to visualize all query outputs. Different types of graphs are used to explain the query results.

VI. RESULTS

Fig. 6: Max CO2 emission for each country
Query implemented in MapReduce using Java.
Analysis: Fig. 6 shows that the United Arab Emirates and Portugal are the countries with maximum CO2 emission. The graph also shows that Brazil, Italy, India, and Greenland have comparatively low CO2 emission values.

Fig. 7: CO2 emission for every year
Query implemented in MapReduce using Java.
Analysis: Fig. 7 shows the CO2 emission value for all years. The graph shows that the rate of CO2 emission has been decreasing since 2006, so CO2 emission has decreased in recent years. To resolve the global warming issue, countries are now actively taking action to decrease pollution.

Fig. 8: Country count with max cities in global warming data
Query implemented in Pig.
Analysis: Fig. 8 shows the country count in the global warming data and the corresponding temperature. Chile, India, and Bangladesh have the maximum count, hence they show a greater global warming effect than other countries in the world.

Fig. 9: Average value of global warming temperature
Query implemented in Pig.
Analysis: Fig. 9 shows the average global warming temperature for all countries. Sudan, Vietnam, and Somalia show the highest average temperature, whereas South Korea, Germany, and Ethiopia have the lowest. Hence Germany and South Korea do not show major global warming effects.

Fig. 10: Countries with avg temperature greater than 29.5
Query implemented in Hive.
Analysis: This query lists all the countries with a global warming temperature greater than 29.5. Pakistan, Iraq, Iran, Sudan, and Saudi Arabia are the warmest countries and show a major global warming effect. In total 17 countries in the world have a temperature greater than 29.5, which is a major concern that needs to be resolved. All the countries listed are South Eastern countries.

Fig. 11: Country count which are listed majorly for global warming
Query implemented in Hive.
Analysis: This query shows that Iraq, Saudi Arabia, Pakistan, India, China, and Turkey are the countries suffering most from the global warming problem. These countries are also listed in the CO2 emission country list. This supports the conclusion that CO2 emission affects global warming.

Fig. 12: Countries with temperature uncertainty in change in temperature
Query implemented in Hive.
Analysis: This query shows which countries are most stable in temperature change. India, China, Brazil, and Turkey show unstable temperatures throughout the year, hence these countries exhibit uncertainty in temperature change. Egypt, Syria, South Africa, Kenya, and Japan are more stable and have the lowest change in temperature values.

Fig. 13: Top 5 Countries with maximum temperature reached
Query implemented in Hive.
Analysis: This result shows that London, Istanbul, and Kiev have reached the maximum temperature in past years due to the global warming effect. London and Berlin do not have a maximum temperature in all months, but they have reached a peak value compared to other countries.
Istanbul has the maximum temperature in all months, as the other query results show, and has also reached the maximum temperature.

All the above results show that the countries with maximum CO2 emission have the major global warming effect. Hence CO2 is the major contributing parameter for global warming temperature. South-East countries are also seen to suffer most from global warming. The change in temperature is a variable that is also directly related to the global warming problem.

VII. CHALLENGES AND LIMITATION

1. Challenges faced during the automation process while storing output into HBase and automating the shell script.
2. Delays when running queries on big data using MapReduce.
3. Formation of complex queries in Hive.
4. Virtual machine speed when running all technologies.
5. All UI interfaces on OpenStack.
6. Interfacing machine learning techniques with VirtualBox.

VIII. CONCLUSION AND FUTURE WORK

Structured global warming and CO2 data are used to analyze the countries affected by global warming and the impact of CO2 emission. It is found that CO2 emission has an effect on the global warming data. In the future, more related data, for example ozone layer values and other pollutant attributes, can be collected for further analysis. The big data techniques used here are Hadoop, HBase, Hive, Pig, and MapReduce. Pig was observed to perform better in terms of execution time for big queries and query complexity. MapReduce performed well in query processing tasks. In the future, more techniques can be explored, for example Spark and Impala.

REFERENCES

[1] V. Chang, “Towards data analysis for weather cloud computing,” Knowledge-Based Systems, vol. 127, pp. 29–45, 2017.
[2] K. A. Ismail, M. Abdul Majid, J. Mohamed Zain, and N. A. Abu Bakar, “Big Data prediction framework for weather Temperature based on MapReduce algorithm,” ICOS 2016 - 2016 IEEE Conference on Open Systems, pp. 13–17, 2017.
[3] “Processing performance on apache pig, apache hive and mysql cluster,” Proceedings of International Conference on Information, Communication Technology and System (ICTS) 2014, p. 297, 2014.
[4] S. Navadia, P. Yadav, and J. Thomas, “Measuring and Analyzing Weather Data,” pp. 414–417, 2017.
[5] C. Yaiprasert, “Climate situation in 5 top-rated tourist attractions in Thailand investigated by using big data RSS feed and programming,” Walailak Journal of Science and Technology, vol. 15, no. 5, pp. 371–385, 2018.
[6] J. L. Schnase, D. Q. Duffy, G. S. Tamkin, D. Nadeau, J. H. Thompson, C. M. Grieg, M. A. McInerney, and W. P. Webster, “Merra analytic services: Meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service,” Computers, Environment and Urban Systems, vol. 61, no. Part B, pp. 198–211, 2017.
[7] B. Mandal, S. Sethi, and R. K. Sahoo, “Architecture of efficient word processing using hadoop mapreduce for big data applications,” in 2015 International Conference on Man and Machine Interfacing (MAMI), pp. 1–6, Dec 2015.
[8] X. Chen, L. Hu, L. Liu, J. Chang, and D. L. Bone, “Breaking down hadoop distributed file systems data analytics tools: Apache hive vs. apache pig vs. pivotal hwaq,” in 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), pp. 794–797, June 2017.
[9] W.-j. Lu, S. Kawasaki, and J. Sakuma, “Using Fully Homomorphic Encryption for Statistical Analysis of Categorical, Ordinal and Numerical Data,” 2017.
[10] Y. Takata, T. Hosaka, and H. Ohnuma, “Boosting Approach To Early Bankruptcy Prediction From Multiple-Year Financial Statements,” Asia Pacific Journal of Advanced Business and Social Studies, vol. 3, no. 2, 2017.
[11] E. L. Lydia and M. B. Swarup, “Big data analysis using hadoop components like flume, mapreduce, pig and hive,” International Journal of Computer Science Engineering Technology, vol. 5, no. 11, p. 390, 2015.
[12] “Comparison of data processing tools in hadoop,” 2016 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), p. 238, 2016.