project_report
ACKNOWLEDGEMENT

I am extremely grateful to Banaras Hindu University, Varanasi for providing an excellent environment in which to undergo my summer internship. I would like to express my heartfelt thanks to Mr. KarmVeer Singh (System Engineer, C.C., B.H.U.) for accepting me as a trainee in the summer internship program and for sharing his knowledge and experience with me. I would also like to thank all the officers and staff members of Banaras Hindu University, Varanasi for their full cooperation during the internship. This acknowledgement would be incomplete without mentioning the enormous amount of help I received, in one way or another, from the web community.

Date:
Place: Varanasi

Abhijeet Saxena, DIT University, Dehradun
Rajat Agrawal, JECRC University, Jaipur
INTRODUCTION

It is not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006 and at 1.2 zettabytes in 2010. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. Data is generated everywhere today, be it forecasting, media, or research, and the volume of data being made publicly available increases every year. Organizations no longer have to merely manage their own data; they must also be able to extract value from it. Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kind of thing that becomes possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by its creator.

The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

Reading the data from a single drive therefore takes a lot of time. Imagine instead having all that data distributed among 100 disks, each disk holding 1/100th of the data. Working in parallel, we could reduce the data access time substantially. However, working in parallel brings problems of its own. The first is hardware failure: adding more hardware devices increases the chance of a device failing. The second is that most analysis tasks need to combine the data in some way; data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.
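The arithmetic behind this comparison is easy to check. The fragment below is a minimal sketch (plain Java, not one of the project modules; the drive capacity and transfer rate are simply the illustrative figures quoted above) of the single-drive read time versus reading 100 drives in parallel.

// ReadTimeEstimate.java - back-of-the-envelope check of the read times quoted above
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double capacityMB = 1000000.0;   // roughly one terabyte, as in the example above
        double rateMBps = 100.0;         // roughly 100 MB/s sustained transfer speed

        double serialMinutes = capacityMB / rateMBps / 60.0;
        // with 100 disks, each holding 1/100th of the data and read in parallel,
        // the elapsed time is roughly the time needed to read one share
        double parallelMinutes = (capacityMB / 100.0) / rateMBps / 60.0;

        System.out.printf("Single drive : %.1f minutes%n", serialMinutes);   // about 167 minutes
        System.out.printf("100 drives   : %.1f minutes%n", parallelMinutes); // under 2 minutes
    }
}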
PROBLEM STATEMENT

The river Ganga is the lifeline of the Indo-Gangetic plain. It acts as a source of life for people residing in the states of Uttarakhand, Uttar Pradesh, Bihar and West Bengal. It not only gives life but is also one of the main rivers into which industrial effluents are discharged. Owing to the degrading condition of the river water, many projects were launched to clean the river; one of the major projects was the Ganga Action Plan. Efforts were made to clean the water, and an annual report was sent to the ministry informing it of the river's condition. Public pressure kept increasing, and so more and more money was spent on the cleanliness of the river. However, the reports of the last ten years tell a different story.

The climatic conditions of the region directly affect the water of the river. The annual rainfall determines how much the river cleans itself on its own, and the temperature of the region directly affects the temperature of the river. Any change in the quality of the water directly affects the population relying on it. We therefore decided to study the change in birth rate and death rate of the region and correlate it with the change in water quality, which would give an idea of how much the water has changed. Since the water is affected by climatic conditions, rainfall and temperature were taken into consideration as well.
OBJECTIVES

• Calculating the maximum and minimum average rainfall and temperature of the region.
• Calculating the maximum and minimum pH, temperature, faecal coliform and total coliform levels for the water of the river Ganga.
• Calculating the total change in birth rate and death rate of the region.
• Developing an efficient program for the calculation and processing of the data that could be extended to process 100 years of data.

DATABASE
Three databases were taken into consideration:

• Climatic database
• Water quality database
• Population change database

Climatic Database

The climate of the region was measured using two parameters:

1. Rainfall: The rainfall data belonged to the Central North East region. Data for 20 years, i.e. 1987-2007, was taken into consideration. The data consisted of the monthly rainfall for each year, in two parts: maximum and minimum rainfall. The average rainfall for each year was calculated, and the maximum and minimum averages over the past 20 years were reported.

2. Temperature: The temperature data belonged to the North Central region. Data for 20 years, i.e. 1987-2007, was taken into consideration. The data consisted of the monthly average temperature for each year, in two parts: maximum and minimum temperature. The average temperature for each year was calculated, and the maximum and minimum averages over the past 20 years were reported.

Water Quality Database

The data regarding the river Ganga consisted of various parameters for the years 2002 to 2012. The parameters taken into consideration for the study included pH, temperature, faecal coliform and total coliform.

• pH indicates the acidity level of the water.
• Temperature of the water.
• Faecal coliform indicates the pathogen level in the water due to human and animal waste.
• Total coliform indicates the total pathogen level in the water.
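As a concrete illustration of the averaging step described for the climatic database above, the fragment below is a minimal standalone sketch (plain Java, outside Hadoop; it assumes one record per year holding twelve monthly values, and the numbers themselves are made up) of computing each year's average and then reporting the maximum and minimum averages.

// AverageExtremes.java - simplified, non-MapReduce sketch of the yearly average calculation
public class AverageExtremes {
    public static void main(String[] args) {
        // each row: year followed by twelve monthly values (illustrative numbers only)
        double[][] rows = {
            {1987, 20, 22, 27, 31, 34, 35, 33, 32, 31, 29, 25, 21},
            {1988, 19, 23, 28, 32, 35, 36, 34, 33, 30, 28, 24, 20},
        };
        double maxAvg = -Double.MAX_VALUE, minAvg = Double.MAX_VALUE;
        int maxYear = 0, minYear = 0;
        for (double[] row : rows) {
            double sum = 0;
            for (int m = 1; m <= 12; m++) sum += row[m];   // twelve monthly values
            double avg = sum / 12.0;                        // yearly average
            if (avg > maxAvg) { maxAvg = avg; maxYear = (int) row[0]; }
            if (avg < minAvg) { minAvg = avg; minYear = (int) row[0]; }
        }
        System.out.println("Maximum average: " + maxYear + " -> " + maxAvg);
        System.out.println("Minimum average: " + minYear + " -> " + minAvg);
    }
}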
Population Database

The population database consisted of the census reports for various states and their cities. Only three states were considered: Uttarakhand, Uttar Pradesh and West Bengal. The census report consisted of the birth rate, death rate and various other parameters for the years 2010-2011 and 2011-2012. The parameters taken into consideration were the change in birth rate and the change in death rate.
DEVELOPMENT AND TESTING

The databases contained an enormously large amount of data, and processing it with standard sequential algorithms would take a lot of time. To reduce the processing time, the MapReduce programming model was used: a MapReduce program was created to process each database. MapReduce programs run on the Hadoop framework; each module was run on a Hadoop installation consisting of a single node, and the output was stored in a file on HDFS (the Hadoop Distributed File System).

Testing requirements:
1. Linux system
2. Hadoop framework
3. JDK version 1.6 or above

Testing procedure (for one module):

1. First the Java program was compiled and a jar file was created.
   javac -classpath <hadoop library path> -d <directory name> <source file name>
   jar -cvf <archive name> -C <directory name> <class file directory>

2. After creation of the jar file, the input file was placed in the Hadoop file system.
   hadoop fs -put <source directory>/<filename> <destination directory>/<filename>

3. Once the input file was in the Hadoop file system, the jar file was executed.
   hadoop jar <jar file name> <class name> <input file directory> <output file directory>

4. After the execution completed, the output files were generated in the output directory. The output can be viewed with the cat command on HDFS.
   hadoop fs -cat <output file directory>/<output file name>

The same procedure was applied to all the modules listed below.
-MaxTemp
-MinTemp
-MaxRainfall
-MinRainfall
-PHScale
-FaecalColiform
-TotalColiform
Result Analysis and Conclusion

Rainfall and temperature have no direct effect on the water quality of a river, except on the water temperature. However, the parameters that we took into consideration were indirectly affected by rainfall and temperature. The maximum water temperature was observed in 2005, reaching 39 degrees Celsius in some areas. The maximum average temperature in the region has remained fairly constant, varying between 30.11 and 31.45. It was observed that the water temperature has nevertheless risen, due to industrial effluents being discharged into the river without proper treatment.

The pH level of the water reached a maximum of 9.1 in the years 2011 and 2012. A pH increase is an indication that during rainfall, water containing bases from the land runs off into the river, thereby changing the pH level. This rainwater run-off is also regarded as a major reason for an increase in coliform levels. The coliform levels in 2002 and 2003 do not differ much, but the rainfall in those years does. The coliform count in 2002 ranged from 300 to 25 x 10^5, and 2002 experienced the minimum average rainfall, 772.17 per sq km, among the 20 years considered in the database. The coliform count in 2003 ranged from 47 to 45 x 10^5, while 2003 had the maximum rainfall, 1177 per sq km. These rainfall statistics contradict the common belief that rainwater carries pathogens from the land into the river water.

Water is the main resource for the sustenance of life in the states of Bihar, Uttar Pradesh and Uttarakhand. However, the water quality has shown a declining trend in the past years. The increase in pH level and coliform count is an indicator that the water is no longer fit for use, and the change in birth rate for these states suggests the same. The state of Bihar had a birth rate change of -15.1 between the years 2010-2011 and 2011-2012. This decline in birth rate was also seen in Uttar Pradesh and Uttarakhand, with changes of -34.1 and -4 respectively. The death rate shows a contradictory picture: the death rate in Bihar has declined, and the same is seen in Uttar Pradesh and Uttarakhand.

State            Change in death rate (2011-2012 vs 2010-2011)
Bihar            -9.5
Uttar Pradesh    -14.3
Uttarakhand      -2.4

It can be said that improvements in medical facilities are a factor in the decline of the death rate in these regions. In the end, we conclude that the sample data is too small, and the study should be extended to larger-scale data before any significant conclusion can be drawn.
-MaxTemperature Module

package MaxTemperature;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author abhijeet
 */
public class Maxtemp {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Maxtemp <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(Maxtemp.class);
        job.setJobName("Maxtemp");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxtempMap.class);
        job.setReducerClass(MaxtempReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper<KeyIn, ValueIn, KeyOut, ValueOut>
    public static class MaxtempMap extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();              // one input line
            String[] parts = line.split("\t");           // tab-separated fields
            String year = line.substring(5, 9);          // year at a fixed offset in the line
            int sum = 0;
            for (int i = 1; i < 13; i++)
                sum = sum + Integer.parseInt(parts[i]);  // twelve monthly temperatures
            float avg = sum / 12.0f;                     // average temperature of the year
            // emit a single key so the reducer sees all years together
            context.write(new Text("Max"), new Text(year + "\t" + Float.toString(avg)));
        }
    }

    // Reducer<KeyIn, ValueIn, KeyOut, ValueOut>
    public static class MaxtempReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            float maxValue = -Float.MAX_VALUE;
            String year = null;
            for (Text value : values) {
                String[] prt = value.toString().split("\t"); // [year, average]
                float avg = Float.parseFloat(prt[1]);
                if (avg > maxValue) {                        // keep the year with the highest average
                    maxValue = avg;
                    year = prt[0];
                }
            }
            context.write(key, new Text(year + "-" + Float.toString(maxValue)));
        }
    }
}
-MinTemperature Module

package MinTemperature;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author abhijeet
 */
public class MintempMain {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Mintemp <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MintempMain.class);
        job.setJobName("Mintemp");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MintempMap.class);
        job.setReducerClass(MintempReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper<KeyIn, ValueIn, KeyOut, ValueOut>
    public static class MintempMap extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();              // one input line
            String[] parts = line.split("\t");           // tab-separated fields
            String year = line.substring(5, 9);          // year at a fixed offset in the line
            int sum = 0;
            for (int i = 1; i < 13; i++)
                sum = sum + Integer.parseInt(parts[i]);  // twelve monthly temperatures
            float avg = sum / 12.0f;                     // average temperature of the year
            // emit a single key so the reducer sees all years together
            context.write(new Text("Min"), new Text(year + "\t" + Float.toString(avg)));
        }
    }

    // Reducer<KeyIn, ValueIn, KeyOut, ValueOut>
    public static class MintempReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            float minValue = Float.MAX_VALUE;
            String year = null;
            for (Text value : values) {
                String[] prt = value.toString().split("\t"); // [year, average]
                float avg = Float.parseFloat(prt[1]);
                if (avg < minValue) {                        // keep the year with the lowest average
                    minValue = avg;
                    year = prt[0];
                }
            }
            context.write(key, new Text(year + "-" + Float.toString(minValue)));
        }
    }
}
-BirthRate Module

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author abhijeet
 */
public class Birthrate {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Birthrate <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(Birthrate.class);
        job.setJobName("Birthrate");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(BirthMap.class);
        job.setReducerClass(BirthReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class BirthMap extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");   // comma-separated census record
            // key: state name, value: change in birth rate (column index 4 of the record)
            context.write(new Text(parts[0]), new FloatWritable(Float.parseFloat(parts[4])));
        }
    }

    public static class BirthReduce extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float sum = 0f;                                  // total change for the state
            for (FloatWritable value : values) {
                sum += value.get();
            }
            context.write(key, new FloatWritable(sum));
        }
    }
}
-DeathRate Module

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author abhijeet
 */
public class Deathrate {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Deathrate <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(Deathrate.class);
        job.setJobName("Deathrate");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(DeathMap.class);
        job.setReducerClass(DeathReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class DeathMap extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");   // comma-separated census record
            // key: state name, value: change in death rate (column index 7 of the record)
            context.write(new Text(parts[0]), new FloatWritable(Float.parseFloat(parts[7])));
        }
    }

    public static class DeathReduce extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float sum = 0f;                                  // total change for the state
            for (FloatWritable value : values) {
                sum += value.get();
            }
            context.write(key, new FloatWritable(sum));
        }
    }
}
