Big Data and Hadoop - An Introduction

Big Data and need for hadoop

Transcript

  • 1. HADOOP - Nagarjuna K - nagarjunak@outlook.com
  • 2. Why Hadoop, and what is it? A tool to process big data.
  • 3. What is BIG Data? On Facebook, Google+, and similar services, whatever we do gets stored as data or as logs. Machines generate lots of data too: cameras, mobile phones, software such as STAAD Pro, automated machines in industry, and so on. We are having an online discussion now, and your reading of this presentation is certainly being recorded as data.
  • 4. What is BIG Data? (continued) Exponential growth of data posed challenges to Google, Yahoo, Microsoft, and Amazon. They needed to go through TBs and PBs of data to answer questions such as: Which websites and books were popular? What kind of ads appeal to users? Existing tools became inadequate for processing such large data sets.
  • 5. Why is the data so BIG? Until a couple of decades ago: floppy disks. After that: CD/DVD drives. Half a decade ago: hard drives (500 GB). Now: hard drives (1 TB) are available in abundance.
  • 6. Why is the data so BIG? So what? Even the technology to read data has taken a leap.
  • 7. Why is the data so BIG?
      Year | Device          | Data volume | Transfer speed | Time to process
      1990 | Optical drive   | 1370 MB     | 4.4 MB/s       | 5 minutes
      2012 | 1 TB SATA drive | 1 TB        | 100 MB/s       | over 2.5 hrs
  • 8. How to handle such BIG data? One big elephant, or numerous small chickens?
  • 9. How to handle such BIG data? The concept of torrents: reduce the time to read by reading from multiple sources simultaneously. Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in less than two minutes.
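
The read times quoted on slides 7 and 9 follow from dividing data volume by transfer speed. Here is a minimal sketch of that arithmetic. It assumes 1 TB = 1,000,000 MB and that all drives are read fully in parallel with no coordination overhead; the function name `read_time_seconds` is hypothetical and not part of any Hadoop API.

```python
def read_time_seconds(volume_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Time to read volume_mb when it is split evenly across `drives`
    drives that are all read at the same time (idealized, no overhead)."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return volume_mb / drives / speed_mb_per_s


TB_IN_MB = 1_000_000  # assumption: decimal units

drive_1990 = read_time_seconds(1370, 4.4)              # 1990 drive from slide 7
drive_2012 = read_time_seconds(TB_IN_MB, 100)          # single 1 TB SATA drive
hundred_drives = read_time_seconds(TB_IN_MB, 100, drives=100)  # slide 9 scenario

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"2012 drive: {drive_2012 / 3600:.1f} hours")
print(f"100 drives in parallel: {hundred_drives:.0f} seconds")
```

Under these assumptions the single 1 TB drive takes a little under three hours, while the same data spread over 100 drives takes 100 seconds, which is the "less than two minutes" on slide 9.
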
  • 10. How to handle such BIG data? Issues: How do we handle systems going up and down? How do we combine the data from all the systems?
  • 11. Problem 1: System ups and downs. Commodity hardware is used for data storage and analysis, so the chances of failure are very high. So keep redundant copies of the same data across several machines: if one machine fails, another still has the data. Google came up with a file system, GFS (Google File System), which implemented all these details.
  • 12. GFS Divides data into chunks and stores in the file System Can store data in ranges of PBs also
  • 13. Problem 2: How to combine the data? We analyze data across different machines, but how do we merge the results to get a meaningful outcome? All (or some) of the data has to travel across the network; only then can merging occur. Doing this is notoriously challenging. Again Google: MapReduce.
  • 14. MapReduce: Provides a programming model that abstracts away the problem of disk reads and writes, transforming it into a computation over keys and values. Two phases: Map and Reduce.
  • 15. So what is Hadoop? An operating system? No. It provides: 1. A reliable shared storage system. 2. An analysis system.
  • 16. History of Hadoop: Google was the first to launch GFS and MapReduce. They published a paper in 2004 announcing a brand-new technology to the world. This technology was already well proven within Google by 2004. (MapReduce paper by Google.)
  • 17. History of Hadoop: Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system, called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Hadoop is now a core part of: Facebook, Yahoo, LinkedIn, Twitter …
  • 18. History of Hadoop: GFS -> HDFS; MapReduce -> MapReduce
  • 19. HDFS -- A Brief. Design: streaming very large files on a commodity cluster. 1. Very large files: MBs to PBs. 2. Streaming: a write-once, read-many approach. Once a huge data set is in place, we tend to use the data rather than modify it, so the time to read the whole data set matters most. 3. Commodity cluster: no high-end servers. Yes, there is a high chance of failure, but HDFS is tolerant enough because the data is replicated.
  • 20. MapReduce -- A Brief. Large-scale data processing in parallel. MapReduce provides: automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring. Two phases in MapReduce: Map and Reduce.
  • 21. MapReduce -- A Brief. Map phase: map(in_key, in_value) -> list(out_key, intermediate_value). Processes an input key/value pair and produces a set of intermediate pairs. Reduce phase: reduce(out_key, list(intermediate_value)) -> list(out_value). Combines all intermediate values for a particular key and produces a set of merged output values (usually just one).
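
The map and reduce signatures on slide 21 can be illustrated with the classic word-count example. This is a single-process, in-memory sketch of the programming model only: the names `map_fn`, `reduce_fn`, and `run_job` and the sample records are hypothetical, and it does not use Hadoop's real Java API or do any distribution, scheduling, or fault handling.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(in_key: str, in_value: str) -> Iterator[tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value):
    emit (word, 1) for every word in one line of text."""
    for word in in_value.lower().split():
        yield word, 1


def reduce_fn(out_key: str, intermediate_values: Iterable[int]) -> list[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value):
    merge all counts for one word into a single total."""
    return [sum(intermediate_values)]


def run_job(records: Iterable[tuple[str, str]]) -> dict[str, list[int]]:
    """Run map over every record, group intermediate values by key
    (the step between the two phases), then run reduce once per key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            grouped[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


records = [
    ("doc1", "big data needs big tools"),
    ("doc2", "hadoop processes big data"),
]
counts = run_job(records)
print(counts)
```

The grouping step inside `run_job` is the part a real framework performs across machines: every intermediate value for a given key must reach the same reducer before reduce can run.
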
  • 22. MapReduce -- A Brief
  • 23. Hadoop Cluster
  • 24. Hadoop Ecosystems
  • 25. Version of Hadoop: We will deal with either Apache hadoop-0.20 or Cloudera hadoop (cdh3).
  • 26. Prerequisites: Core Java. Acquaintance with Linux will help. A Linux installation on your machine.
  • 27. Thank you. Please email your suggestions to nagarjunak@outlook.com
