Next generation technologies (The best way to jump into a parade is to jump in front of one that is already going) We are going to talk about the framework that backs up the technological infrastructure of the biggest players of internet world, some of them are embedded in the following image: These are just some biggest name; there are lots more in this list. Here we are talking about next generation computer technology, which has scalability, tolerance and much more features. Theterm cloud will not unheard for you but here I am going to talk about a super technological termsthat will be back bone of cloud or distributed computing. Now you may be thing what is thattechnology right? The technology that we are going to discuss is called “Hadoop”. The best thingabout the technology is its open source and readily available where you can contribute, experiment,and use.As apache web site says “The Apache™ Hadoop™ project develops open-source software for reliable,scalable, distributed computing.The Apache Hadoop software library is a framework that allows for the distributed processing oflarge data sets across clusters of computers using a simple programming model. It is designed toscale up from single servers to thousands of machines, each offering local computation and storage.Rather than rely on hardware to deliver high-availability, the library itself is designed to detect andhandle failures at the application layer, so delivering a highly-available service on top of a cluster ofcomputers, each of which may be prone to failures.Let’s talk about some best features first: High scalability. High availability. High performance. Handling Multi-dimensional data storage. Handling Distributed storage.Let’s first look on some scenarios in the internet world:
How much data can you think of which need to process by a internet player? Do you know howmuch data twitter process daily? It about 7 Tb per day. How much time will it take to process thismuch of data for a general computerAbout 4 hr. that is just for reading, not processing, can you think about processing all twitter datawill not it take years. So here comes Hadoop in play which sorts apetabyte in 16.25Hr and a terabyte of data in 62 seconds. Is not it good choice yes sure it is. Likewisethink about the amount of data Facebook, Google, amazon need to process daily.The best thing about Hadoop setup is, you don’t need special costly and high end servers rather youcan make a cluster out of Hadoop using commodity computers. Keep adding computers and keepincreasing storage and processing power.So ultimately here are some point for “Why Hadoop?” • Need to process Multi Petabyte Datasets • Expensive to build reliability in each application. • Nodes fail every day – Failure is expected, rather than exceptional. – The number of nodes in a cluster is not constant. • Need common infrastructure – Efficient, reliable, Open Source Apache License • The above goals are same as Condor, but – Workloads are IO bound and not CPU boundHadoop basically depends of following concept: 1. Hadoop – common (Base) Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
2. HDFS ( Hadoop File System)(File System) Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. 3. Map-Reduce (Code) Hadoop Map-Reduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.So what it is used for: 1. Internet scale data : a. Web Logs: years of logs Terabytes per day. b. Web search- all the webpages present on this earth. c. Social data- all the data, messages, images, tweets, scraps, wall posts generated on Facebook, Twitter, and other social media. 2. Cutting edge analytics: a. Machine learning, data mining. 3. Enterprise applications: a. Network instrumentation, mobile logs. b. Video and audio processing. c. Text mining. 4. And lots more.Lets see the timeline:
References:http://hadoop.apache.org, http://developer.yahoo.com/hadoop/This is the best place where you can find all information about Hadoop. On this website youll findlots of wiki pages links and ongoing links, from which you can get lot of information about Hadoopon how to get started with Hadoop, and all how where how to questions and their answers.Just visit this site is explore it and experiment with the next-generation technology that is going tobe the backbone of Internet.In the next coming articles, well talk about some other technologies related Hadoop likeHBase, Hive,Avro, Cassandra, Chukwa, Mahout, Pig, Zookeeper. ∞ Shashwat Shriparv