Next generation technologies
         (The best way to jump into a parade is to jump in front of one that is already going)




We are going to talk about the framework that backs the technological infrastructure of the biggest players in the internet world; some of them are shown in the following image. These are just a few of the biggest names; there are many more on this list. Here we are talking about next-generation computing technology, which offers scalability, fault tolerance, and much more. The term “cloud” will not be unheard of for you, but here I am going to talk about a technology that is the backbone of cloud and distributed computing. Now you may be thinking: what is that technology? The technology we are going to discuss is called “Hadoop”. The best thing about this technology is that it is open source and readily available, so you can contribute to it, experiment with it, and use it.
As the Apache website says: “The Apache™ Hadoop™ project develops open-source software for reliable,
scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using a simple programming model. It is designed to
scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.”

Let’s talk about some of its best features first:

    •   High scalability.
    •   High availability.
    •   High performance.
    •   Handling of multi-dimensional data storage.
    •   Handling of distributed storage.

Let’s first look at some scenarios in the internet world:
How much data do you think an internet player needs to process? Do you know how much data
Twitter processes daily? It is about 7 TB per day. How much time would it take a general computer
just to read that much data?




About 4 hours, and that is just for reading, not processing. Can you imagine processing all of
Twitter’s data? Would it not take years?
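(A rough back-of-the-envelope check, assuming a single machine can sustain about 500 MB/s of sequential read throughput: 7 TB is roughly 7,000,000 MB, and 7,000,000 MB ÷ 500 MB/s ≈ 14,000 seconds, which is close to 4 hours just to read the data once, before any processing happens.)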




So here comes Hadoop into play, which sorts a petabyte in 16.25 hours and a terabyte of data in 62 seconds. Isn’t that a good choice? Yes, it certainly is. Likewise, think about the amount of data Facebook, Google, and Amazon need to process daily.
The best thing about a Hadoop setup is that you don’t need special, costly, high-end servers; rather, you can build a Hadoop cluster out of commodity computers. Keep adding computers and keep increasing storage and processing power.
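To give a feel for how little ceremony “keep adding computers” involves, here is a minimal sketch of how a classic Hadoop cluster grows (file locations assume an older Hadoop 1.x-style layout, and the hostnames are hypothetical): new worker machines are simply listed in the slaves file, and HDFS replication is a single property.

    # conf/slaves -- one worker hostname per line; growing the cluster is
    # mostly a matter of adding a line here and starting the daemons
    node01.example.com
    node02.example.com
    node03.example.com

    <!-- conf/hdfs-site.xml: keep 3 copies of every block so that losing
         any single commodity node does not lose data -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>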




So ultimately, here are some points for “Why Hadoop?”

    •   Need to process multi-petabyte datasets.
    •   It is expensive to build reliability into each application.
    •   Nodes fail every day:
        – Failure is expected, rather than exceptional.
        – The number of nodes in a cluster is not constant.
    •   Need for a common infrastructure:
        – Efficient, reliable, open source (Apache License).
    •   The above goals are the same as Condor’s, but:
        – Workloads are IO-bound, not CPU-bound.

Hadoop basically depends on the following concepts (a small code sketch follows the list):

    1. Hadoop Common (Base)
               Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop
       Common includes FileSystem, RPC, and serialization libraries.

    2. HDFS (Hadoop Distributed File System) (File system)
               The Hadoop Distributed File System (HDFS™) is the primary storage system used by
       Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on
       compute nodes throughout a cluster to enable reliable, extremely rapid computations.

    3. MapReduce (Code)
               Hadoop MapReduce is a programming model and software framework for writing
       applications that rapidly process vast amounts of data in parallel on large clusters of
       compute nodes.
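To make these three pieces concrete, here is a minimal sketch of the canonical word-count job, close to the example in the Hadoop documentation. It assumes a reasonably recent Hadoop release (2.x or later) on the classpath; the class name WordCount and the input/output paths are illustrative. The FileSystem call shows the Hadoop Common / HDFS side of things, and the Mapper/Reducer pair shows the MapReduce programming model: the map step emits (word, 1) pairs, and the reduce step sums them up per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: for every line of input, emit a (word, 1) pair.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum all the counts seen for the same word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hadoop Common / HDFS: the same FileSystem API works against the
            // local disk or an HDFS cluster, depending on fs.defaultFS.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Working against: " + fs.getUri());

            // Wire the MapReduce job together; input and output directories
            // (typically HDFS paths) are taken from the command line.
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a job like this is typically launched with something along the lines of “hadoop jar wordcount.jar WordCount /input /output”, after copying the input files into HDFS with “hdfs dfs -put”; the framework then takes care of splitting the input, scheduling map and reduce tasks across the cluster, and re-running tasks on nodes that fail.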

So what is it used for?

    1. Internet-scale data:
           a. Web logs: years of logs, terabytes per day.
           b. Web search: all the web pages present on this earth.
           c. Social data: all the messages, images, tweets, scraps, and wall posts generated on
               Facebook, Twitter, and other social media.

    2. Cutting edge analytics:
           a. Machine learning, data mining.

    3. Enterprise applications:
           a. Network instrumentation, mobile logs.
           b. Video and audio processing.
           c. Text mining.

    4. And lots more.

For the project’s timeline and everything else about Hadoop, these are the best places to start:

References:
http://hadoop.apache.org, http://developer.yahoo.com/hadoop/

On these sites you’ll find lots of wiki pages and links, covering how to get started with Hadoop and answering most of the how, where, and why questions.
Just visit these sites, explore them, and experiment with the next-generation technology that is going to be the backbone of the internet.

In the coming articles, we'll talk about some other technologies related to Hadoop, like HBase, Hive, Avro, Cassandra, Chukwa, Mahout, Pig, and ZooKeeper.




                                                                                                  ∞
                                                                                 Shashwat Shriparv
