-:Spark:-

Goal:-
Extend the MapReduce model to better support two common classes of analytic apps:
» Iterative algorithms (machine learning, graphs).
» Interactive data mining (using Hive, Pig, http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923).

Developed By:-
UC Berkeley AMPLab.

Advantages:-
Spark is an open source cluster computing system that aims to make data analytics faster. A job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.
Spark provides clean, concise APIs in both Scala and Java.
Although Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.
Companies using Spark include Conviva, Klout and Quantifind.
Spark is open source under a BSD license.

Disadvantages:-
It targets only two classes of problems: i) iterative algorithms (machine learning, graphs), and ii) interactive data mining.
It depends on Hadoop and the Hadoop Distributed File System (HDFS), and supports other file systems only through HDFS.
A common business scenario is the need to store and query large data sets (http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923).
Only a few people are using it, resources are scarce, and there is still no stable version; the current version is Spark 0.6.2.

Use-Case:- (WordCount)
    val file = spark.textFile("hdfs://…")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://…")

Reference-links:-
http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
http://spark-project.org/documentation/
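To see what each stage of the WordCount use case in the Spark section computes, the same pipeline can be sketched on an ordinary in-memory Scala collection. This is only an illustration under stated assumptions: it uses the standard library, not Spark, so nothing is distributed; `LocalWordCount` and `wordCount` are hypothetical names; `groupBy` followed by a per-key `reduce` stands in for Spark's `reduceByKey`; and the `filter` step is an addition that drops empty tokens.

```scala
object LocalWordCount {
  // Mirrors the stages of the Spark job on a local Seq instead of a distributed dataset.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one record per word
      .filter(_.nonEmpty)                 // addition: skip empty tokens from repeated spaces
      .map(word => (word, 1))             // pair every word with a count of 1
      .groupBy(_._1)                      // collect the pairs that share a word
      .map { case (word, pairs) =>        // sum the counts for each word
        (word, pairs.map(_._2).reduce(_ + _))
      }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to see or not to see")
    // Print the counts in alphabetical order of the words.
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (word, n) =>
      println(s"$word\t$n")
    }
  }
}
```

In the Spark version the lines come from `textFile` and the per-key sum is done by `reduceByKey` across the cluster; the shape of the computation is the same.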
-:Presto:-

Goal:-
Presto extends the freely available R software (on top of Hadoop) with language primitives for scalability, distributed parallelism and continuous analytics. It is again an extension of Hadoop.
It is a solution for the following problems: i) matrix operations, and ii) graph algorithms, which manipulate the graph's adjacency matrix.

Advantages:-
It is reported to be 20 times faster than Hadoop/MapReduce (http://www.hpl.hp.com/research/presto.htm).
Presto efficiently executes complex algorithms such as machine learning and graph processing.
By extending R, Presto allows programmers to leverage optimized math libraries and reuse the many freely available R packages.

Disadvantages:-
It is licensed software from HP.
It does not solve all types of problems, and it is not customizable.
It supports only the R language.
The Spark project is only query based and is not useful for manipulations.

-:Cloudera Impala:-
Impala provides SQL operations on top of Hadoop. It is query based, like SQL (select * from table).
It is not useful for our use case.
It is useful with HBase, Hive and Pig.
-:Apache Hadoop:-
Apache Hadoop is an open-source software framework that supports data computation in a distributed environment. It supports running applications on large clusters of commodity hardware. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.
It scales linearly in the ideal case.
It is designed for cheap, commodity hardware.
It supports many types of file systems.
There are a number of companies offering commercial implementations and/or providing support for Hadoop. Cloudera offers CDH (Cloudera's Distribution including Apache Hadoop) and Cloudera Enterprise. IBM offers WebSphere eXtreme Scale (formerly ObjectGrid), which includes two styles of the Hadoop map-reduce pattern in its "agents", a.k.a. DataGrid APIs. EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011.
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Ancestry.com, Akamai, American Airlines, AOL, Apple, eBay, Hortonworks, Federal Reserve Board of Governors, Foursquare, Fox Interactive Media, Gemvara, Google, Hewlett-Packard, IBM, etc.
We can scale up our storage and our computation by increasing the memory and storage.
The main advantage of Hadoop is "batch processing".
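The MapReduce paradigm described above, where work is divided into small fragments that can each be run (or re-run) independently, can be sketched in plain Scala. This is a single-process illustration using only the standard collections, not Hadoop's API. `MiniMapReduce`, `run`, `mapTask` and `fragmentSize` are hypothetical names chosen for this sketch, and there is no real scheduling, networking or failure handling.

```scala
object MiniMapReduce {
  // Runs a map-reduce style job over `input`, split into fragments of
  // `fragmentSize` records. Hypothetical helper, not part of Hadoop.
  def run[A, K, V](input: Seq[A], fragmentSize: Int)(
      mapper: A => Seq[(K, V)])(reducer: (V, V) => V): Map[K, V] = {
    require(fragmentSize > 0, "fragmentSize must be positive")

    // 1. Divide the application's input into many small fragments of work.
    val fragments: Seq[Seq[A]] = input.grouped(fragmentSize).toSeq

    // 2. A map task depends only on its own fragment, so it could run on any
    //    node and be re-executed after a failure with the same result.
    def mapTask(fragment: Seq[A]): Seq[(K, V)] = fragment.flatMap(mapper)
    val mapped: Seq[(K, V)] = fragments.flatMap(mapTask)

    // 3. Shuffle: bring together all values emitted for the same key.
    val shuffled: Map[K, Seq[V]] =
      mapped.groupBy(_._1).map { case (key, pairs) => (key, pairs.map(_._2)) }

    // 4. Reduce: combine the values of each key into a single result.
    shuffled.map { case (key, values) => (key, values.reduce(reducer)) }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to see or not to see", "to be seen")
    // Word count expressed as a mapper (line => (word, 1) pairs) and a reducer (sum).
    val counts = run(lines, fragmentSize = 2)(
      line => line.split(" ").toSeq.map(word => (word, 1)))(_ + _)
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

Because each map task sees only its own fragment, the result does not depend on how the input is split; that independence is what lets a framework re-run a fragment on another node when one fails.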
Disadvantages:-
It is not well suited to a large number of small files.
It is not well suited to OLTP workloads (example: online banking transactions).
Cluster management is hard: in a cluster, operations like debugging, distributing software and collecting logs are difficult.
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop

Reference-links:-
http://hadoop-sandy.blogspot.in/2012/12/understanding-hadoop-clusters-and.html