Spark, Hadoop, Presto Comparison
-:Spark:-
Goal:-
Extend the MapReduce model to better support two common classes of analytic
applications:
» Iterative algorithms (machine learning, graphs).
» Interactive data mining (using Hive, Pig;
http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923).
Developed By:-UC Berkeley AMPLab.
Advantages:-
Spark is an open-source cluster computing system that aims to make data
analytics faster.
A job can load data into memory and query it repeatedly, much more quickly than
with disk-based systems such as Hadoop MapReduce.
Spark provides clean, concise APIs in both Scala and Java.
Although Spark is a new engine, it can access any data source supported by Hadoop,
making it easy to run over existing data.
Companies using Spark include Conviva, Klout, and Quantifind.
Spark is open source under a BSD license.
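The in-memory advantage noted above can be sketched without a cluster. The example below is a minimal plain-Scala illustration, not Spark code: `loadFromSource`, `cached`, and `countContaining` are hypothetical names, and the sample log lines are made up. It only shows the idea of reading a dataset once and answering several queries from memory; in Spark itself the analogous step is calling `cache()` on a dataset before querying it repeatedly.

```scala
object InMemoryQuerySketch {
  // Counts how many times the (hypothetical) slow data source is read.
  var loads = 0

  // Hypothetical stand-in for reading a dataset from disk or HDFS.
  def loadFromSource(): Vector[String] = {
    loads += 1
    Vector("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout")
  }

  // Load once, on first use, and keep the data in memory for later queries.
  lazy val cached: Vector[String] = loadFromSource()

  // Each query runs against the in-memory copy, not the original source.
  def countContaining(term: String): Int = cached.count(_.contains(term))

  def main(args: Array[String]): Unit = {
    println(countContaining("ERROR"))
    println(countContaining("WARN"))
    println(s"source reads: $loads")
  }
}
```

However many queries are issued, the source is read only once, which is the behaviour a disk-based job would not get for free.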
Disadvantages:-
It targets only two classes of problems: i) iterative algorithms (machine learning, graphs) and
ii) interactive data mining.
It depends on Hadoop and the Hadoop file system (HDFS), and supports file
systems only through HDFS.
A common business scenario is the need to store and query large data sets
(http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923).
Only a few people are using it.
Resources are scarce, and there is still no stable release; the current version is
Spark 0.6.2.
Use-Case:-(WordCount)
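// `spark` is assumed to be an existing SparkContext.
val file = spark.textFile("hdfs://…")
// Split each line into words, pair each word with 1, then sum the counts per word.
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")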
Reference-links:-
http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
http://spark-project.org/documentation/
-:Presto:-
Goal:-
Presto extends the freely available R software (on top of Hadoop) with
language primitives for scalability, distributed parallelism, and continuous
analytics.
It is again an extension of Hadoop.
It targets the following problems:
i) matrix operations, and
ii) graph algorithms, which manipulate the graph's adjacency matrix.
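To make the link between graph algorithms and matrix operations concrete, here is a minimal single-machine sketch in plain Scala. It is not Presto or R code. It assumes a simplified PageRank as the example graph algorithm, a dense adjacency matrix where `adj(i)(j) = 1.0` means an edge from node i to node j, and a damping factor of 0.85; all names are hypothetical. Nodes without outgoing edges are not given special treatment.

```scala
object AdjacencyMatrixSketch {
  type Matrix = Vector[Vector[Double]]

  // Multiply a square matrix by a column vector.
  def multiply(m: Matrix, v: Vector[Double]): Vector[Double] =
    m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

  // Build the transition matrix t, where t(j)(i) = adj(i)(j) / outDegree(i),
  // i.e. the share of node i's rank that flows to node j.
  def transition(adj: Matrix): Matrix = {
    val n = adj.length
    val outDegree = adj.map(_.sum)
    Vector.tabulate(n, n) { (j, i) =>
      if (outDegree(i) == 0.0) 0.0 else adj(i)(j) / outDegree(i)
    }
  }

  // Simplified PageRank: repeat one matrix-vector multiplication per iteration.
  def pageRank(adj: Matrix, damping: Double = 0.85, iterations: Int = 50): Vector[Double] = {
    val n = adj.length
    val t = transition(adj)
    val start = Vector.fill(n)(1.0 / n)
    (1 to iterations).foldLeft(start) { (rank, _) =>
      multiply(t, rank).map(r => (1 - damping) / n + damping * r)
    }
  }

  def main(args: Array[String]): Unit = {
    // Three nodes in a cycle: 0 -> 1, 1 -> 2, 2 -> 0.
    val cycle: Matrix = Vector(
      Vector(0.0, 1.0, 0.0),
      Vector(0.0, 0.0, 1.0),
      Vector(1.0, 0.0, 0.0)
    )
    println(pageRank(cycle))
  }
}
```

Each iteration is a single matrix-vector product, which is why a system with fast, distributed matrix primitives can express this kind of graph algorithm directly.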
Advantages:-
Reported to be 20 times faster than Hadoop MapReduce
(http://www.hpl.hp.com/research/presto.htm).
Presto efficiently executes complex algorithms such as machine learning and
graph processing.
By extending R, Presto allows programmers to leverage optimized math
libraries and reuse the many freely available R packages.
Disadvantages:-
It is licensed software from HP.
It does not offer a solution for all types of problems.
It is not customizable either.
It supports only the R language.
The Spark project is only query-based and is not useful for data manipulation.
-:Cloudera Impala:-
It provides SQL operations on top of Hadoop.
It is query-based, like SQL (select * from table).
It is not useful for our use case.
It is useful with HBase, Hive, and Pig.
-:Apache Hadoop:-
Apache Hadoop is an open-source software framework that supports data
computation in a distributed environment.
It supports the running of applications on large clusters of commodity
hardware.
Hadoop implements a computational paradigm named MapReduce, where the
application is divided into many small fragments of work, each of which may
be executed or re-executed on any node in the cluster.
Both map/reduce and the distributed file system are designed so that node
failures are automatically handled by the framework.
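The MapReduce paradigm and the re-execution of failed fragments described above can be sketched on a single machine. The example below is a minimal plain-Scala illustration, not Hadoop code: `mapFragment`, `shuffle`, `reduceKey`, and `runWithRetry` are hypothetical names, word count is an assumed example job, and the retry limit of 3 is arbitrary. Re-running a fragment is safe here because the map function has no side effects.

```scala
import scala.util.{Failure, Success, Try}

object MapReduceSketch {
  // Map phase: one input fragment (a line) becomes (word, 1) pairs.
  def mapFragment(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1))

  // Shuffle phase: group the intermediate pairs by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: combine all values collected for one key.
  def reduceKey(counts: Seq[Int]): Int = counts.sum

  // Re-execute a failed fragment until it succeeds or attempts run out.
  @annotation.tailrec
  def runWithRetry[A](attemptsLeft: Int)(task: () => A): A =
    Try(task()) match {
      case Success(result) => result
      case Failure(_) if attemptsLeft > 1 => runWithRetry(attemptsLeft - 1)(task)
      case Failure(error) => throw error
    }

  // Full job: map every fragment (with retry), shuffle, then reduce per key.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val intermediate = lines.flatMap(line => runWithRetry(3)(() => mapFragment(line)))
    shuffle(intermediate).map { case (word, counts) => (word, reduceKey(counts)) }
  }

  def main(args: Array[String]): Unit = {
    println(wordCount(Seq("a b a", "b c")))

    // Simulate a fragment whose first execution fails and is then re-executed.
    var calls = 0
    val flaky = () => {
      calls += 1
      if (calls == 1) throw new RuntimeException("simulated node failure")
      mapFragment("a b a")
    }
    println(runWithRetry(3)(flaky))
  }
}
```

In a real cluster the framework, not the application, decides where a fragment runs and when it is re-executed; the sketch only shows why independent, side-effect-free fragments make that possible.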
Hadoop is written in the Java programming language and is a top-level Apache
project being built and used by a global community of contributors.
It scales linearly in the ideal case and is designed for cheap, commodity
hardware.
It supports multiple file systems, not only HDFS.
There are a number of companies offering commercial implementations and/or
providing support for Hadoop.
Cloudera offers CDH (Cloudera's Distribution including Apache Hadoop) and
Cloudera Enterprise.
IBM offers WebSphere eXtreme Scale (formerly ObjectGrid), which
includes two styles of the Hadoop MapReduce pattern in its "agents"
(a.k.a. DataGrid) APIs.
EMC released EMC Greenplum Community Edition and EMC Greenplum HD
Enterprise Edition in May 2011. The community edition.
Besides Facebook and Yahoo!, many other organizations are using Hadoop to
run large distributed computations:
Amazon.com, Ancestry.com, Akamai, American Airlines, AOL, Apple, eBay,
Hortonworks, Federal Reserve Board of Governors, Foursquare, Fox
Interactive Media, Gemvara, Google, Hewlett-Packard, IBM, etc.
We can scale up our storage and our computation by adding memory and
storage.
The main advantage of Hadoop is "batch processing".
Disadvantages:-
It is not well suited to a large number of small files.
It is not well suited to OLTP workloads (for example, online banking transactions).
Cluster management is hard: in a cluster, operations such as debugging,
distributing software, and collecting logs are difficult.
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop
Reference-links:-
http://hadoop-sandy.blogspot.in/2012/12/understanding-hadoop-clusters-and.html