Spark, Hadoop, Presto Comparison

Spark, Hadoop, and Presto: details with advantages and disadvantages

Published in: Technology


Transcript

  • 1. -:Spark:-
    Goal:- Extend the MapReduce model to better support two common classes of analytic apps:
      » Iterative algorithms (machine learning, graphs).
      » Interactive data mining (using Hive, Pig; see http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923).
    Developed by:- UC Berkeley AMPLab.
    Advantages:-
      » Spark is an open source cluster computing system that aims to make data analytics faster.
      » A job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.
      » Spark provides clean, concise APIs in both Scala and Java.
      » Although Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.
      » Companies using Spark include Conviva, Klout, and Quantifind.
  • 2.   » Spark is open source under a BSD license.
    Disadvantages:-
      » It targets only two classes of problems: i) iterative algorithms (machine learning, graphs), ii) interactive data mining.
      » It depends on Hadoop and the Hadoop Distributed File System (HDFS), and supports file systems only through HDFS.
      » A common business scenario is the need to store and query large data sets (http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923).
      » Only a few people are using it.
      » Resources are scarce, and there is still no stable version; the current version is Spark 0.6.2.
    Use-Case:- (WordCount)
      // `spark` is an existing SparkContext; the HDFS paths are elided in the original.
      val file = spark.textFile("hdfs://…")
      val counts = file.flatMap(line => line.split(" "))  // split each line into words
                       .map(word => (word, 1))            // pair each word with a count of 1
                       .reduceByKey(_ + _)                // sum the counts for each word
      counts.saveAsTextFile("hdfs://…")
    Reference-links:-
      http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
      http://spark-project.org/documentation/
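To make the three stages of the WordCount use case concrete, here is a minimal single-machine sketch in plain Python. It is not the Spark API: the helpers `flat_map` and `reduce_by_key` and the sample input are hypothetical names introduced only to mirror what `flatMap`, `map`, and `reduceByKey` compute, and nothing here is distributed or cached in cluster memory.

```python
def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine the values of all pairs that share a key, using func."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return combined


def word_count(lines):
    """Mirror the flatMap -> map -> reduceByKey pipeline of the WordCount use case."""
    words = flat_map(lambda line: line.split(" "), lines)  # split lines into words
    pairs = [(word, 1) for word in words]                  # pair each word with 1
    return reduce_by_key(lambda a, b: a + b, pairs)        # sum the 1s per word


# Hypothetical in-memory input standing in for the text file on HDFS.
sample_lines = ["to be or", "not to be"]
counts = word_count(sample_lines)
print(counts)
```

Because `sample_lines` stays in memory, `word_count` (or any other query) can be run on it repeatedly without re-reading from disk, which is the access pattern the Spark advantages above describe.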
  • 3. -:Presto:-
    Goal:- Presto extends the freely available R software (on top of Hadoop) with language primitives for scalability, distributed parallelism, and continuous analytics. It is again an extension of Hadoop, and a solution for the following problems:
      i) matrix operations, and
      ii) graph algorithms, which manipulate the graph’s adjacency matrix.
    Advantages:-
      » Claimed to be 20 times faster than Hadoop MapReduce (http://www.hpl.hp.com/research/presto.htm).
      » Presto efficiently executes complex algorithms such as machine learning and graph processing.
      » By extending R, Presto allows programmers to leverage optimized math libraries and reuse the many freely available R packages.
    Disadvantages:-
      » It is licensed software from HP.
      » It does not solve all types of problems.
      » It is also not customizable.
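As an illustration of what "graph algorithms that manipulate the adjacency matrix" can mean, the sketch below counts two-step walks in a small directed graph by squaring its adjacency matrix. It is plain single-machine Python, not Presto or R code. The graph and the helper `mat_mul` are assumptions made up for this example.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0.
# adjacency[i][j] is 1 when there is an edge from node i to node j.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

# Entry [i][j] of the squared matrix is the number of walks of length 2
# from node i to node j.
two_step = mat_mul(adjacency, adjacency)
print(two_step)
```

For example, `two_step[0][2]` is 1 because the only two-step walk from node 0 to node 2 is 0->1->2. A system that distributes matrix operations would split this multiplication across machines; the sketch only shows the computation itself.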
  • 4.   » Only the R language is supported.
    Spark project is only query based, not useful for manipulations.
    -:Cloudera Impala:-
    SQL operations on top of Hadoop; it is query based, like SQL (select * from table). It is not useful for our use case. It is useful with HBase, Hive, and Pig.
  • 5. -:Apache Hadoop:-
    Apache Hadoop is an open-source software framework that supports data computation in a distributed environment. It supports running applications on large clusters of commodity hardware. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may
  • 6. be executed or re-executed on any node in the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors.
      » Linear scaling in the ideal case.
      » It is designed for cheap, commodity hardware.
      » It supports all types of file systems.
    A number of companies offer commercial implementations of and/or support for Hadoop:
      » Cloudera offers CDH (Cloudera's Distribution including Apache Hadoop) and Cloudera Enterprise.
      » IBM offers WebSphere eXtreme Scale (formerly ObjectGrid), which includes two styles of the Hadoop MapReduce pattern in its "agents", a.k.a. DataGrid APIs.
      » EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011.
      » Besides Facebook and Yahoo!, many other organizations are using Hadoop to
  • 7. run large distributed computations.
      » Amazon.com, Ancestry.com, Akamai, American Airlines, AOL, Apple, eBay, Hortonworks, Federal Reserve Board of Governors, Foursquare, Fox Interactive Media, Gemvara, Google, Hewlett-Packard, IBM, etc.
    We can scale up our storage and our computation by adding memory and storage.
    The main advantage of Hadoop is “batch processing”.
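The Hadoop slides describe work being split into small fragments that can be executed or re-executed on any node, with failures handled by the framework. The following is a minimal single-process sketch of that behaviour in Python. It is not Hadoop's API: `NodeFailure`, `run_with_retry`, the fragment data, and the failure schedule are all hypothetical, and a real cluster would detect failures and reschedule tasks on other machines.

```python
from collections import defaultdict


class NodeFailure(Exception):
    """Raised by the simulation when a (hypothetical) node dies mid-task."""


def run_with_retry(task, fragment, fail_first_attempts, max_attempts=3):
    """Run task(fragment), re-executing it when the simulated node fails.

    Returns the task result and the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if attempt <= fail_first_attempts:
                raise NodeFailure(f"node lost on attempt {attempt}")
            return task(fragment), attempt
        except NodeFailure:
            continue  # the framework reschedules the same fragment
    raise RuntimeError("fragment failed on every attempt")


def map_task(lines):
    """Map phase: emit a (word, 1) pair for every word in the fragment."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_task(pairs):
    """Reduce phase: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# The input is divided into small fragments of work.
fragments = [["a b"], ["b c"], ["c a b"]]
# Simulated failure schedule: the node running fragment 1 fails once.
failures = {1: 1}

mapped = []
attempts_used = []
for index, fragment in enumerate(fragments):
    pairs, attempt = run_with_retry(map_task, fragment, failures.get(index, 0))
    mapped.extend(pairs)
    attempts_used.append(attempt)

result = reduce_task(mapped)
print(result, attempts_used)
```

The job's result is the same whether or not a fragment had to be re-executed, because each map task depends only on its own fragment. That independence is what lets the framework retry work on any node.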
  • 8. Disadvantages:-
      » It is not well suited to a large number of small files.
      » It is not well suited to OLTP data transfer (example: online banking transactions).
      » Cluster management is hard: in a cluster, operations such as debugging, distributing software, and collecting logs are difficult.
      » http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop
    Reference-links:-
      http://hadoop-sandy.blogspot.in/2012/12/understanding-hadoop-clusters-and.html
  • 9. http://en.wikipedia.org/wiki/Apache_Hadoop
    http://wiki.apache.org/hadoop/