4. Alin Blidisel - Big Data: Beyond MapReduce
SPARK - INTRODUCTION
- was created by Matei Zaharia at UC Berkeley
- is now maintained by the Apache Software Foundation; built to speed up
Hadoop-style computation
- is not a modified version of Hadoop
- in-memory cluster computing
- has its own cluster management, so it does not require Hadoop to run
- designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries and streaming
FEATURES OF APACHE SPARK
- Lightning-Fast Processing (10 to 100 times faster than Hadoop)
- Ease of Use as it supports multiple languages
- Support for Sophisticated Analytics
- Real Time Stream Processing
- Ability to Integrate with Hadoop and Existing Hadoop Data
- Active and Expanding Community (more than 250 developers have contributed to Spark already)
RESILIENT DISTRIBUTED DATASETS (RDDS)
- fault-tolerant collection of elements that can be operated on in parallel (distributed and immutable)
- two ways to create RDDs:
  - parallelizing an existing collection in your driver program
  - referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat
- persistence (MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP)
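The RDD properties listed above (partitioned, immutable, fault-tolerant, optionally persisted) can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's API: `ToyRDD`, `lose_cache`, and the chunking scheme are hypothetical names invented for illustration. In Spark itself, the two creation paths correspond to `SparkContext.parallelize` and `SparkContext.textFile`, and persistence is requested with `RDD.persist`.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, source, lineage=()):
        self._source = source            # zero-argument function returning a list of partitions
        self._lineage = tuple(lineage)   # per-partition transformations, applied in order
        self._persisted = False
        self._cache = None               # filled by the first action after persist()

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        items = tuple(data)              # snapshot: the dataset itself never changes
        size = max(1, -(-len(items) // num_partitions))  # ceiling division

        def source():
            # Split the collection into contiguous chunks, one per partition.
            return [list(items[i:i + size]) for i in range(0, len(items), size)]

        return cls(source)

    def map(self, fn):
        # Transformations are lazy: they return a NEW ToyRDD with a longer lineage.
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def persist(self):
        # Rough analogue of an in-memory storage level: keep results after first use.
        self._persisted = True
        return self

    def lose_cache(self):
        # Simulate losing the cached partitions (for example, a failed node).
        self._cache = None

    def collect(self):
        # Action: this is the point where the lineage is actually evaluated.
        if self._cache is None:
            parts = self._source()
            for step in self._lineage:
                parts = [step(p) for p in parts]   # each partition is processed independently
            if not self._persisted:
                return [x for p in parts for x in p]
            self._cache = parts
        return [x for p in self._cache for x in p]


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()
result = squares_of_evens.collect()   # [4, 16, 36]
```

Because every transformation returns a new object and records how it was derived, lost results can be rebuilt by replaying the lineage from the source, which is the idea behind the "resilient" in the name. In this toy, calling `squares_of_evens.lose_cache()` followed by `collect()` recomputes the same list.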