Apache Spark - An Open Source Cluster Computing System for Fast Data Analytics


Srikrishna k


• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.

• Write applications quickly in Java, Scala, Python, or R.
• Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.
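
To give a feel for how such operators compose, here is a minimal word-count sketch in plain Python. It does not use Spark at all; the steps only mirror the shape of a pipeline built from Spark's flatMap, map, and reduceByKey operators, and the sample lines are made up for illustration.

```python
from collections import Counter
from itertools import chain

# Made-up input; in Spark this would be a distributed dataset.
lines = ["spark runs fast", "spark runs in memory"]

# flatMap-style step: split every line into words and flatten the result.
words = chain.from_iterable(line.split() for line in lines)

# map-style step: pair each word with a count of 1.
pairs = ((word, 1) for word in words)

# reduceByKey-style step: sum the counts for each distinct word.
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In Spark the same three steps would run in parallel across the partitions of a dataset; here they run sequentially on one machine.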

• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

• Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
[Diagram labels: Spark, Spark SQL, Hive, Hadoop, HDFS, HBase]

• Spark uses a different data storage model, resilient distributed datasets (RDDs), which guarantee fault tolerance in a way that minimizes network I/O.
• Spark has become another data processing engine in the Hadoop ecosystem, which benefits businesses and the community alike because it adds capability to the Hadoop stack.
• Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. It does this by reducing the number of reads and writes to disk: intermediate processing data is kept in memory.
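
The two ideas above, keeping intermediate data in memory and rebuilding lost data from a recipe rather than from replicated copies, can be sketched with a toy in plain Python. This is only an illustration under stated assumptions: `CachedDataset`, `load_and_clean`, and the `reads` counter are hypothetical names, and nothing here is Spark's actual implementation.

```python
class CachedDataset:
    """Hypothetical helper: holds a result in memory and can rebuild it."""

    def __init__(self, recipe):
        self._recipe = recipe  # how to recompute the data from its source
        self._cache = None     # in-memory copy of the computed data

    def get(self):
        # Compute only when no in-memory copy exists.
        if self._cache is None:
            self._cache = self._recipe()
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, for example on a node failure."""
        self._cache = None


reads = {"disk": 0}  # counts simulated disk reads


def load_and_clean():
    reads["disk"] += 1  # stands in for an expensive read from disk
    raw = [" spark ", "hadoop", " hive"]
    return [s.strip() for s in raw]


words = CachedDataset(load_and_clean)

# Three passes over the data, as an iterative algorithm might make.
totals = [sum(len(w) for w in words.get()) for _ in range(3)]
reads_after_passes = reads["disk"]

# After a simulated failure the data is rebuilt from its recipe.
words.lose_cache()
recovered = words.get()
```

The three passes trigger a single simulated disk read because later passes reuse the in-memory copy. After the simulated failure, the data is recomputed from the recipe instead of being restored from a replica.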

• Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.

• Iterative algorithms in machine learning
• Interactive data mining and data processing
• Data warehousing: Spark is compatible with Apache Hive and can run Hive queries up to 100x faster than Hive.
• Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process.

• Spark provides an interactive shell, a powerful tool for analyzing data interactively. It is available in either Scala or Python. Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

• An RDD transformation returns a pointer to a new RDD and lets you create dependencies between RDDs. Each RDD in a dependency chain (string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD.
• Spark is lazy: transformations only record what to compute, and nothing is executed until you call an action, which triggers job creation and execution.
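
The dependency chain and lazy evaluation described above can be modeled with a small single-machine toy in plain Python. `ToyRDD` and `traced_double` are hypothetical names used only for illustration. The class is not Spark's RDD, but its `map`, `filter`, and `collect` methods are named after the Spark operations they imitate.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a compute function plus a parent pointer."""

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute  # function that produces this dataset's items
        self.parent = parent     # pointer (dependency) to the parent dataset
        self.name = name

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    def map(self, fn):
        # Transformation: returns a new ToyRDD and runs nothing yet.
        return ToyRDD(lambda: (fn(x) for x in self._compute()), self, "map")

    def filter(self, pred):
        # Transformation: also lazy.
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)), self, "filter")

    def lineage(self):
        # Walk the parent pointers back to the source.
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def collect(self):
        # Action: triggers the actual computation.
        return list(self._compute())


calls = []  # records every time the mapped function actually runs


def traced_double(x):
    calls.append(x)
    return 2 * x


numbers = ToyRDD.from_list([1, 2, 3, 4])
big = numbers.map(traced_double).filter(lambda x: x > 4)

calls_before_action = list(calls)  # still empty: only transformations so far
result = big.collect()             # the action runs the whole chain
```

Before `collect` is called, `traced_double` has not run at all. The action walks back through the chain and computes each step, and `big.lineage()` lists the chain from the newest dataset to the source.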


1. INTRODUCTION Apache Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. Apache Spark is a fast, in-memory data processing engine with expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets.

