SlideShare a Scribd company logo
1 of 10
INTRODUCTION
Apache spark is an open source cluster computing system
that focus data analytics fast and both to run and fast to
write.
Apache Spark is a fast, in-memory data processing engine
with smart and expressive development APIs in Scala, Java,
Python, and R that allow data workers to efficiently execute
machine learning algorithms that require fast iterative access
to datasets .
 Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
 Apache Spark has an advanced DAG execution
engine that supports cyclic data flow and in-
memory computing.
 Write applications quickly in Java, Scala,
Python, R.
 Spark offers over 80 high-level operators that
make it easy to build parallel apps. And you
can use it interactively from the Scala, Python
and R shells
 Compound SQL, streaming, and complex
analytics.
 Spark powers a stack of libraries including SQL
and DataFrames,MLlib for machine
learning, GraphX, and Spark Streaming. You
can combine these libraries seamlessly in the
same application.
 Spark runs on Hadoop, Mesos, standalone, or
in the cloud. It can access diverse data sources
including HDFS, Cassandra, HBase, and S3.
Spark
HDFS,Hbase
Hadoop
Spark SQL
Hive
 Spark uses different data storage model, resilient
distributed datasets (RDD), uses a clever way of
guaranteeing fault tolerance that minimizes
network I/O
 Spark has become another data processing engine
in Hadoop ecosystem and which is good for all
businesses and community as it provides more
capability to Hadoop stack.
 Spark enables applications in Hadoop clusters to
run up to 100x faster in memory, and 10x faster
even when running on disk. Spark makes it
possible by reducing number of read/write to disc.
It stores this intermediate processing data in-
memory.
 Spark SQL is a component on top of Spark
Core that introduces a new data abstraction
called SchemaRDD, which provides support
for structured and semi-structured data.
 Iterative Algorithms in Machine Learning
 Interactive Data Mining and Data Processing
 Spark is a fully Apache Hive-compatible data
warehousing system that can run 100x faster than
Hive.
 Stream processing: Log processing and Fraud
detection in live streams for alerts, aggregates and
analysis
 Sensor data processing: Where data is fetched and
joined from multiple sources, in-memory dataset
really helpful as they are easy and fast to process.
 Spark provides an interactive shell − a
powerful tool to analyze data interactively. It is
available in either Scala or Python language.
Spark’s primary abstraction is a distributed
collection of items called a Resilient Distributed
Dataset (RDD). RDDs can be created from
Hadoop Input Formats (such as HDFS files) or
by transforming other RDDs.
 RDD transformations returns pointer to new RDD
and allows you to create dependencies between
RDDs. Each RDD in dependency chain (String of
Dependencies) has a function for calculating its
data and has a pointer (dependency) to its parent
RDD.
 Spark is lazy, so nothing will be executed unless
you call some transformation or action that will
trigger job creation and execution

More Related Content

What's hot

Mule database-connectors
Mule database-connectorsMule database-connectors
Mule database-connectorsAnand kalla
 
CloudHub networking guide
CloudHub networking guideCloudHub networking guide
CloudHub networking guideShanky Gupta
 
Mule microsoft environment
Mule  microsoft environmentMule  microsoft environment
Mule microsoft environmentcharan teja R
 
Mule MMC Integration with LDAP
Mule MMC Integration with LDAPMule MMC Integration with LDAP
Mule MMC Integration with LDAPSanjeet Pandey
 
Mule Soft ESB - SAP Outbound
Mule Soft ESB - SAP OutboundMule Soft ESB - SAP Outbound
Mule Soft ESB - SAP Outboundakashdprajapati
 
Mule -solutions for data integration
Mule -solutions for data integrationMule -solutions for data integration
Mule -solutions for data integrationD.Rajesh Kumar
 
Best practices for multi saa s integrations
Best practices for multi saa s integrationsBest practices for multi saa s integrations
Best practices for multi saa s integrationsD.Rajesh Kumar
 
Anypoint data gateway
Anypoint data gatewayAnypoint data gateway
Anypoint data gatewayMohammed246
 
Database and Web Integration
Database and Web IntegrationDatabase and Web Integration
Database and Web IntegrationSabahtHussein
 
Developing Oracle Connector Using Mule
Developing Oracle Connector Using MuleDeveloping Oracle Connector Using Mule
Developing Oracle Connector Using Muleirfan1008
 
MuleSoft CloudHub FAQ
MuleSoft CloudHub FAQMuleSoft CloudHub FAQ
MuleSoft CloudHub FAQShanky Gupta
 
Biztalk And Oracle Integration
Biztalk And Oracle IntegrationBiztalk And Oracle Integration
Biztalk And Oracle Integrationkaushiksin
 
SQL Server Integration Services
SQL Server Integration ServicesSQL Server Integration Services
SQL Server Integration ServicesRobert MacLean
 
Oracle connector
Oracle connectorOracle connector
Oracle connectorMohammed246
 
Mule salesforce integration solutions
Mule  salesforce integration solutionsMule  salesforce integration solutions
Mule salesforce integration solutionshimajareddys
 
Mule anypoint connector dev kit
Mule  anypoint connector dev kitMule  anypoint connector dev kit
Mule anypoint connector dev kitD.Rajesh Kumar
 

What's hot (20)

Mule connectors
Mule  connectorsMule  connectors
Mule connectors
 
Mule database-connectors
Mule database-connectorsMule database-connectors
Mule database-connectors
 
CloudHub networking guide
CloudHub networking guideCloudHub networking guide
CloudHub networking guide
 
Mule microsoft environment
Mule  microsoft environmentMule  microsoft environment
Mule microsoft environment
 
Mule MMC Integration with LDAP
Mule MMC Integration with LDAPMule MMC Integration with LDAP
Mule MMC Integration with LDAP
 
Mule Soft ESB - SAP Outbound
Mule Soft ESB - SAP OutboundMule Soft ESB - SAP Outbound
Mule Soft ESB - SAP Outbound
 
Mule microsoft
Mule  microsoftMule  microsoft
Mule microsoft
 
Anypoint data gateway
Anypoint data gatewayAnypoint data gateway
Anypoint data gateway
 
Mule -solutions for data integration
Mule -solutions for data integrationMule -solutions for data integration
Mule -solutions for data integration
 
Best practices for multi saa s integrations
Best practices for multi saa s integrationsBest practices for multi saa s integrations
Best practices for multi saa s integrations
 
Anypoint data gateway
Anypoint data gatewayAnypoint data gateway
Anypoint data gateway
 
Database and Web Integration
Database and Web IntegrationDatabase and Web Integration
Database and Web Integration
 
Developing Oracle Connector Using Mule
Developing Oracle Connector Using MuleDeveloping Oracle Connector Using Mule
Developing Oracle Connector Using Mule
 
MuleSoft CloudHub FAQ
MuleSoft CloudHub FAQMuleSoft CloudHub FAQ
MuleSoft CloudHub FAQ
 
Mule anypoint b2 b
Mule  anypoint b2 bMule  anypoint b2 b
Mule anypoint b2 b
 
Biztalk And Oracle Integration
Biztalk And Oracle IntegrationBiztalk And Oracle Integration
Biztalk And Oracle Integration
 
SQL Server Integration Services
SQL Server Integration ServicesSQL Server Integration Services
SQL Server Integration Services
 
Oracle connector
Oracle connectorOracle connector
Oracle connector
 
Mule salesforce integration solutions
Mule  salesforce integration solutionsMule  salesforce integration solutions
Mule salesforce integration solutions
 
Mule anypoint connector dev kit
Mule  anypoint connector dev kitMule  anypoint connector dev kit
Mule anypoint connector dev kit
 

Similar to Apache spark

Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureJen Stirrup
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...rajeshseo5
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Apache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdf
Apache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdfApache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdf
Apache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdfMounikaPolabathina
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache SparkElvis Saravia
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraAnant Corporation
 

Similar to Apache spark (20)

SparkPaper
SparkPaperSparkPaper
SparkPaper
 
Apache spark
Apache sparkApache spark
Apache spark
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdf
Apache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdfApache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdf
Apache Spark vs. Hadoop Is Spark Set to Replace Hadoop.pdf
 
Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
In15orlesss hadoop
In15orlesss hadoopIn15orlesss hadoop
In15orlesss hadoop
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in Cassandra
 
Hadoop vs Apache Spark
Hadoop vs Apache SparkHadoop vs Apache Spark
Hadoop vs Apache Spark
 

More from Ramakrishna kapa

More from Ramakrishna kapa (20)

Load balancer in mule
Load balancer in muleLoad balancer in mule
Load balancer in mule
 
Anypoint connectors
Anypoint connectorsAnypoint connectors
Anypoint connectors
 
Batch processing
Batch processingBatch processing
Batch processing
 
Msmq connectivity
Msmq connectivityMsmq connectivity
Msmq connectivity
 
Scopes in mule
Scopes in muleScopes in mule
Scopes in mule
 
Data weave more operations
Data weave more operationsData weave more operations
Data weave more operations
 
Basic math operations using dataweave
Basic math operations using dataweaveBasic math operations using dataweave
Basic math operations using dataweave
 
Dataweave types operators
Dataweave types operatorsDataweave types operators
Dataweave types operators
 
Operators in mule dataweave
Operators in mule dataweaveOperators in mule dataweave
Operators in mule dataweave
 
Data weave in mule
Data weave in muleData weave in mule
Data weave in mule
 
Servicenow connector
Servicenow connectorServicenow connector
Servicenow connector
 
Introduction to testing mule
Introduction to testing muleIntroduction to testing mule
Introduction to testing mule
 
Choice flow control
Choice flow controlChoice flow control
Choice flow control
 
Message enricher example
Message enricher exampleMessage enricher example
Message enricher example
 
Mule exception strategies
Mule exception strategiesMule exception strategies
Mule exception strategies
 
Anypoint connector basics
Anypoint connector basicsAnypoint connector basics
Anypoint connector basics
 
Mule global elements
Mule global elementsMule global elements
Mule global elements
 
Mule message structure and varibles scopes
Mule message structure and varibles scopesMule message structure and varibles scopes
Mule message structure and varibles scopes
 
How to create an api in mule
How to create an api in muleHow to create an api in mule
How to create an api in mule
 
Log4j is a reliable, fast and flexible
Log4j is a reliable, fast and flexibleLog4j is a reliable, fast and flexible
Log4j is a reliable, fast and flexible
 

Recently uploaded

Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 

Apache spark

  • 1. INTRODUCTION Apache spark is an open source cluster computing system that focus data analytics fast and both to run and fast to write. Apache Spark is a fast, in-memory data processing engine with smart and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets .
  • 2.  Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.  Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in- memory computing.
  • 3.  Write applications quickly in Java, Scala, Python, R.  Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells
  • 4.  Compound SQL, streaming, and complex analytics.  Spark powers a stack of libraries including SQL and DataFrames,MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • 5.  Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. Spark HDFS,Hbase Hadoop Spark SQL Hive
  • 6.  Spark uses different data storage model, resilient distributed datasets (RDD), uses a clever way of guaranteeing fault tolerance that minimizes network I/O  Spark has become another data processing engine in Hadoop ecosystem and which is good for all businesses and community as it provides more capability to Hadoop stack.  Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes it possible by reducing number of read/write to disc. It stores this intermediate processing data in- memory.
  • 7.  Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
  • 8.  Iterative Algorithms in Machine Learning  Interactive Data Mining and Data Processing  Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.  Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis  Sensor data processing: Where data is fetched and joined from multiple sources, in-memory dataset really helpful as they are easy and fast to process.
  • 9.  Spark provides an interactive shell − a powerful tool to analyze data interactively. It is available in either Scala or Python language. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.
  • 10.  RDD transformations returns pointer to new RDD and allows you to create dependencies between RDDs. Each RDD in dependency chain (String of Dependencies) has a function for calculating its data and has a pointer (dependency) to its parent RDD.  Spark is lazy, so nothing will be executed unless you call some transformation or action that will trigger job creation and execution