Spark is an open-source cluster computing framework that provides high performance for both batch and streaming data processing. It addresses limitations of other distributed processing systems, such as MapReduce, by providing in-memory computing capabilities and supporting a more general programming model. Spark Core provides the basic functionality and serves as the foundation for higher-level modules such as Spark SQL, MLlib, GraphX, and Spark Streaming. RDDs are Spark's basic abstraction for distributed datasets: immutable distributed collections that can be operated on in parallel. Key benefits of Spark include speed through in-memory computing, ease of use through its APIs, and a unified engine supporting multiple workloads.
Spark is an engine for processing big data in a fast (faster than MapReduce), easy, and extremely scalable way: an open-source, parallel, in-memory cluster computing framework, and a solution for loading, processing, and analyzing large-scale data end to end. It supports iterative and interactive workloads in Scala, Java, Python, and R, with a command-line interface.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. This presentation shares some basic knowledge about Apache Spark.
4. Major limitations of available distributed models
1. Difficulty in programming directly in MapReduce
2. No support for in-memory computation in MapReduce
3. MR uses batch processing (does not fit every use case)
4. Flink is not ready for production-level projects
5. Flink primarily works on streaming data
6. Storm is slower than Spark
6. What is Spark?
• Spark is the open standard for flexible in-memory data processing for batch, real-time, and advanced analytics.
• A powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.
• The first high-level programming framework for fast, distributed data processing.
7. Some key points about Spark
• Handles batch, interactive, and real-time processing within a single framework (unlike MR for batch only and Flink primarily for streams)
• Native integration with Java, Python, Scala, and R
• More general: map/reduce is just one of the supported sets of constructs
9. How does Spark Work?
• Spark is often used in tandem with a distributed storage system, to which it writes the data it processes, and a cluster manager, which manages the distribution of the application across the cluster.
• Spark currently supports three kinds of cluster managers (each selected by a master URL, as the sketch below shows):
1. The manager included in Spark, called the Standalone Cluster Manager, which requires Spark to be installed on each node of the cluster
2. Apache Mesos
3. Hadoop YARN
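A hedged sketch of how that choice appears in practice (host names, ports, and the app name are placeholders, not from the deck): the master URL, set on the session builder here or passed to spark-submit via its --master flag, selects the cluster manager.

    import org.apache.spark.sql.SparkSession

    // The master URL selects the cluster manager (all values are placeholders):
    //   "spark://master-host:7077"  -> Standalone Cluster Manager
    //   "mesos://mesos-host:5050"   -> Apache Mesos
    //   "yarn"                      -> Hadoop YARN (cluster read from Hadoop config)
    //   "local[*]"                  -> no cluster manager; run locally
    val spark = SparkSession.builder()
      .appName("ClusterManagerExample")
      .master("spark://master-host:7077")
      .getOrCreate()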
12. Spark Ecosystem
[Diagram: the Spark Core API with language bindings (R, SQL, Python, Scala, Java), and the Spark SQL, Streaming, MLlib, and GraphX libraries built on top of it.]
13. Spark Core
• Spark Core is the main data processing framework in the Spark ecosystem: the underlying general execution engine on top of which all other Spark functionality is built.
• It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.
• In addition to Spark Core, the Spark ecosystem includes a number of other first-party components for more specific data processing tasks, including Spark SQL, Spark MLlib, Spark ML, and GraphX.
• These components share many of the core's generic performance considerations, but some have unique ones, such as Spark SQL's different optimizer.
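A minimal Spark Core sketch, expanding on the session created above (assuming Spark 2.x; the app name and local master are illustrative): distribute a local collection as an RDD and compute over it in parallel.

    import org.apache.spark.sql.SparkSession

    object CoreExample {
      def main(args: Array[String]): Unit = {
        // SparkSession (Spark 2.x) wraps the underlying SparkContext
        val spark = SparkSession.builder()
          .appName("CoreExample")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Distribute a local collection across the cluster as an RDD
        val numbers = sc.parallelize(1 to 1000000)

        // The computation runs in parallel, one task per partition
        val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
        println(s"Sum of squares: $sumOfSquares")

        spark.stop()
      }
    }

The later sketches reuse this spark session and sc context.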
17. Spark SQL
• Spark SQL is a Spark module for structured data processing.
• Provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
• Defines an interface for a semi-structured data type, called DataFrame, and a typed version called Dataset.
• A very important component for Spark performance; almost everything that can be accomplished with Spark Core can be applied to Spark SQL.
• The DataFrame and Dataset interfaces are the future of Spark performance, with more efficient storage options, an advanced optimizer, and direct operations on serialized data.
• Datasets were introduced in Spark 1.6, DataFrames in Spark 1.3, and the SQL engine in Spark 1.0.
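A hedged sketch of the two entry points, reusing spark from the Spark Core sketch (people.json is a placeholder path): the DataFrame API and plain SQL drive the same engine.

    // Read semi-structured data into a DataFrame
    val df = spark.read.json("people.json")
    df.printSchema()

    // Query through the DataFrame API...
    df.filter(df("age") > 21).select("name").show()

    // ...or through SQL against a temporary view
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()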
18. • Spark SQL supports structured queries in batch and streaming modes (the latter as a separate module of Spark SQL called Structured Streaming).
• As of Spark 2.0, Spark SQL is now de facto the primary and most feature-rich interface to Spark's underlying in-memory distributed platform (hiding Spark Core's RDDs behind higher-level abstractions).
19. Spark SQL's different APIs
• The Dataset API (formerly the DataFrame API), with a strongly typed, LINQ-like query DSL that Scala programmers will likely find very appealing to use.
• The Structured Streaming API (aka streaming Datasets) for continuous, incremental execution of structured queries.
• Non-programmers will likely use SQL as their query language, through direct integration with Hive.
• JDBC/ODBC fans can use the JDBC interface (through the Thrift JDBC/ODBC server) and connect their tools to Spark's distributed query engine.
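A minimal sketch of that typed Dataset DSL (the Person case class and sample rows are invented for illustration), reusing spark from above:

    case class Person(name: String, age: Long)

    import spark.implicits._   // brings in encoders for case classes

    // A Dataset is a typed DataFrame: columns map to Person fields
    val ds = Seq(Person("Ada", 36), Person("Grace", 45), Person("Linus", 19)).toDS()

    // Lambdas operate on Person objects instead of untyped rows,
    // so a misspelled field fails at compile time, not at run time
    ds.filter(p => p.age >= 21).map(p => p.name).show()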
22. Machine Learning
• Spark has two machine learning packages, ML and MLlib.
• Spark ML is still in its early stages, but since Spark 1.2 it has provided a higher-level API than MLlib that helps users create practical machine learning pipelines more easily.
• Spark MLlib is built on top of RDDs; ML, on the other hand, is built on top of Spark SQL DataFrames.
• The Spark community plans to move over to ML, deprecating MLlib.
• Spark ML and MLlib have some unique performance considerations, especially when working with large data sizes and caching.
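A small spark.ml pipeline sketch over DataFrames (the toy training rows are invented; spark is reused from above). Each stage transforms the DataFrame produced by the one before it:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Toy labeled data as a DataFrame with id, text, and label columns
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "mapreduce needs far more code", 0.0)
    )).toDF("id", "text", "label")

    // Chain feature extraction and a classifier into one pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)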
23. Spark Streaming
• Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark's ease of use and fault-tolerance characteristics.
• Uses the scheduling of Spark Core for streaming analytics on mini-batches of data.
• Has a number of unique considerations, such as the window sizes used for batches.
• Readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.
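A hedged DStream sketch (the socket source, host, port, and window sizes are placeholders): mini-batches every 10 seconds, with word counts over a 30-second sliding window.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Mini-batches of 10 seconds on top of the existing SparkContext
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // A simple text source; Kafka, Flume, HDFS, etc. plug in similarly
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word counts over a 30-second window, sliding every 10 seconds
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    counts.print()

    ssc.start()
    ssc.awaitTermination()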
24. GraphX
• GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale.
• Comes complete with a library of common algorithms.
• One of the least mature components of Spark.
• Typed graph functionality will start to be introduced on top of the Dataset API in an upcoming version.
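A minimal GraphX sketch (the users, edges, and tolerance value are invented), reusing sc from the Spark Core sketch: build a small property graph and run one of the bundled algorithms, PageRank.

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices carry user names; edges carry a relationship label
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows"), Edge(2L, 1L, "follows")
    ))
    val graph = Graph(users, follows)

    // One of the built-in algorithms: rank vertices by PageRank
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(users).collect().foreach { case (id, (rank, name)) =>
      println(f"$name%-6s $rank%.3f")
    }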
25. Spark Model of Parallel Computing: RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements, partitioned across machines, that can be operated on in parallel.
• Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
• RDDs are distributed datasets that can stay in memory or fall back to disk gracefully.
• RDDs are resilient because they carry their lineage: whenever there is a failure in the system, they can recompute themselves from earlier RDDs using that lineage information.
• RDDs are a representation of lazily evaluated, statically typed distributed collections.
26. • Spark stores data in RDDs on different partitions, which helps with rearranging the computations and optimizing the data processing.
• RDDs are immutable: we can modify an RDD with a transformation, but the transformation returns a new RDD while the original RDD remains the same, as the sketch below shows.
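A tiny sketch of that immutability (the values are arbitrary), reusing sc from above:

    val original = sc.parallelize(1 to 10)
    val evens = original.filter(_ % 2 == 0)   // a new, derived RDD

    println(original.count())   // 10 -- the original RDD is unchanged
    println(evens.count())      // 5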
27. RDD Operations
• An RDD supports two types of operations (see the word-count sketch below):
– Transformation: transformations don't return a single value; they return a new RDD. Nothing gets evaluated when a transformation function is called; it just takes an RDD and returns a new RDD. A few of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
– Action: an action operation evaluates and returns a value. When an action function is called on an RDD object, all the data processing queries are computed at that time and the resulting value is returned. A few of the actions are reduce, collect, count, first, take, countByKey, and foreach.
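The classic word count as a hedged sketch (data.txt is a placeholder path): three transformations build the plan, then two actions execute it.

    val lines = sc.textFile("data.txt")

    // Transformations: each returns a new RDD; nothing runs yet
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Actions: trigger the computation and return values to the driver
    println(counts.count())            // number of distinct words
    counts.take(5).foreach(println)    // first five (word, count) pairs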
28. Lazy Evaluation
• Evaluation of RDDs is completely lazy.
• Spark does not begin computing the partitions until an action is called.
• Actions trigger the scheduler, which builds a directed acyclic graph (the DAG) based on the dependencies between RDD transformations.
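One way to see this (a sketch; the transformations are arbitrary): building the chain returns instantly, the lineage can be inspected without running anything, and only the action launches a job.

    // Building the plan is immediate, even over a large range
    val transformed = sc.parallelize(1 to 1000000)
      .map(n => n * 2)
      .filter(_ % 3 == 0)

    // Inspect the lineage (the dependency DAG) without computing anything
    println(transformed.toDebugString)

    // Only now does the scheduler turn the DAG into stages and run tasks
    println(transformed.count())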
29. PERFORMANCE & USABILITY ADVANTAGES OF LAZY EVALUATION
• Allows Spark to chain together operations that don't require communication with the driver, avoiding multiple passes through the data.
• Because each partition of the data carries the dependency information needed to recalculate it, Spark is fault-tolerant.
• An RDD contains all the dependency information required to recreate each of its partitions.
• In case of failure, when a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.
30. IN-MEMORY STORAGE & MEMORY MANAGEMENT
• Spark has the option of storing the data loaded into memory on the slave nodes, so its performance is very good for iterative computations compared to MapReduce.
• Spark offers three options for memory management (see the persist sketch below):
1. In memory as deserialized Java objects: the fastest storage, but not memory-efficient, as the data must be kept as objects.
2. As serialized data: slower, since serialized data is more CPU-intensive to read, but often more memory-efficient, since it allows the user to choose a more compact representation for the data than Java objects.
3. On disk: obviously slower for repeated computations, but more fault-tolerant for long chains of transformations, and it may be the only feasible option for enormous computations.
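The three options map directly onto Spark's storage levels; a sketch (big-input.txt is a placeholder, and an RDD takes only one level, hence the commented alternatives):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("big-input.txt")

    // 1. In memory as deserialized Java objects (what cache() uses)
    data.persist(StorageLevel.MEMORY_ONLY)

    // 2. In memory as serialized bytes: slower to read, more compact
    // data.persist(StorageLevel.MEMORY_ONLY_SER)

    // 3. On disk: slower to reuse, but workable for huge datasets
    // data.persist(StorageLevel.DISK_ONLY)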
31. IMMUTABILITY AND THE RDD INTERFACE
• Spark has an RDD interface whose properties are shared by RDDs of every type.
• These properties include the dependencies and the data-locality information that the execution engine needs to compute that RDD.
• RDDs can be created in two ways:
(1) by transforming an existing RDD, or
(2) from a SparkContext (by passing a list or reading files).
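A sketch of both creation routes and the interface properties the engine relies on (names and values are arbitrary):

    // (2) From the SparkContext: distribute a local collection
    val base = sc.parallelize(1 to 100, numSlices = 4)

    // (1) By transforming an existing RDD
    val keyed = base.map(n => (n % 10, n)).reduceByKey(_ + _)

    // Interface properties used by the execution engine
    println(keyed.getNumPartitions)   // how the data is partitioned
    println(keyed.dependencies)       // lineage: a ShuffleDependency on base
    println(keyed.partitioner)        // Some(HashPartitioner(...))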
36. What are the benefits of Spark?
• Speed: Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Ease of Use: Spark has easy-to-use APIs for operating on large datasets, including a collection of over 100 operators for transforming data and familiar DataFrame APIs for manipulating semi-structured data. Write applications quickly in Java, Scala, Python, or R.
• A Unified Engine: Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
37. When to use Spark?
• Faster Batch Applications: You can now deploy batch applications that run 10-100x faster in production environments, with the added benefit of easy code maintenance.
• Complex ETL Data Pipelines: You can leverage the complete Spark stack to build complex ETL pipelines that merge streaming, machine learning, and SQL operations all in one program.
• Real-time Operational Analytics: You can leverage MapR-DB/HBase and/or Spark Streaming functionality to build real-time operational dashboards or time-series analytics over data ingested at high speed.
Examples:
• Credit card fraud detection
• Network security
• Genomic sequencing
38. When Not to use Spark?
• Spark was not designed as a multi-user environment. Spark users are required to know whether the memory they have access to is sufficient for a dataset. Adding more users complicates this further, since the users will have to coordinate memory usage to run projects concurrently. Due to this, users will want to consider an alternate engine, such as Apache Hive, for large batch projects.