Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. Some key components of Apache Spark include Resilient Distributed Datasets (RDDs), DataFrames, Datasets, and Spark SQL for structured data processing. Spark also supports streaming, machine learning via MLlib, and graph processing with GraphX.
An engine to process big data in a faster (than MapReduce), easier and extremely scalable way. An open-source, parallel, in-memory, cluster computing framework. A solution for loading, processing and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command-line interface.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, which includes interactive queries and stream processing.
Spark was developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
This document shares some basic knowledge about Apache Spark.
An introduction into Spark ML plus how to go beyond when you get stuck (Data Con LA)
Abstract:-
This talk will introduce Spark's new machine learning framework (Spark ML) and how to train basic models with it. A companion Jupyter notebook will be provided for people to follow along. Once we've got the basics down, we'll look at what to do when we need more than the tools available in Spark ML (and I'll try to convince people to contribute to my latest side project -- Sparkling ML).
Bio:-
Holden Karau is a transgender Canadian, Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo. Outside of computers she enjoys scootering and playing with fire.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
How Apache Spark fits into the Big Data landscape (Paco Nathan)
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for streaming ETL that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address fast data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
End-to-end Data Governance with Apache Avro and Atlas (DataWorks Summit)
Aeolus is Comcast’s new internal Big Data system for providing access to an integrated view of a wide variety of high-quality, near-real-time and batch data. Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There typically are few standards on naming conventions and data representation, and spotty documentation at best. The old rule of thumb often applies: 70% of the analysts’ time goes into data wrangling, while only 30% goes toward the actual analyses and simulations. The goal of the Athene Data Governance Platform within Aeolus is to invert this ratio. This talk will explain how Comcast is using Apache Avro and Atlas for end-to-end data governance, the challenges faced, and methods used to address these challenges.
Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published for community consumption must have an associated avro schema in Atlas. Every step in its journey through Aeolus, in flight or at rest, is captured in Atlas. Atlas’ extensibility has allowed us to add or update various entity types (e.g., avro schemas, kafka topics, object store pseudo-directories) and lineage types (e.g., storing streaming data in object storage; embellishing and re-publishing streaming data; performing aggregations and other transformations on data at rest; and evolution of schemas with compatibility flags). Transformation services notify Atlas of lineage links via custom asynchronous kafka messaging.
Atlas provides self-service data discovery and lineage browsing and querying, via full-text search, DSL query language, or gremlin graph query language. Example queries: “Where is data from kafka topic X stored?” “Display the journey of data currently stored in pseudo-directory X since it entered the Aeolus system”. “Show me all earlier versions of schema S, and whether they are forward/backward compatible with each other.”
Running Non-MapReduce Big Data Applications on Apache Hadoop (hitesh1892)
Apache Hadoop has become popular from its specialization in the execution of MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for various other processing paradigms such as real-time streaming, graph processing and message-passing. That was true until the introduction of Apache Hadoop YARN in Apache Hadoop 2.0. YARN supports running arbitrary processing paradigms on the same Hadoop cluster. This allows for development of newer frameworks as well as more efficient implementations of existing frameworks that can all run on and share the resources of a single multi-tenant YARN cluster. This talk gives a brief introduction to YARN. We will illustrate how to create applications and how to best make use of YARN. We will show examples of different applications such as Apache Tez and Apache Samza that can leverage YARN and present best practices/guidelines on building applications on top of Apache Hadoop YARN.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 (Databricks)
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
This presentation covers the basics of Apache Spark, with details about its machine learning module. At the end, a demo shows the machine learning pipeline with Spark, how to install a standalone cluster on a local machine, and how to deploy an application on the Spark cluster.
Big Data Processing with Apache Spark 2014 (mahchiev)
Apache Spark™ is a fast and general engine for large-scale data processing. It has gained enormous popularity recently with its speed and ease of use and is currently replacing traditional Hadoop MapReduce. We'll talk about:
1. What is Big Data?
2. The Map-Reduce paradigm
3. What does Apache Spark do?
4. Finally, we'll make a quick demo
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily; it supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is a recording of the parts of the screen a user clicks on while web browsing.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talk is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS (Amazon Web Services)
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges.
In this webinar, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures and best practices to quickly create Spark clusters using Amazon Elastic MapReduce (EMR), and ways to use Spark with Amazon Redshift, Amazon DynamoDB, Amazon Kinesis, and other big data applications in the Apache Hadoop ecosystem.
Learning Objectives:
Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing
How to deploy and tune scalable clusters running Spark on Amazon EMR
How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3
Common architectures to leverage Spark with DynamoDB, Redshift, Kinesis, and more
Extending Spark Streaming to Support Complex Event Processing (Oh Chan Kwon)
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms like PageRank often operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, reduces duplicate computation and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce iteration time and iteration count, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Agenda
• What is Apache Spark ?
• Spark Ecosystem
• High Level Architecture
• Key Terminologies
• Spark-submit and Deploy modes
• RDD, DataFrames, Datasets, Spark SQL
.. and a few other concepts
3. Apache Spark – brief history
• Apache Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.
• In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0.
• In February 2014, Spark became a Top-Level Apache Project.
4. What is Apache Spark ?
• Apache Spark is a unified analytics engine for big data processing, with built-in modules for
• Batch & streaming applications
• SQL
• Machine learning
• Graph processing
Essentially, it is an in-memory analytics engine for large-scale data processing in distributed systems.
7. Key Components & Terminologies
Driver
- The process running the main() function of the application and creating the SparkContext
Worker Node
- Any node that can run application code in the cluster
Executor
- A process launched for an application on a worker node, which runs 'Tasks'
- Each application has its own set of executors
Cluster Manager
- An external service for acquiring resources on the cluster (e.g. Standalone manager, YARN, Mesos, Kubernetes)
Spark Context
- Entry gateway to the Spark cluster, created by the Spark Driver
- Allows the Spark application to access the cluster with the help of the Cluster Manager
- Requires a SparkConf to be created
- In 2.x versions, a SparkSession is created, which contains the SparkContext
Spark Conf
- Contains the cluster-level configuration passed on to the Spark Context
- Can also be set at the application level
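Below is a minimal sketch (Spark 2.x, Scala; the app name and master URL are illustrative) of how SparkConf, SparkSession, SparkContext and the driver relate:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object TerminologyDemo {
      // The JVM process running main() is the Driver
      def main(args: Array[String]): Unit = {
        // Application-level configuration, handed to the context/session
        val conf = new SparkConf()
          .setAppName("terminology-demo")   // illustrative name
          .setMaster("local[*]")            // cluster manager URL; local[*] runs in one JVM

        // In Spark 2.x, SparkSession wraps the SparkContext
        val spark = SparkSession.builder().config(conf).getOrCreate()
        val sc = spark.sparkContext         // the driver's gateway to the cluster

        println(s"Running ${sc.appName} on ${sc.master}")
        spark.stop()
      }
    }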
8. Spark deployment – Client vs Cluster mode
Cluster mode :
- The Spark driver runs inside an application master process which is managed by YARN on the cluster
- The client can go away after initiating the application
Client mode :
- The driver runs in the client process, and the application master is only used for requesting resources from YARN
9. spark-submit : the script used to submit a Spark application in client or cluster mode
https://spark.apache.org/docs/latest/submitting-applications.html
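A typical invocation might look like the following (the class name, resource sizes and jar path are all illustrative; the flags are standard spark-submit options):

    # Cluster mode on YARN: the driver runs inside the YARN application master
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-cores 5 \
      --executor-memory 4G \
      myapp.jar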
10. Spark Web UI – used to monitor the status and resource consumption of the Spark cluster
11. Spark API – RDD, DataFrames, Datasets, Spark SQL
Resilient Distributed Datasets (RDD)
- The fundamental data structure of Spark
- An immutable, distributed collection of objects, partitioned across the nodes of the Spark cluster
- Each RDD has multiple partitions; the more partitions, the greater the parallelism
- A low-level API based on transformations and actions
DataFrame
- An immutable distributed collection of objects
- Data is organized into named columns, like a table in a relational database
- Untyped API, i.e. of type Dataset[Row]
Datasets
- Typed API, i.e. Dataset[T]
- Available in Scala & Java
Spark SQL
- Provides the ability to write SQL statements to process structured data
- The DataFrame/Dataset/Spark SQL APIs are optimized and leverage Apache Spark performance optimizations such as the Catalyst optimizer and Tungsten off-heap memory management
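A short REPL-style sketch showing the same data touched through each API (continuing the SparkSession above; the Person case class is illustrative):

    import spark.implicits._
    case class Person(name: String, age: Int)

    // RDD: low-level distributed collection of objects
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 41)))

    // DataFrame: named columns, i.e. Dataset[Row] (untyped)
    val df = rdd.toDF()

    // Dataset: typed API, Dataset[Person] (Scala/Java only)
    val ds = df.as[Person]

    // Spark SQL: register a view and query it with SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 35").show()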
13. RDD Operations – Transformations, Actions
Transformations :
- Apply a function on an RDD to create a new RDD (RDDs are immutable)
- Transformations are lazy in nature
- Spark maintains the record of operations using a DAG
- 'Narrow' transformations do not cause a data shuffle, e.g. map, filter
- 'Wide' transformations cause a data shuffle, e.g. groupByKey()
Actions :
- Execution happens only when an 'Action' is invoked, e.g. count(), saveAsTextFile(), reduce()
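A small sketch of lazy transformations followed by an action (continuing the sc from the sketch above):

    val nums = sc.parallelize(1 to 10)

    // Narrow transformations: no shuffle, and nothing executes yet
    val doubled = nums.map(_ * 2)
    val evens   = doubled.filter(_ % 4 == 0)

    // The action triggers execution of the recorded DAG
    println(evens.count())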
14. Apache Spark – support for SQL windowing functions, joins
• The Spark SQL/DataFrame/Dataset APIs support 3 types of windowing functions:
• Ranking functions: rank, dense_rank, percent_rank, row_number
• Analytic functions: cume_dist, lag, lead
• Aggregate functions: sum, avg, min, max, count
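A sketch of a ranking window function (assumes an illustrative employees DataFrame with dept and salary columns):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val byDept = Window.partitionBy("dept").orderBy(col("salary").desc)

    // Top 3 earners per department
    employees.withColumn("rnk", rank().over(byDept))
      .filter(col("rnk") <= 3)
      .show()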
16. Broadcast Join (or Broadcast hash join)
• Used to optimize join queries when the size of the smaller table is below the property spark.sql.autoBroadcastJoinThreshold
• Similar to a map-side join in Hadoop
• The smaller table is put in memory, and the join avoids sending all data of the larger table across the network
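A sketch of hinting a broadcast join explicitly (events and countries are illustrative DataFrames):

    import org.apache.spark.sql.functions.broadcast

    // Ship the small `countries` table to every executor, so the large
    // `events` table is joined without being shuffled across the network
    val joined = events.join(broadcast(countries), Seq("country_code"))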
17. Data Shuffle in Apache Spark
• What is shuffle ?
• The process of data transfer between stages
• Redistributes data across Spark partitions (aka re-partitioning)
• Data moves across JVM processes, or even across the wire (between executors on different machines)
• Shuffle is expensive and should be avoided where possible
• It involves disk I/O, data serialization and network I/O
18. Data Shuffle in Apache Spark
• Operations that cause shuffle include
• Repartition operations like repartition & coalesce
• ByKey operations like groupByKey, reduceByKey
• Join operations like cogroup, join
• To avoid/reduce shuffle
• Use shared variables (broadcast variables, accumulators)
• Filter input earlier in the program rather than later
• Use reduceByKey or aggregateByKey instead of groupByKey
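A word-count sketch contrasting the two (continuing sc from above): groupByKey shuffles every raw pair, while reduceByKey pre-aggregates within each partition first:

    val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

    // Shuffles every (word, 1) pair, then sums on the reducer side
    val viaGroup  = words.groupByKey().mapValues(_.sum)

    // Combines locally before the shuffle, moving far less data
    val viaReduce = words.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)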
19. Shared variables – broadcast variables
Broadcast variable
• Allows users to keep a 'read-only' variable cached on each worker node, rather than shipping a copy of it with tasks
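A sketch of a broadcast lookup table (the map contents are illustrative):

    // Cached once per executor instead of shipped with every task
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(c => countryNames.value.getOrElse(c, "unknown"))
    named.collect().foreach(println)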
20. Shared variables - accumulators
Accumulators
• Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel.
• They can be used to implement counters (as in MapReduce) or sums.
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
• Accumulators are shipped to worker nodes
• Worker nodes can add to an accumulator, but cannot read its content
• Only the driver program can read the accumulated value
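A sketch using a built-in long accumulator as an error counter:

    val badRecords = sc.longAccumulator("bad-records")

    val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
      try Some(s.toInt)
      catch { case _: NumberFormatException => badRecords.add(1); None }
    }

    parsed.count()              // the action runs the tasks
    println(badRecords.value)   // only the driver reads the total
    // Caveat: updates made inside transformations may be re-applied if tasks are retried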
21. Dynamic Allocation
• Allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
• When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request more executors.
• Set to 'false' by default
• To enable, set the property 'spark.dynamicAllocation.enabled' to true
• Other properties to set :
• spark.dynamicAllocation.initialExecutors (default value: spark.dynamicAllocation.minExecutors)
• spark.dynamicAllocation.maxExecutors
• spark.dynamicAllocation.minExecutors
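A sketch of the relevant settings via SparkConf (the executor counts are illustrative; on YARN, dynamic allocation typically also needs the external shuffle service enabled):

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")   // usually required on YARN
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.initialExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")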
22. Spark Storage levels
• Spark RDDs and DataFrames provide the capability to specify the storage level when we persist an RDD/DataFrame
• Storage levels provide trade-offs between memory usage and CPU efficiency
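A sketch of persisting with an explicit storage level (df is any illustrative DataFrame):

    import org.apache.spark.storage.StorageLevel

    // Keep what fits in memory, spill the remainder to local disk
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()        // the first action materializes the cache
    df.unpersist()    // release it when no longer needed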
23. Spark Streaming
• Spark Streaming
• Uses the DStream API
• Powered by the Spark RDD APIs
• The DStream API divides source data into micro-batches and, after processing, sends them to the destination
• Not 'true' streaming
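A minimal DStream word count (socket source; the host and port are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))   // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()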
24. Structured Streaming
• Released in Spark 2.x
• Leverages the Spark SQL API to process data
• Each row of the data stream is processed and the result is updated into the unbounded result table
• 'True' streaming
• Ability to handle late-arriving data (using watermarks)
• The user can determine the frequency of data processing using triggers
• Write-ahead logs (WAL) are used to identify processed data and ensure end-to-end exactly-once semantics and fault tolerance; WALs are stored in checkpoint locations (e.g. in HDFS)
26. Structured Streaming : Output modes
Data gets appended to the input table at the specified trigger interval.
1. Complete Mode
2. Append Mode
3. Update Mode (available since Spark 2.1.1 – only updated rows are written to the sink)
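The same word count as a Structured Streaming query, with an explicit output mode (socket source again; host/port illustrative):

    import org.apache.spark.sql.functions._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines
      .select(explode(split(col("value"), " ")).as("word"))
      .groupBy("word").count()

    val query = counts.writeStream
      .outputMode("complete")   // complete | append | update
      .format("console")
      .start()

    query.awaitTermination()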
28. Watermarking in Structured Streaming is a way to limit state in all stateful streaming operations by specifying how much late data to consider.
Watermark set as (max event time - '10 mins')
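A sketch of a 10-minute watermark on a windowed aggregation (assumes an illustrative streaming DataFrame events with an eventTime timestamp column):

    import org.apache.spark.sql.functions._

    val windowed = events
      .withWatermark("eventTime", "10 minutes")   // state for older data can be dropped
      .groupBy(window(col("eventTime"), "5 minutes"), col("userId"))
      .count()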
29. Machine Learning using Apache Spark
• MLlib - Spark's machine learning library
• The DataFrame-based API is the primary API for ML on Apache Spark
• Provides tools for
• ML algorithms: common algorithms such as classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
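A compact Pipeline sketch (the feature columns and toy data are illustrative; assumes spark.implicits._ is in scope):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    val training = Seq((0.5, 1.0, 1.0), (2.1, 0.0, 0.0), (0.3, 1.2, 1.0))
      .toDF("f1", "f2", "label")

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // Featurization and model fitting chained as one reusable pipeline
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    model.transform(training).select("label", "prediction").show()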
30. Graph processing using Apache Spark
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Key features
• Seamlessly work with both graphs and collections.
• Comparable performance to the fastest specialized graph processing
systems.
• Libraries available include
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
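A PageRank sketch over a tiny hand-built graph (continuing sc; the vertex and edge data are illustrative):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

    val graph = Graph(vertices, edges)
    // Iterate until the ranks converge within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach(println)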
31. Delta Lake
• Delta Lake is an open source project with the Linux Foundation.
• Key features :
• Provides ACID transaction functionality in data lakes
• Delta Lake provides DML APIs to merge, update and delete datasets.
• Schema enforcement
• Time Travel (snapshots/versioning)
• Schema Evolution
• Audit History
• 100% compatible with Apache Spark APIs
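A sketch of writing a Delta table and reading an earlier version back via time travel (the path is illustrative; assumes the delta-core library is on the classpath):

    val path = "/tmp/events_delta"

    df.write.format("delta").mode("overwrite").save(path)

    // Time travel: read the table as of version 0
    val v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)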
33. Catalyst Optimizer
Apache Spark 2.x leverages the Catalyst optimizer to optimize the Spark execution engine. Catalyst leverages advanced programming-language features (e.g. Scala's pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
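To see Catalyst's work on a query, explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan (employees is the illustrative DataFrame from earlier):

    // Inspect how Catalyst rewrites a filter + aggregate
    employees.filter(col("salary") > 50000)
      .groupBy("dept").count()
      .explain(true)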
34. Project Tungsten
Apache Spark 2.x leverages Tungsten execution to optimize the Spark execution engine. Tungsten's goal is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O, which are considered fast enough).
Optimization features include
- Off-heap memory management using a binary in-memory data representation (the Tungsten row format) and managing memory explicitly
- Cache locality, i.e. cache-aware computations with cache-aware layouts for high cache hit rates
- Whole-stage code generation (aka CodeGen)
35. How to determine the number of executors, cores, and memory for a Spark application?
• With Spark on YARN, there are daemons that run in the background, e.g. the NameNode, Secondary NameNode, DataNode, and the YARN NodeManager/ResourceManager.
• While specifying num-executors, we need to leave aside enough cores (~1 core per node) for these daemons to run smoothly.
• We also need to budget in the resources the YARN ApplicationMaster would need (~1 executor, 1024 MB memory).
• HDFS throughput is maximized with ~5 cores/executor.
• Full memory requested from YARN per executor = spark-executor-memory + memoryOverhead (i.e. ~1.07 * spark-executor-memory, with the default 7% overhead).
36. Tiny Executors (1 core per executor)
• Cluster config -> 10-node cluster, 16 cores/node, 64 GB RAM/node
Configuration Options
• Tiny executors, i.e. 1 core per executor
• --num-executors = 16 * 10 = 160 executors (i.e. 16 executors/node)
• --executor-cores (cores/executor) = 1
• --executor-memory = 64 GB/16 = 4 GB/executor
• Analysis :
• Unable to take advantage of parallelism within a JVM (i.e. not running multiple tasks per executor)
• Shared/cached variables such as broadcast variables and accumulators are replicated in every executor, i.e. 16 times per node
• We are not leaving memory overhead for Hadoop/YARN daemon processes, and we are not counting in the ApplicationManager
• Not good
37. Fat Executors (1 executor per node)
• Cluster config -> 10-node cluster, 16 cores/node, 64 GB RAM/node
Configuration Options
• Fat executors, i.e. 1 executor per node
• --num-executors = 1 * 10 = 10 executors (i.e. 1 executor/node)
• --executor-cores (cores/executor) = 16
• --executor-memory = 64 GB/1 = 64 GB/executor
• Analysis :
• With all 16 cores claimed per executor, nothing is left over for the AM and daemon processes
• HDFS throughput will suffer, and such large heaps result in massive garbage collection pauses
• Not good
38. Balance between Fat and Tiny Executors
• Cluster config -> 10-node cluster, 16 cores/node, 64 GB RAM/node
Configuration Options
• --executor-cores (cores/executor) = 5 (recommended for max HDFS throughput)
• Leave 1 core per node for Hadoop/YARN daemons: cores available per node = 16 - 1 = 15
• Total available cores = 15 * 10 = 150
• Number of available executors (total cores / cores per executor) = 150/5 = 30
• Leaving 1 executor for the YARN AM -> --num-executors = 29
• Number of executors per node = 30/10 = 3
• Memory per executor = 64 GB/3 ≈ 21 GB
• Subtracting off-heap overhead of 7%: 0.07 * 21 GB ≈ 1.5 GB, so actual --executor-memory = 21 - 1.5 ≈ 19 GB
• Analysis :
• Recommended -> 29 executors, 19 GB memory, and 5 cores each
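Expressed as a spark-submit invocation matching the worked example above (the class name and jar path are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 29 \
      --executor-cores 5 \
      --executor-memory 19G \
      --class com.example.MyApp \
      myapp.jar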