Spark optimization

This document provides recommendations for optimizing Spark jobs. It suggests reducing I/O by running the Spark cluster on the same machines as the data. It recommends avoiding functions that collect data to the driver to reduce memory I/O. It also suggests using caching to avoid read I/O. The document discusses configuring resources like memory and cores and tuning configurations like backpressure to improve performance of Spark streaming jobs. Finally, it recommends using efficient serialization formats like Kryo, Avro and Parquet.

Reduce I/O
 If you run Spark Streaming or Spark SQL (Hive etc.) and your data resides on a
distributed platform such as HDFS or S3, it is recommended to set up your Spark
cluster on the same machines as the data wherever possible. This preserves data
locality and avoids network I/O.
 To limit memory overhead, avoid functions such as collectAsList() that pull the
whole dataset to the driver, because the data is sent to the driver and then has to
be redistributed for any further processing.
 To avoid repeated read I/O, use Spark's caching options, cache() or persist(), as
per your requirement.

Kafka Parallel Read
 Kafka lets you split a topic into partitions and bind consumers into a single
consumer group, and Spark can use this to read data in parallel. A single input
DStream consumes the topic's partitions from one machine, so to consume a Kafka
topic in parallel from multiple machines you have to instantiate multiple
DStreams.
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
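
To see why the number of DStreams matters, the sketch below spreads a topic's partitions over a given number of input streams. It is only an illustrative model: the function name and the round-robin rule are assumptions made here, and Kafka's actual consumer-group assignment may distribute partitions differently.

```python
from typing import Dict, List


def spread_partitions(num_partitions: int, num_streams: int) -> Dict[int, List[int]]:
    """Assign partition ids 0..num_partitions-1 to streams round-robin.

    Hypothetical helper for illustration; not part of Spark or Kafka.
    """
    if num_partitions < 0:
        raise ValueError("num_partitions must be non-negative")
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")

    # Every stream gets an entry, even if it ends up with no partitions.
    assignment: Dict[int, List[int]] = {stream: [] for stream in range(num_streams)}
    for partition in range(num_partitions):
        assignment[partition % num_streams].append(partition)
    return assignment


if __name__ == "__main__":
    # One stream reads all six partitions; three streams read two each.
    for streams in (1, 3):
        print(streams, spread_partitions(6, streams))
```

In this model, a single stream carries every partition on its own, while three streams carry two partitions each. Streams beyond the partition count receive nothing, so creating more DStreams than partitions would not add parallelism.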

Resource Allocation
 In Spark, the executor heap size is controlled through --executor-memory (on the
command line) or the spark.executor.memory property, so tune it to your heap
requirement.
 The executor cores setting controls the number of concurrent tasks in an executor:
with --executor-cores 3, each executor can run at most 3 tasks at the same time.
 The --num-executors command-line flag or the spark.executor.instances
configuration property controls the number of executors requested.
 Driver memory and cores are set through the --driver-memory and --driver-cores
flags.
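
As a minimal sketch of how these flags fit together, the helper below assembles a spark-submit argument list from the resource settings named above. The function names, the application file name and the values (4g, 3 cores, 10 executors) are hypothetical, and the task estimate assumes each task uses one core, which is Spark's default.

```python
import shlex
from typing import List, Optional


def spark_submit_args(
    app: str,
    executor_memory: str,
    executor_cores: int,
    num_executors: int,
    driver_memory: Optional[str] = None,
    driver_cores: Optional[int] = None,
) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper)."""
    args = [
        "spark-submit",
        "--executor-memory", executor_memory,      # executor heap size
        "--executor-cores", str(executor_cores),   # concurrent tasks per executor
        "--num-executors", str(num_executors),     # executors requested
    ]
    if driver_memory is not None:
        args += ["--driver-memory", driver_memory]
    if driver_cores is not None:
        args += ["--driver-cores", str(driver_cores)]
    args.append(app)
    return args


def max_concurrent_tasks(executor_cores: int, num_executors: int) -> int:
    """Upper bound on tasks running at once, assuming one core per task."""
    return executor_cores * num_executors


if __name__ == "__main__":
    args = spark_submit_args("my_streaming_job.py", "4g", 3, 10,
                             driver_memory="2g", driver_cores=1)
    print(" ".join(shlex.quote(a) for a in args))
    print(max_concurrent_tasks(3, 10))
```

With 3 cores per executor and 10 executors, at most 30 tasks can run at once under that assumption, which is a useful number to compare against the partition count of the job.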

Configuration Changes
 In Spark Streaming, the producer may generate data faster than the consumer can
process it, which causes memory overhead. Spark provides a setting called
backpressure to handle this: with spark.streaming.backpressure.enabled=true we can
avoid this situation.
 A streaming application is a long-running process, so frequent garbage collection
pauses occur and we want to minimize them. Passing
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties"
to spark-submit does that by switching the executors to the CMS collector (the
-Dlog4j.configuration part points the executors at a custom log4j file).
 A long-running application also generates very large log files; using
RollingFileAppender we can limit their size. We can also turn off the console
progress output by setting spark.ui.showConsoleProgress to false.
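
The settings above can be collected in one place and rendered as --conf flags. The following is a sketch under stated assumptions: the helper names are hypothetical, and the property values are the ones quoted in the bullets. Quoting matters because the extraJavaOptions value contains a space.

```python
import shlex
from typing import Dict, List

# Property names and values taken from the bullets above.
STREAMING_CONF: Dict[str, str] = {
    "spark.streaming.backpressure.enabled": "true",
    "spark.executor.extraJavaOptions":
        "-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties",
    "spark.ui.showConsoleProgress": "false",
}


def conf_flags(conf: Dict[str, str]) -> List[str]:
    """Turn a property dict into a flat ['--conf', 'key=value', ...] list."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


def shell_line(args: List[str]) -> str:
    """Join arguments into one shell-safe string, quoting where needed."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(shell_line(["spark-submit"] + conf_flags(STREAMING_CONF)))
```

Passing the list form directly to a process launcher avoids shell quoting altogether; the quoted string is only needed when the command is pasted into a terminal or script.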

Serialization and Data Format
 In general, Spark uses the deserialized representation for records in memory and
the serialized representation for records stored on disk or being transferred over
the network. The spark.serializer property controls the serializer that’s used to
convert between these two representations. The Kryo serializer,
org.apache.spark.serializer.KryoSerializer, is the preferred option.
 To avoid serialization issues, it is recommended to use formats such as Avro,
Parquet or Thrift.
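
A minimal sketch of enabling Kryo through configuration is shown below. It renders the settings as spark-defaults.conf style lines (key and value separated by whitespace). The helper names and the com.example class names are hypothetical; spark.kryo.classesToRegister and spark.kryo.registrationRequired are optional Spark properties that are not mentioned in the slides.

```python
from typing import Dict, Iterable, List

KRYO_SERIALIZER = "org.apache.spark.serializer.KryoSerializer"


def kryo_conf(classes: Iterable[str] = (),
              registration_required: bool = False) -> Dict[str, str]:
    """Build Spark properties that switch serialization to Kryo."""
    conf = {"spark.serializer": KRYO_SERIALIZER}
    class_list = list(classes)
    if class_list:
        # Comma-separated class names to register with Kryo.
        conf["spark.kryo.classesToRegister"] = ",".join(class_list)
    if registration_required:
        # Fail fast when an unregistered class is serialized.
        conf["spark.kryo.registrationRequired"] = "true"
    return conf


def to_defaults_lines(conf: Dict[str, str]) -> List[str]:
    """Render properties as 'key value' lines for spark-defaults.conf."""
    return [f"{key} {value}" for key, value in conf.items()]


if __name__ == "__main__":
    # The class names here are placeholders for your own record types.
    conf = kryo_conf(["com.example.Click", "com.example.Impression"])
    print("\n".join(to_defaults_lines(conf)))
```

Registering classes is optional, but without it Kryo has to write the full class name alongside each object, which makes the serialized output larger.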


Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...

Recently uploaded

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill

LizaNolte

HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable. In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed: Key Takeaways: Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement. Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers. Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.

ScyllaDB Tablets: Rethinking Replication

ScyllaDB

ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.

Y-Combinator seed pitch deck template PP

c5vrf27qcz

QA or the Highway - Component Testing: Bridging the gap between frontend appl...

zjhamm304

QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...

AlexanderRichford

QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes. Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions. This is achieved through: Machine Learning Model: Predicts the likelihood of a URL being malicious. Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format. This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒 This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!

Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck

FilipTomaszewski5

Christine's Product Research Presentation.pptx

christinelarrosa

Session 1 - Intro to Robotic Process Automation.pdf

UiPathCommunity

👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program: https://bit.ly/Automation_Student_Kickstart In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC. 📕 Detailed agenda: What is RPA? Benefits of RPA? RPA Applications The UiPath End-to-End Automation Platform UiPath Studio CE Installation and Setup 💻 Extra training through UiPath Academy: Introduction to Automation UiPath Business Automation Platform Explore automation development with UiPath Studio 👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/

Principle of conventional tomography-Bibash Shahi ppt..pptx

BibashShahi

Demystifying Knowledge Management through Storytelling

Enterprise Knowledge

The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event. The objectives of the Lunch and Learn presentation were to: - Review what KM ‘is’ and ‘isn’t’ - Understand the value of KM and the benefits of engaging - Define and reflect on your “what’s in it for me?” - Share actionable ways you can participate in Knowledge - - Capture & Transfer

"$10 thousand per minute of downtime: architecture, queues, streaming and fin...

Fwdays

Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless. As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency. We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.

Christine's Supplier Sourcing Presentaion.pptx

christinelarrosa

"Choosing proper type of scaling", Olena Syrota

Fwdays

Harnessing the Power of NLP and Knowledge Graphs for Opioid Research

Neo4j

Astute Business Solutions | Oracle Cloud Partner |

AstuteBusiness

Must Know Postgres Extension for DBA and Developer during Migration

Mydbops

Mydbops Opensource Database Meetup 16 Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting Date & Time: 8th June | 10 AM - 1 PM IST Venue: Bangalore International Centre, Bangalore Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle. Key Takeaways: * Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities. * Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom. * Discover how these key extensions can empower both developers and DBAs during the migration process. * Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends. Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL. Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability. Contact us: info@mydbops.com Visit: https://www.mydbops.com/ Follow us on LinkedIn: https://in.linkedin.com/company/mydbops For more details and updates, please follow up the below links. Meetup Page : https://www.meetup.com/mydbops-databa... Twitter: https://twitter.com/mydbopsofficial Blogs: https://www.mydbops.com/blog/ Facebook(Meta): https://www.facebook.com/mydbops/

Mutation Testing for Task-Oriented Chatbots

Pablo Gómez Abajo

Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots. To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.

Containers & AI - Beauty and the Beast!?!

Tobias Schneck

As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other? Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real. Keywords: AI, Containeres, Kubernetes, Cloud Native Event Link: https://meine.doag.org/events/cloudland/2024/agenda/#agendaId.4211

AI in the Workplace Reskilling, Upskilling, and Future Work.pptx

Sunil Jagani

Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors

DianaGray10

Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more. The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications. We’ll discuss and demo the benefits of UiPath Apps and connectors including: Creating a compelling user experience for any software, without the limitations of APIs. Accelerating the app creation process, saving time and effort Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management. Speakers: Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP Charlie Greenberg, host

Recently uploaded (20)

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill

ScyllaDB Tablets: Rethinking Replication

Y-Combinator seed pitch deck template PP

QA or the Highway - Component Testing: Bridging the gap between frontend appl...

QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...

Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck

Christine's Product Research Presentation.pptx

Session 1 - Intro to Robotic Process Automation.pdf

Principle of conventional tomography-Bibash Shahi ppt..pptx

Demystifying Knowledge Management through Storytelling

"$10 thousand per minute of downtime: architecture, queues, streaming and fin...

Christine's Supplier Sourcing Presentaion.pptx

"Choosing proper type of scaling", Olena Syrota

Harnessing the Power of NLP and Knowledge Graphs for Opioid Research

Astute Business Solutions | Oracle Cloud Partner |

Must Know Postgres Extension for DBA and Developer during Migration

Mutation Testing for Task-Oriented Chatbots

Containers & AI - Beauty and the Beast!?!

AI in the Workplace Reskilling, Upskilling, and Future Work.pptx

Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors

Spark optimization

1. Spark Optimization

2. Reduce I/O
• If you run Spark Streaming or Spark SQL (Hive, etc.), or your data resides on a distributed platform such as HDFS on Hadoop or S3, set up your Spark cluster on the same machines as the data where possible, to avoid network I/O and gain data locality.
• To limit memory overhead, avoid collectAsList()-style functions: they send the whole dataset to the driver, and it must then be redistributed for any further distributed processing.
• To avoid repeated read I/O, use Spark's caching options, cache() or persist(), as your requirements dictate.
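
The caching and collect advice can be sketched in PySpark as follows. This is a minimal illustration, not the author's code: the function name `preview_and_count` and the path in the usage note are hypothetical, and the Python API uses `collect()` where the Java API has `collectAsList()`. The function relies only on the DataFrame methods `filter`, `cache`, `count`, `take` and `unpersist`.

```python
def preview_and_count(df, condition, n=5):
    """Return (row_count, first_n_rows) for the rows of df matching condition.

    df is assumed to be a pyspark.sql.DataFrame; condition is a Column
    expression or SQL string accepted by DataFrame.filter.
    """
    # Mark the filtered data for reuse; nothing is computed yet (lazy).
    filtered = df.filter(condition).cache()
    # First action: computes the result and fills the cache.
    total = filtered.count()
    # Second action: reuses the cached data, and only n rows reach the driver
    # (unlike collect(), which would pull every row to the driver).
    sample = filtered.take(n)
    # Release the cached blocks once they are no longer needed.
    filtered.unpersist()
    return total, sample
```

With a real session this might be called as `preview_and_count(spark.read.parquet("hdfs:///path/to/events"), "status = 'ERROR'")`, where the path and column name are placeholders.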

3. Kafka Parallel Read
• Kafka lets you split a topic into partitions and bind consumers into a single consumer group, and Spark can use this to read data in parallel. A single input DStream, however, reads its partitions from one machine, one after another, so to consume a Kafka topic in parallel from multiple machines you have to instantiate multiple DStreams.
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

4. Resource Allocation
• Executor heap size is controlled by --executor-memory (on the command line) or the spark.executor.memory property; tune it to your heap requirements.
• The executor cores setting controls the number of concurrent tasks per executor: with --executor-cores 3, each executor can run at most 3 tasks at the same time.
• The --num-executors command-line flag or the spark.executor.instances configuration property controls the number of executors requested.
• Driver memory and cores are set through --driver-memory and --driver-cores.
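
As a rough sketch of how these flags fit together, the helper below assembles a spark-submit argument list and works out the resulting task parallelism. The function names, default values and the application file `my_job.py` are illustrative assumptions, and the example assumes a cluster manager that honors --num-executors (such as YARN) and the default of one core per task.

```python
def spark_submit_args(app, executor_memory="4g", executor_cores=3,
                      num_executors=4, driver_memory="2g", driver_cores=1):
    """Build a spark-submit argument list from the resource settings."""
    return [
        "spark-submit",
        "--executor-memory", executor_memory,    # heap size per executor
        "--executor-cores", str(executor_cores), # concurrent tasks per executor
        "--num-executors", str(num_executors),   # executors requested
        "--driver-memory", driver_memory,
        "--driver-cores", str(driver_cores),
        app,
    ]


def max_concurrent_tasks(num_executors, executor_cores):
    """Upper bound on tasks running at once, assuming one core per task."""
    return num_executors * executor_cores


args = spark_submit_args("my_job.py")  # hypothetical application file
print(" ".join(args))
print("max concurrent tasks:", max_concurrent_tasks(4, 3))
```

The list form can be handed directly to `subprocess.run(args)`, which avoids shell quoting problems.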

5. Configuration Changes
• In Spark Streaming, the producer may generate data faster than the consumer can process it, which builds up memory overhead. Spark's backpressure setting handles this: set spark.streaming.backpressure.enabled=true to avoid the situation.
• A streaming application is a long-running process, so garbage-collection pauses are frequent and should be minimized. Passing --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" does this by selecting the concurrent mark-sweep collector (the -Dlog4j.configuration part points the executors at a log4j properties file).
• Long-running applications also generate very large log files; a RollingFileAppender limits their size. The console progress output can be turned off by setting spark.ui.showConsoleProgress to false.
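
The settings on this slide could be collected into `--conf` flags along these lines. This is only a sketch: the function name `streaming_conf_flags` is hypothetical, the property names and values are the ones quoted on the slide, and the RollingFileAppender itself would be configured in the log4j properties file rather than here.

```python
def streaming_conf_flags(log4j_file="log4j-eir.properties"):
    """Return spark-submit --conf flags for a long-running streaming job."""
    conf = {
        # Let Spark throttle ingestion when processing falls behind.
        "spark.streaming.backpressure.enabled": "true",
        # Concurrent mark-sweep GC plus the executors' log4j configuration.
        "spark.executor.extraJavaOptions":
            f"-XX:+UseConcMarkSweepGC -Dlog4j.configuration={log4j_file}",
        # Turn off the console progress output.
        "spark.ui.showConsoleProgress": "false",
    }
    flags = []
    for key, value in conf.items():
        # Each setting becomes two argv entries: "--conf" and "key=value".
        flags += ["--conf", f"{key}={value}"]
    return flags


for flag in streaming_conf_flags():
    print(flag)
```

Because each `key=value` pair is a single list element, the space inside the Java options needs no extra quoting when the list is passed to a process launcher as an argument vector.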

6. Serialization and Data Format
• In general, Spark uses the deserialized representation for records in memory and the serialized representation for records stored on disk or transferred over the network. The spark.serializer property controls the serializer used to convert between these two representations. The Kryo serializer, org.apache.spark.serializer.KryoSerializer, is the preferred option.
• To avoid serialization issues, use formats such as Avro, Parquet, or Thrift.
