Spark is an open-source cluster computing framework, developed in 2009 at UC Berkeley and open-sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations, like map and filter, that return new RDDs, and actions that return values to the driver program.
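As a minimal sketch of this model (assuming a Spark shell with sc already available and a README.md in the working directory):

scala> val lines = sc.textFile("README.md")          // base RDD, nothing read yet
scala> val longLines = lines.filter(_.length > 20)   // transformation: returns a new RDD, still lazy
scala> val lengths = longLines.map(_.length)         // transformation: chained, still lazy
scala> lengths.count()                               // action: runs the job, returns a value to the driver
res0: Long = 42                                      // illustrative output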
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL that changed how developers write stream processing applications. Structured Streaming lets users express their computations the same way they would express a batch query on static data, using high-level APIs including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is as easy as writing a standard SQL query. Combined with Spark SQL's existing connectivity, this makes it easy to analyze data in one unified framework, whether it comes from messy, unstructured files, from a structured, columnar historical data warehouse, or in real time from Kafka or Kinesis.
In this session, Das walks through a concrete example where, in less than 10 lines, you read from Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He uses techniques including event-time-based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
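A sketch of what such a pipeline can look like with the Structured Streaming API (the broker address, topic, schema, and paths here are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-etl").getOrCreate()
import spark.implicits._

// Assumed schema of the JSON payload carried in the Kafka value column
val schema = new StructType()
  .add("device", StringType).add("ts", TimestampType).add("value", DoubleType)

val devices = spark.read.parquet("/data/devices")                 // static data for enrichment
val events = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")                    // raw Kafka bytes to string
  .select(from_json($"json", schema).as("data")).select("data.*") // parse JSON into columns
  .join(devices, "device")                                        // enrich with static data

events.writeStream.format("parquet")                              // table ready for ad-hoc queries
  .option("checkpointLocation", "/chk/events")
  .start("/tables/events")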
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
ApacheCon NA 2015 Spark / Solr Integration (thelabdude)
Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark Streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep-paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLlib to enrich documents before indexing in Solr, covering sentiment analysis (logistic regression), language detection, topic modeling (LDA), and document classification.
This presentation shows the main Spark concepts: RDDs, transformations, and actions.
I used this presentation for many Spark intro workshops for the Cluj-Napoca Big Data community: http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk first shows how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how this abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib, and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Analytics (thelabdude)
My presentation focuses on how we implemented Solr 4 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 620M documents and is growing by 3 to 4 million documents per day. My presentation will include details about: 1) Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper, 2) Operations concerns like how to handle a failed node and monitoring, 3) How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput, 4) Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4 is scalable, stable, and is production ready.
An over-ambitious introduction to Spark programming, testing, and deployment. This deck tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to prevent the transparent background color from rendering properly. This has been fixed in a recent upload.
An introduction to Apache Spark, with an emphasis on the RDD API, Spark SQL (the DataFrame and Dataset APIs), and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Webinar: Solr & Spark for Real Time Big Data Analytics (Lucidworks)
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
Faster Data Analytics with Apache Spark using Apache Solr (Chitturi Kiran)
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Spark SQL allows users to execute relational queries in Spark with distributed in-memory computations. Though Spark gives us fast in-memory computations, Solr is blazing fast for some analytic queries. In this talk, we will take a deep dive into how to optimize SQL queries from Spark to Solr by plugging into the Spark LogicalPlanner using pushdown strategies. The key takeaways from the talk will be:
How to perform Spark SQL queries with Apache Solr?
What happens inside a Spark SQL query?
How to plug into Spark Logical Planner?
What type of push-down strategies are optimal with Solr?
Examples of push-down strategies
Presented at Lucene Revolution - http://sched.co/BAwV
If you’re already a SQL user then working with Hadoop may be a little easier than you think, thanks to Apache Hive. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
This cheat sheet covers:
-- Query
-- Metadata
-- SQL Compatibility
-- Command Line
-- Hive Shell
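For instance, the same kind of HiveQL query can also be issued from a Spark shell; here is a sketch using Spark 1.x's HiveContext, mirroring the kv1.txt sample table from the Spark documentation:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // reuses the shell's SparkContext
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
hiveContext.sql("SELECT key, value FROM src WHERE key < 10").show()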
This slide deck is used as an introduction to the internals of Apache Spark, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Knoldus organized a Meetup on 1 April 2015. In this Meetup, we introduced Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. Spark is used at a wide range of organizations to process large datasets.
DataSource V2 and Cassandra – A Whole New World (Databricks)
Data Source V2 has arrived for the Spark Cassandra Connector, but what does this mean for you? Speed, flexibility, and usability improvements abound, and we'll walk you through some of the biggest highlights and how you can take advantage of them today.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 for building better, scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features, and describes how Spark physically executes its jobs.
Spark real world use cases and optimizations (Gal Marder)
Using Spark for big data has become the standard in the industry. The internet is full of "hello world" examples, but when your Spark job meets production, all hell breaks loose. We will cover real-world use cases: how they were designed, why they didn't work, and how we made them run fast.
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It can outperform Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications.
These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.
Author: Stefan Papp, Data Architect at "The unbelievable Machine Company". An overview of big data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
3. Spark’s Goal
Support batch, streaming, and interactive computations
in a unified framework
http://strataconf.com/stratany2013/public/schedule/detail/30959
5. To validate our hypothesis
that specialized frameworks provide value over general ones,
we have also built a new framework
on top of Mesos called Spark,
optimized for iterative jobs
where a dataset is reused in many parallel operations,
and shown that Spark can outperform Hadoop by 10x
in iterative machine learning workloads.
6. http://spark.apache.org/docs/latest/cluster-overview.html
• Cluster Manager
• external service for acquiring resources on the cluster
• Standalone, Mesos, YARN
• Worker node
• Any node that can run application code in the cluster
• Application
• driver program + executors
• SparkContext
• application session
• connection to a cluster
7. http://spark.apache.org/docs/latest/cluster-overview.html
• Driver Program
• Process running the main()
• create SparkContext
• Executor
• process launched on a worker node
• Each application has its own executors
• Long-running and runs many small tasks
• keeps data in memory/disk storage
• Task
• Unit of work that will be sent to one executor
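To make these roles concrete, a minimal driver-program sketch (the app name is arbitrary; the master would be supplied via spark-submit): main() runs on the driver and creates the SparkContext, while the closures passed to RDD operations run as tasks inside executors on the worker nodes.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("my-app")) // driver side
    // The function below is serialized and executed by executors as tasks
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"total = $total") // the action's result comes back to the driver
    sc.stop()
  }
}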
12. Shell in Local Mode
REPL(Read-Eval-Print Loop) = Interactive Shell
import org.apache.spark.{SparkContext, SparkConf}
val sc = new SparkContext("local[*]", "Spark shell", new SparkConf())
scala>
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@36c783ca
The shell automatically creates a SparkContext (sc) as above and then shows the prompt.
• Pros
• Local testing/debugging and unit tests, which are hard in distributed processing
• Lazy evaluation: syntax can be validated before any data is processed
• Cons
• Memory limits
• Differences in jar loading (shell vs. submit mode on YARN/Mesos)
13. $ cd ../
$ tar zxvf spark/spark-1.5.1-bin-spark-hadoop2.4.tgz
$ cd spark-1.5.1-bin-spark-hadoop2.4/
$ bin/spark-shell --master local[*]
Run Spark Shell
14. $ cat init.script
import java.lang.Runtime
println(s"cores = ${Runtime.getRuntime.availableProcessors}")
$ SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell --master local[*] -i init.script
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/
bin/java -cp /Users/taewook/spark-1.5.1-bin-spark-hadoop2.4/conf/:/Users/taewook/
spark-1.5.1-bin-spark-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar -
Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master
local[*] --class org.apache.spark.repl.Main --name Spark shell spark-shell -i
init.script
========================================
...
Type :help for more information.
...
Spark context available as sc.
...
Loading init.script...
import java.lang.Runtime
cores = 4
scala>
scala> :load init.script
Loading init.script...
import java.lang.Runtime
cores = 4
Run Spark Shell
:paste enter paste mode: all input up to
ctrl-D compiled together
:cp <path> add a jar or directory to the classpath
:history [num] show the history
(optional num is commands to show)
~/.spark_history
The java launch command shows the classpath, JVM options, and so on.
Running initialization commands up front (function/class definitions, etc.) is convenient.
16. An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
17. • Read-only = Immutable
• Parallelism ➔ distributed processing
• Can be cached for a long time ➔ performance
• Transformation for change
• Repeated data copies ➔ lower performance, wasted space
• Overcome by laziness
The core abstraction at the heart of Spark:
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
Resilient Distributed Dataset
18. • Partitioned = Distributed
• more partitions = more parallelism
Resilient Distributed Dataset
The core abstraction at the heart of Spark:
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
19. Resilient Distributed Dataset
• can be rebuilt = Resilient
• recover from lost data partitions
• by data lineage
• can be cached
• shortening the lineage enables faster recovery
The core abstraction at the heart of Spark:
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
20. • Similar to the Scala collection API, plus distributed data operations
• map(), filter(), reduce(), count(), foreach(), …
• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
• Transformations
• return a new RDD
• deterministic: re-running after a failure always produces the same result
• lazy evaluation
• Actions
• return a final value (some other data type)
• actual execution starts from the first RDD (or from the cached RDD, if cached)
RDD Operations
http://training.databricks.com/workshop/sparkcamp.pdf
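A quick shell sketch of this split: the transformation builds a new RDD but runs nothing, and only the action triggers execution.

scala> val nums = sc.parallelize(1 to 10)
scala> val doubled = nums.map { n => println(s"computing $n"); n * 2 }  // lazy: nothing printed yet
scala> doubled.count()  // action: only now do the executors run the map (and the printlns appear)
res0: Long = 10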
22. • Laziness = Lazily constructed
• Deferred computation: postpone work until it is needed
• Separation of evaluation and execution
• Minimal error checking before execution
• No need to store intermediate RDD results
• Intermediate RDDs not materialized
• Immutability & Laziness
• Immutability ➔ makes laziness possible
• No side effects, so transformations can be combined
• combine per-node steps into “stages” (optimization)
• ➔ performance, enables distributed processing
Resilient Distributed Dataset
23. Creating RDDs
• parallelizing a collection
• loads everything into the memory of a single driver machine
• for prototyping and testing only
• loading an external data set
• reading from an external source
• sc.textFile(): file://, hdfs://, s3n://
• sc.hadoopFile(), sc.newAPIHadoopFile()
• sqlContext.sql(), JdbcRDD(), …
scala> val numbers = sc.parallelize(1 to 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2]
at textFile at <console>:21
24. scala> val numbers = sc.parallelize(1 to 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> numbers.partitions.length
res0: Int = 4
scala> numbers.glom().collect()
res1: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5),
Array(6, 7), Array(8, 9, 10))
scala> val numbersWith2Partitions = sc.parallelize(1 to 10, 2)
numbersWith2Partitions: org.apache.spark.rdd.RDD[Int] =
ParallelCollectionRDD[2] at parallelize at <console>:21
scala> numbersWith2Partitions.partitions.length
res2: Int = 2
scala> numbersWith2Partitions.glom().collect()
res3: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7,
8, 9, 10))
Partitions
numbers.mapPartitionsWithIndex()
26. Word Count
val textFile = sc.textFile("README.md", 4)
val words = textFile.flatMap(line => line.split("[\\s]+"))
val realWords = words.filter(_.nonEmpty)
val wordTuple = realWords.map(word => (word, 1))
val groupBy = wordTuple.groupByKey(2)
val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
wordCount.collect().sortBy(-_._2)
27. Word Count
val textFile = sc.textFile("README.md", 4)
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
val words = textFile.flatMap(line => line.split("[\\s]+"))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23
val realWords = words.filter(_.nonEmpty)
realWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:25
val wordTuple = realWords.map(word => (word, 1))
wordTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:27
val groupBy = wordTuple.groupByKey(2)
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29
val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at mapValues at
<console>:31
wordCount.collect().sortBy(-_._2)
res0: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10),
(##,8), (run,7), (can,6), (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
wordCount.saveAsTextFile("wordcount.txt")
28. • The main program is executed on the Spark Driver
• Transformations are executed on the Spark Workers
• Actions may transfer from the Workers to the Driver
collect(), countByKey(), countByValue(), collectAsMap()
➔ • bounded output: count(), take(N)
• unbounded output: saveAsTextFile()
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
Distinguish what runs on the executors (on the workers) from what runs on the driver.
The driver cannot receive data from executors except through actions and accumulators.
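A shell sketch of this rule with the Spark 1.x accumulator API: executors may update the accumulator, and the driver reads its value only after an action has run.

scala> val emptyLines = sc.accumulator(0)                               // lives on the driver
scala> val lines = sc.textFile("README.md")
scala> val kept = lines.map { l => if (l.isEmpty) emptyLines += 1; l }  // executors update it
scala> kept.count()                                                     // run an action first
scala> emptyLines.value                                                 // now safe to read on the driver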
29. Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.dependencies.head.rdd
res2: org.apache.spark.rdd.RDD[_] = ShuffledRDD[5] at groupByKey
at <console>:29
scala> textFile.dependencies.head.rdd
res3: org.apache.spark.rdd.RDD[_] = README.md HadoopRDD[0] at
textFile at <console>:21
scala> textFile.dependencies.head.rdd.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List()
Every RDD tracks its parent RDDs ➔ the foundation of DAG scheduling and recovery
30. Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
[Diagram: the word-count lineage grouped into stages. textFile, flatMap, filter, and map form Stage 0; groupByKey and mapValues form Stage 1; the shuffle is the boundary between the parent stage (Stage 0) and the final stage (Stage 1).]
34. Schedule & Execute tasks
$ bin/spark-shell --master local[3]
...
val textFile = sc.textFile("README.md", 4)
...
val groupBy = wordTuple.groupByKey(2)
35. val groupBy = wordTuple.groupByKey(2)
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29
scala> groupBy.collect()
res5: Array[(String, Iterable[Int])] = Array((package,CompactBuffer(1)), (this,CompactBuffer(1)
), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-
version),CompactBuffer(1)), (Because,CompactBuffer(1)), (Python,CompactBuffer(1, 1)),
(cluster.,CompactBuffer(1)), (its,CompactBuffer(1)), ([run,CompactBuffer(1)),
(general,CompactBuffer(1, 1)), (YARN,,CompactBuffer(1)), (have,CompactBuffer(1)), (pre-
built,CompactBuffer(1)), (locally.,CompactBuffer(1)), (locally,CompactBuffer(1, 1)),
(changed,CompactBuffer(1)), (sc.parallelize(1,CompactBuffer(1)), (only,CompactBuffer(1)),
(several,CompactBuffer(1)), (learning,,CompactBuffer(1)), (basic,CompactBuffer(1)),
(first,CompactBuffer(1)), (This,CompactBuffer(1, 1)), (documentation,CompactBuffer(1, 1, 1)),
(Confi...
• HashMap within each partition
• no map-side aggregation (= the combiner in MapReduce)
• all the values for a single key must fit in memory (risk of OOM, or spilling to disk)
groupByKey()
rdd.groupByKey().mapValues(value => value.reduce(func))
= rdd.reduceByKey(func)
When a replacement is possible, use reduceByKey, aggregateByKey, foldByKey, or combineByKey instead of groupByKey
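For example, a per-key average with aggregateByKey aggregates on the map side and never materializes the full list of values (a small sketch with made-up data):

scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 3)))
scala> val sumCount = pairs.aggregateByKey((0, 0))(
     |   (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value into a partition-local (sum, count)
     |   (x, y) => (x._1 + y._1, x._2 + y._2))   // merge (sum, count) pairs across partitions
scala> sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }.collect()
// Array((a,2.0), (b,4.0)), in some order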
36. scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
wordCountReduceByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:29
scala> wordCountReduceByKey.toDebugString
res6: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
res7: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
reduceByKey()
scala> sc.setLogLevel("INFO")
scala> wordCountReduceByKey.collect().sortBy(-_._2)
...
... INFO DAGScheduler: Got job 3 (collect at <console>:32) with 2 output partitions
... INFO DAGScheduler: Final stage: ResultStage 7(collect at <console>:32)
... INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
... INFO DAGScheduler: Missing parents: List()
... INFO DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29), which has no missing
parents
... INFO MemoryStore: ensureFreeSpace(2328) called with curMem=109774, maxMem=555755765
... INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 529.9 MB)
... INFO MemoryStore: ensureFreeSpace(1378) called with curMem=112102, maxMem=555755765
... INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1378.0 B, free 529.9 MB)
... INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:60479 (size: 1378.0 B, free: 530.0 MB)
... INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
... INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29)
...
res9: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
38. scala> sc.setLogLevel("INFO")
scala> val textFile = sc.textFile("README.md", 4)
scala> val words = textFile.flatMap(line => line.split("[\\s]+"))
scala> val realWords = words.filter(_.nonEmpty)
scala> val wordTuple = realWords.map(word => (word, 1))
scala> wordTuple.cache()
scala> val groupBy = wordTuple.groupByKey(2)
scala> val wordCount = groupBy.mapValues(value => value.reduce(
_ + _))
scala> wordCount.toDebugString
res2: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.collect().sortBy(-_._2)
...
... INFO BlockManagerInfo: Added rdd_4_0 in memory on localhost:60641 (size: 9.9 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_1 in memory on localhost:60641 (size: 9.5 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_2 in memory on localhost:60641 (size: 10.7 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_3 in memory on localhost:60641 (size: 8.0 KB, free: 530.0 MB)
...
Cache
39. scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
scala> wordCountReduceByKey.toDebugString
res4: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| CachedPartitions: 4; MemorySize: 38.1 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
...
... INFO BlockManager: Found block rdd_4_0 locally
... INFO BlockManager: Found block rdd_4_1 locally
... INFO BlockManager: Found block rdd_4_2 locally
... INFO BlockManager: Found block rdd_4_3 locally
...
Cache
• Spark is fast because it specializes in iterative computation, but without caching this advantage is lost
• Without a cache, everything is recomputed from the first RDD
• LRU (Least-Recently-Used) eviction policy
• No need to cache in memory unless the RDD will be reused soon
• Manual cache release: RDD.unpersist()
• The default StorageLevel (MEMORY_ONLY) stores deserialized objects in memory
• more memory use, less CPU cost
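A shell sketch of explicit storage-level control and manual release:

scala> import org.apache.spark.storage.StorageLevel
scala> val words = sc.textFile("README.md").flatMap(_.split("[\\s]+"))
scala> words.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized: less memory, more CPU
scala> words.count()                                // first action materializes the cache
scala> words.count()                                // now served from the cache
scala> words.unpersist()                            // release when no longer reused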
45. DataFrame
• Distributed collection of rows organized into named columns.
• inspired by DataFrame in R and Pandas in Python
• RDD with schema (org.apache.spark.sql.SchemaRDD before v1.3)
• Python, Scala, Java, and R (via SparkR)
• Making Spark accessible to everyone (including people used to relational databases or R)
• data scientists, engineers, statisticians, ...
http://www.slideshare.net/databricks/2015-0616-spark-summit
48. DataFrame API
• How to use the API
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
• https://github.com/yu-iskw/spark-dataframe-introduction/blob/master/doc/dataframe-introduction.md
• Books lag behind; for the latest details, refer to the source code
• Example Code
• /spark/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
• TestSuite
• /spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
50. Same performance for all languages
http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015
51. • Simple tasks easy with DataFrame API
• Complex tasks possible with RDD API
scala> import sqlContext.implicits._
scala> case class Person(name: String, age: Int)
scala> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p =>
Person(p(0), p(1).trim.toInt)).toDF()
people: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> people.registerTempTable("people")
scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> teenagers.show()
+------+---+
| name|age|
+------+---+
|Justin| 19|
+------+---+
scala> val teenagersRdd = teenagers.rdd
teenagersRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[8] at rdd at <console>:24
scala> teenagersRdd.toDebugString
res2: String =
(2) MapPartitionsRDD[8] at rdd at <console>:24 []
| MapPartitionsRDD[7] at rdd at <console>:24 []
| MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:26 []
| MapPartitionsRDD[3] at map at <console>:26 []
| MapPartitionsRDD[2] at map at <console>:26 []
| MapPartitionsRDD[1] at textFile at <console>:26 []
| examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:26 []
scala> teenagersRdd.collect()
res3: Array[org.apache.spark.sql.Row] = Array([Justin,19])
54. • Apache Spark User List
• http://apache-spark-user-list.1001560.n3.nabble.com/
• Devops Advanced Class
• http://training.databricks.com/devops.pdf
• Intro to Apache Spark
• http://training.databricks.com/workshop/sparkcamp.pdf
• Apache Spark Tutorial
• http://cdn.liber118.com/workshop/fcss_spark.pdf
• Anatomy of RDD : Deep dive into Spark RDD abstraction
• http://www.slideshare.net/datamantra/anatomy-of-rdd
• A Deeper Understanding of Spark’s Internals
• https://spark-summit.org/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
• Scala and the JVM for Big Data: Lessons from Spark
• https://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons.pdf
• Lightning Fast Big Data Analytics with Apache Spark
• http://www.virdata.com/wp-content/uploads/Spark-Devoxx2014.pdf
References
58. • Number of partitions
• When reading very large files, use coalesce(N) to reduce the number of partitions (and thus concurrent tasks)
• Without a repartition, one executor processes multiple partitions
• For long-running CPU-bound operations, increase parallelism with repartition(N)
• Number of executors
• At least twice the maximum number of partitions in the job
• Executor Memory
• Use at most 75% of the machine's memory
• Minimum heap size: 8GB
• Maximum heap size: no more than 40GB (watch GC behavior)
• Memory usage depends on the StorageLevel and the serialization format
• G1GC settings
• https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
• Miscellaneous tuning
• http://spark.apache.org/docs/latest/tuning.html
Tips
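A shell sketch of the two directions (the path and partition counts are illustrative):

scala> val raw = sc.textFile("hdfs:///logs/big", 400)  // many input splits
scala> val fewer = raw.coalesce(50)      // shrink partition count without a full shuffle
scala> val wider = raw.repartition(800)  // full shuffle, adds parallelism for CPU-heavy stages
scala> (fewer.partitions.length, wider.partitions.length)
res0: (Int, Int) = (50,800)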
59. • Instead of computing min, max, sum, and mean in separate jobs, get them all at once with stats()
• count, sum, min, max, mean, stdev, variance, sampleStdev, sampleVariance
• How to diagnose shuffle problems
• In the Web UI, look for stages/partitions that take a long time or have large input/output
• KryoSerializer
• /conf/spark-defaults.conf
• "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
• Recommended together with StorageLevel.MEMORY_ONLY_SER
• Task not serializable: java.io.NotSerializableException
• Create and use objects inside the function that runs on the executor
• https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
Tips
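A shell sketch of the stats() tip and the Kryo setting mentioned above (the serializer must be configured before the SparkContext is created):

scala> val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
scala> val st = nums.stats()   // one job instead of separate min/max/sum/mean jobs
scala> (st.count, st.min, st.max, st.mean, st.stdev)
res0: (Long, Double, Double, Double, Double) = (4,1.0,4.0,2.5,1.118033988749895)

// In conf/spark-defaults.conf:
// spark.serializer  org.apache.spark.serializer.KryoSerializer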