Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of JVM, but Python is one of the officially supported languages. But how does it actually work? How can Python communicate with Java / Scala? In this talk, we’ll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
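As a taste of the internals the talk covers, PySpark drives the JVM through a Py4J gateway. A minimal sketch, assuming a local Spark install (the _jvm handle is an internal API, shown here for illustration only):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
jvm = spark.sparkContext._jvm                      # Py4J gateway into the driver's JVM
millis = jvm.java.lang.System.currentTimeMillis()  # call an arbitrary JVM method from Python
print(millis)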
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin... (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka PySpark tutorial will provide you with detailed and comprehensive knowledge of PySpark: how it works and why Python works so well with Apache Spark. You will also learn about RDDs, DataFrames and MLlib.
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you a complete insight into the fundamental concepts of PySpark, including the following:
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ... (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Training will help you learn the PySpark API. You will get to know how Python can be used with Apache Spark for Big Data analytics. Edureka's structured training on PySpark will help you master the skills required to become a successful Spark Developer using Python, and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).
Data Wrangling with PySpark for Data Scientists Who Know Pandas with Andrew Ray (Databricks)
Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.
In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance considerations and debugging.
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
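A common pattern for this kind of integration (not necessarily the one described in the talk) is to load the C++ library with ctypes inside each executor and apply it per partition; the library name and function signature below are hypothetical:
import ctypes
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def score_partition(texts):
    # load the shared library on the executor, once per partition
    lib = ctypes.CDLL('libsentiment.so')      # hypothetical C++ library, shipped via --files
    lib.score.restype = ctypes.c_double
    lib.score.argtypes = [ctypes.c_char_p]
    for text in texts:
        yield (text, lib.score(text.encode('utf-8')))

rdd = spark.sparkContext.parallelize(['good news', 'bad news'])
print(rdd.mapPartitions(score_partition).collect())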
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste... (Spark Summit)
If you are running Apache Spark in cloud environments, object stores such as Amazon S3 or Azure WASB are a core part of your system. What you can't do is treat them like "just another filesystem": do that and things will, eventually, go horribly wrong.
This talk looks at the object stores in the cloud infrastructures, including their underlying architectures, compares them to what a "real filesystem" is expected to do, and shows how to use object stores efficiently and safely as sources and destinations of data.
It goes into depth on recent "S3a" work, covering improvements in performance, security, functionality and measurement, and demonstrating how to make best use of it from a Spark application.
If you are planning to deploy Spark in the cloud, or are doing so today, this is information you need to understand. The performance of your code and the integrity of your data depend on it.
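For flavor, a hedged sketch of pointing Spark at S3 through the s3a connector (bucket and credentials are placeholders; exact keys can vary with your Hadoop version):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY')
         .config('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_KEY')
         .config('spark.speculation', 'false')   # speculative tasks are risky against object stores
         .getOrCreate())

df = spark.read.parquet('s3a://my-bucket/input/')
df.write.parquet('s3a://my-bucket/output/')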
Development of Software for scalable anomaly detection modeling of time-series data using Apache Spark.
We have been researching and developing methods and software for detecting anomalies by analyzing time-series data from sensors that monitor various kinds of equipment.
The software presented here learns and models, in batch, high-dimensional time-series data collected from multiple sensors using linear LASSO regression, and uses the model to identify anomalous periods.
However, as training time and memory usage grew into a problem, we parallelized and distributed the workload with Spark.
Spark ships with MLlib, a general-purpose machine learning library, but given the specifics of our algorithm we built a new implementation based on the existing one.
In this talk we report on the design choices and performance measurements from this development.
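This is not the authors' implementation, but the general pattern of distributing many independent model fits over Spark looks roughly like this (scikit-learn's Lasso, one model per sensor, with toy stand-in data):
from pyspark.sql import SparkSession
import numpy as np
from sklearn.linear_model import Lasso

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# toy stand-in: one (sensor_id, feature matrix, targets) tuple per sensor
data = [(s, np.random.rand(100, 5), np.random.rand(100)) for s in range(8)]

def fit_sensor(item):
    sensor_id, X, y = item
    # each sensor's linear LASSO model is fitted independently, in parallel
    return sensor_id, Lasso(alpha=0.1).fit(X, y).coef_

coefs = dict(sc.parallelize(data).map(fit_sensor).collect())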
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... (Edureka!)
This Edureka Spark Tutorial will help you understand all the basics of Apache Spark. This Spark tutorial is ideal both for beginners and for professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS... (Databricks)
GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications such as deep neural networks and nuclear or weather simulation. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as Pandas/NumPy/Scikit-learn/XGBoost. Through its use of Dask wrappers, the platform allows for true, large-scale computation with minimal, if any, code changes.
The goal of this talk is to discuss RAPIDS: its functionality and architecture, as well as the way it integrates with Spark, on many occasions providing several orders of magnitude of acceleration versus its CPU-only counterparts.
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met... (Databricks)
This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators and performance practitioners. You will find examples illustrating the importance of using the right tools and right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root-cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant for performance analysis, including tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metrics and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, and tools for profiling CPU usage and for Flame Graph visualization.
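For instance, a hedged sketch of sparkMeasure's Python API (pip install sparkmeasure; details may differ between versions):
from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)     # assumes an active SparkSession named spark
stagemetrics.begin()
spark.range(10**7).selectExpr('sum(id)').show()
stagemetrics.end()
stagemetrics.print_report()            # aggregated stage/task metrics for the measured workload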
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas' new functionalities since its initial release, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.
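The core idea, in a few lines (a sketch against the Koalas 0.x API; data is illustrative):
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 20, 30]})
kdf = ks.from_pandas(pdf)         # distributed under the hood, pandas-like on the surface
print(kdf.groupby('x').sum())     # the same call you would write in pandas
print(kdf.to_pandas())            # collect back to a local pandas DataFrame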
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Pre-register for Databricks Community Edition
Read koalas docs
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark (Databricks)
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
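Ray's core programming model, for reference (a minimal sketch on a single machine):
import ray

ray.init()  # start Ray locally; on a cluster this would connect to the head node

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(8)]  # tasks run in parallel
print(ray.get(futures))                         # resolve the futures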
Using SparkR to Scale Data Science Applications in Production. Lessons from t... (Spark Summit)
R is a hugely popular platform for data scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of a large enterprise, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option for productionizing data science applications has become available. This talk will give insight into two real-life projects at major enterprises where data science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours on a single-server, single-threaded setup. With moderate effort we have been able to reduce that to 15 minutes with SparkR, and we will show how we plan to further reduce it to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
Using open source tools for network device dataplane testing.
Our experiences from redGuardian DDoS mitigation scrubber testing.
Presented at PLNOG 20 (2018).
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou... (PROIDEA)
The choice of a target network platform (e.g. a router, firewall, or DDoS scrubber) is often preceded by testing it. One goal of such tests is to check whether the performance parameters declared by the vendor match reality. The team developing redGuardian Anti-DDoS has been running automated regression and performance tests of the solution since its inception. In this presentation we will analyze the aspects worth paying attention to during performance testing of IP devices, and look at open source tools that help with this task.
We provide an update on developments in the intersection of the R and the broader machine learning ecosystems. These collections of packages enable R users to leverage the latest technologies for big data analytics and deep learning in their existing workflows, and also facilitate collaboration within multidisciplinary data science teams. Topics covered include:
– MLflow: managing the ML lifecycle with improved dependency management and more deployment targets
– TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability
– Spark: latest improvements and extensions, including text processing at scale with SparkNLP
How to measure everything - a million metrics per second with minimal develop... (Jos Boumans)
Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux will get one or more as well. We grew from zero traffic to several billion requests per day in the span of 2 years, and we did so exclusively in AWS.
To make the right decisions in such a volatile environment, we knew that data is everything; without it, you can't possibly make informed decisions. However, collecting it efficiently, at scale, at minimal cost and without burdening developers is a tremendous challenge.
Join me in this session to learn how we overcame this challenge at Krux; I will share with you the details of how we set up our global infrastructure, entirely managed by Puppet, to capture over a million data points every second on virtually every part of the system, including inside the web server, user apps and Puppet itself, for under $2000/month using off-the-shelf Open Source software and some code we've released as Open Source ourselves. In addition, I'll show you how you can take (a subset of) these metrics and send them to advanced analytics and alerting tools like Circonus or Zabbix.
This content will be applicable for anyone collecting or desiring to collect vast amounts of metrics in a cloud or datacenter setting and making sense of them.
Imagine you're tackling one of these evasive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit, which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.
Hopefully, you won't have to yearn for a lot longer. eBPF (extended Berkeley Packet Filters) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on-the-fly, without spilling it to huge files and reading them from user space.
In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.
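For flavor, the classic minimal BCC program (requires root and a BCC install; the exact syscall probe name varies by kernel):
from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("clone called\\n");
    return 0;
}
"""

b = BPF(text=prog)                                             # compile the BPF program via LLVM
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()                                                # stream one line per clone() call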
Tech talk by Serena Signorelli (https://www.linkedin.com/in/serenasignorelli/) in the event ''Tensorflow and Sparklyr: Scaling Deep Learning and R to the Big Data ecosystem'', May 15, 2017 at ICTeam Grassobbio (BG). The event was part of the Data Science Milan Meetup (https://www.meetup.com/it-IT/Data-Science-Milan/).
This talk will present R as a programming language suited for solving data analysis and modeling problems, MLflow as an open source project to help organizations manage their machine learning lifecycle and the intersection of both by adding support for R in MLflow. It will be highly interactive and touch on some of the technical implementation choices taken while making R available in MLflow. It will also demonstrate using MLflow tracking, projects, and models directly from R as well as reusing R models in MLflow to interoperate with other programming languages and technologies.
A talk from Toronto's FITC Spotlight on Hardware event. I spoke about using tools like openFrameworks, OpenCV, and the Kinect to create interactive installations, and paired it with an interactive lighting installation.
References, citations, and source code can be found here: http://www.andrewlb.com/2013/06/sls-notes/
Running R at Scale with Apache Arrow on Spark (Databricks)
In this talk you will learn how to easily configure Apache Arrow with R on Apache Spark, which will allow you to gain speed improvements and expand the scope of your data science workflows; for instance, by enabling data to be efficiently transferred between your local environment and Apache Spark. This talk will present use cases for running R at scale on Apache Spark. It will also introduce the Apache Arrow project and recent developments that enable running R with Apache Arrow on Apache Spark to significantly improve performance and efficiency. We will end this talk by discussing performance and recent development in this space.
Author: Javier Luraschi
PLNOG 13: P. Kupisiewicz, O. Pelerin: Make IOS-XE Troubleshooting Easy – Pack... (PROIDEA)
Piotr Kupisiewicz – Technical Expert in Krakow's TAC VPN team. In IT for more than 10 years, of which roughly 5 years were mostly software engineering; the last 5 years have been spent mostly in networking, with a particular interest in network security. His hobbies are drums and very heavy music. CCIE Security 39762.
Olivier Pelerin – as a key member of the escalation team at Cisco's Technical Assistance Center, he handles worldwide escalations on VPN technologies pertaining to IPSEC, DMVPN, EzVPN, GetVPN, FlexVPN and PKI. Olivier has spent years troubleshooting and diagnosing issues on some of the largest and most complex VPN deployments. Olivier holds a CCIE in Security, #20306.
Topic of Presentation: Make IOS-XE Troubleshooting Easy – Packet-Tracer
Language: English
Abstract: "IOS-XE is an operating system running on Service Provider devices like the ASR series and the ISR-4451. The aim of this session is to show how very complicated Service Provider configurations can be easily troubleshooted using the packet-tracer tool."
Slides from the Oracle ANZ workshop held in Sydney and Melbourne. We look at the killer features that will make 18c and 19c great productivity upgrades for DBAs.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
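A hedged sketch of the first idea, skipping already-converged vertices in a toy in-memory PageRank (plain Python; assumes no dangling nodes):
def pagerank(adj, d=0.85, tol=1e-6):
    # adj: dict vertex -> list of out-neighbours; every vertex must have out-links
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    incoming = {v: [u for u in adj if v in adj[u]] for v in adj}
    active = set(adj)                      # vertices whose rank is still moving
    while active:
        new = {}
        for v in adj:
            if v not in active:
                new[v] = rank[v]           # skip computation on converged vertices
            else:
                new[v] = (1 - d) / n + d * sum(rank[u] / len(adj[u]) for u in incoming[v])
        active = {v for v in active if abs(new[v] - rank[v]) > tol}
        rank = new
    return rank

print(pagerank({'a': ['b'], 'b': ['c'], 'c': ['a']}))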
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation stored in flat arrays.
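A minimal sketch of CSR for a small directed graph (illustrative):
# CSR keeps all adjacency lists in one flat array plus per-vertex offsets
edges = {0: [1, 2], 1: [2], 2: [0]}
offsets, targets = [0], []
for v in sorted(edges):
    targets.extend(edges[v])
    offsets.append(len(targets))

def neighbours(v):
    return targets[offsets[v]:offsets[v + 1]]

print(neighbours(0))  # [1, 2]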
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
1. LIFE OF PYSPARK
A TALE OF TWO ENVIRONMENTS
Mohanababu Sathyakumari Shankar
2. CONTENTS
Who I am!
A Brief History of Spark
Grapes of Spark
The Metamorphosis
Brave New PySpark
To Kill a Mocking Bear
Pride and Production
Sense and Scalability
A Song of Scala and Python
The Finkler Questions
The Sense of an Ending
3. WHO I AM
Data Engineer by day
Data Scientist by night
Data Geek, all day long
Natural habitat: KI labs
MSc Computer Science, TU München
Software Engineer, Oracle Financial Services Software, Bangalore
4. A BRIEF HISTORY OF SPARK
5. A BRIEF HISTORY OF SPARK
MAPREDUCE AND RECYCLE
Slow - disk-based access
Cumbersome programming - more lines of code
Abstractions-less - nothing more
Batch processing - the default
Built-in interactive mode - not available
Support - Java, Python (verbose)
7. A BRIEF HISTORY OF SPARK
Matei Zaharia, AMPLab, UC Berkeley
October 2012 - Spark v0.6.0
June 2013 - Apache Incubator
Unified analytics engine
8. A BRIEF HISTORY OF SPARK
Fast - in-memory processing
Concise - fewer lines of code
Special abstractions - RDDs ++
Stream and batch processing
Built-in interactive mode
Support - Java, Scala, Python & R
9. GRAPES OF SPARK
10. GRAPES OF SPARK
FASTER PROCESSING
In-memory processing
DAG - non-linear flow
Lazy evaluation
Catalyst - query optimiser
Stream and batch processing
11. GRAPES OF SPARK
EASE OF USE
Concise - fewer lines of code
Support - Java, Scala, Python & R
80+ high-level operators
Built-in interactive mode
Numerous projects atop Spark
12. GRAPES OF SPARK
DIVERSITY
Leverages multiple libraries
Spark SQL - SQL-styled processing
Spark Streaming - streaming data
MLlib - machine learning
GraphX and GraphFrames - graphs
BlinkDB/Tachyon - SQL analytics
13. GRAPES OF SPARK
RDD
Primary abstraction in Spark
Created - from a file or from other RDDs
Immutable - read-only
Partitioned - across nodes
Distributed - parallel
Fault-tolerant - lineage
Object collection - Java/Scala
14. GRAPES OF SPARK
DATAFRAMES
Inspired by DFs in R/Python
Relational - table structure
Named - columns
Schema - structure defined by a schema
SQL - API to build query plans
Catalyst - query optimiser
15. GRAPES OF SPARK
DATASETS
Best features of RDDs and DFs
Unnamed - columns
Schema-less - no schema
Non-relational - no table
Type safe - compile-time type safety
16. THE METAMORPHOSIS
17. THE METAMORPHOSIS
TRANSFORMATIONS
Operate on RDDs and DFs
Create new RDDs
RDD lineage - a DAG
Lazy evaluation
map, filter, groupBy, sortBy
union, intersection, distinct
18. THE METAMORPHOSIS
ACTIONS
Operate on RDDs and DFs
Functions applied on RDDs
Create no new RDDs
Trigger lazy evaluation (initiator)
count, reduce, collect
aggregate, first, take, sum
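A minimal sketch of the two slides above, assuming a SparkContext named sc:
nums = sc.parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: builds lineage, runs nothing
doubled = evens.map(lambda x: x * 2)        # transformation: extends the DAG
print(doubled.collect())                    # action: triggers the job -> [0, 4, 8, 12, 16]
print(doubled.count())                      # action: 5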
19. BRAVE NEW PYSPARK
20. BRAVE NEW PYSPARK
PYTHON + SPARK
21. BRAVE NEW PYSPARK
TIME AND COMPLEXITY
22. BRAVE NEW PYSPARK
NOTEBOOK INTEGRATION
23. BRAVE NEW PYSPARK
SETUP
OPTION 1: DOWNLOAD TAR RELEASE
wget https://www.apache.org/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
PATH="$PATH:$(pwd)/spark-2.4.0-bin-hadoop2.7/bin"
25. BRAVE NEW PYSPARK
SETUP
OPTION 2: USING BREW ON MACOS
brew install apache-spark
26. BRAVE NEW PYSPARK
SETUP
OPTION 3: USING PYPI
pip install pyspark
27. BRAVE NEW PYSPARK
SETUP
OPTION 4: USING CONDA
conda install -c conda-forge pyspark=2.3.1
28. BRAVE NEW PYSPARK
SETUP
CONFIGURE AND START
## Running PySpark in cluster mode inside Jupyter
## Include additional python modules
IPYTHON_OPTS="notebook" pyspark \
  --master spark://localhost:7077 \
  --executor-memory 7g \
  --py-files tensorflow-py2.7.egg
30. BRAVE NEW PYSPARK
EASY TO PROTOTYPE
31. TO KILL A MOCKING BEAR
32. TO KILL A MOCKING BEAR
LOADING CSV
Pandas Dataframe:
df = pd.read_csv("world_rankings.csv")
PySpark Dataframe:
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load("world_rankings.csv")
33. TO KILL A MOCKING BEAR
VIEW DATAFRAME
Pandas Dataframe:
df
df.head(10)
PySpark Dataframe:
df
df.show(10)
34. TO KILL A MOCKING BEAR
COLUMNS AND DATATYPES
Pandas Dataframe:
df.columns
df.dtypes
PySpark Dataframe:
df.columns
df.dtypes
35. TO KILL A MOCKING BEAR
DROP COLUMN
Pandas Dataframe:
df.drop('column1', axis=1)
PySpark Dataframe:
df.drop('column1')
36. TO KILL A MOCKING BEAR
FILL NULLS
Pandas Dataframe:
df.fillna(0)
PySpark Dataframe:
df.fillna(0)
37. TO KILL A MOCKING BEAR
AGGREGATION
Pandas Dataframe:
df.groupby(['column1', 'column2']).agg({"column3": "mean", "column4": "min"})
PySpark Dataframe:
df.groupby(['column1', 'column2']).agg({"column3": "mean", "column4": "min"})
38. TO KILL A MOCKING BEAR
MERGE/JOIN DATAFRAMES
Pandas Dataframe:
left.merge(right, on='key')
left.merge(right, left_on='column1', right_on='column2')
PySpark Dataframe:
left.join(right, on='key')
left.join(right, left.column1 == right.column2)
39. TO KILL A MOCKING BEAR
SUMMARY STATISTICS
Pandas Dataframe:
df.describe()
PySpark Dataframe:
df.describe().show()
40. TO KILL A MOCKING BEAR
RENAME COLUMNS
Pandas Dataframe:
df.columns = ['C1', 'C2', 'C3']
df.rename(columns={"C1": "c1", "C2": "c2", "C3": "c3"})
PySpark Dataframe:
df.toDF('C1', 'C2', 'C3')
df.withColumnRenamed('C1', 'c1')
41. TO KILL A MOCKING BEAR
FILTER COLUMNS
Pandas Dataframe:
df[(df.column1 < 10) & (df.column2 == 100)]
PySpark Dataframe:
df.filter((df.column1 < 10) & (df.column2 == 100))
42. TO KILL A MOCKING BEAR
ADD COLUMN
Pandas Dataframe:
df['new_column'] = 1 / df.column1
PySpark Dataframe:
df.withColumn('new_column', 1 / df.column1)
43. TO KILL A MOCKING BEAR
STANDARD TRANSFORMATIONS
Pandas Dataframe:
import numpy as np
df['log_values'] = np.log(df['values'])
PySpark Dataframe:
import pyspark.sql.functions as F
df.withColumn('log_values', F.log(df['values']))
44. TO KILL A MOCKING BEAR
ROW CONDITIONAL STATEMENTS
Pandas Dataframe:
df['conditional'] = df.apply(lambda x: 1 if x.column1 > 20 else 10 if x.column2 == 100 else 42, axis=1)
PySpark Dataframe:
import pyspark.sql.functions as F
df.withColumn('conditional',
    F.when(df.column1 > 20, 1)
     .when(df.column2 == 100, 10)
     .otherwise(42))
46. TO KILL A MOCKING BEAR
PIVOT TABLE
Pandas Dataframe:
pd.pivot_table(df, values='column4', index=['column1', 'column2'], columns=['column3'], aggfunc=np.sum)
PySpark Dataframe:
df.groupBy("column1", "column2").pivot("column3").sum("column4")
47. TO KILL A MOCKING BEAR
HISTOGRAM
Pandas Dataframe:
df.hist()
PySpark Dataframe:
df.sample(False, 0.1).toPandas().hist()
48. TO KILL A MOCKING BEAR
SQL QUERIES
Pandas Dataframe:
Not Applicable
PySpark Dataframe:
df.createOrReplaceTempView('TempTable')
df_query = spark.sql('select * from TempTable')
49. PRIDE AND PRODUCTION
50. PRIDE AND PRODUCTION
PYTHON FUNCTIONS IN SPARK
Iterate through complete data
Row-by-row access too slow
Distributed chunks of data
Production environment
No conventional functions
51. PRIDE AND PRODUCTION
PYSPARK UDFS (ROW-AT-A-TIME UDFS)
Primitive Python functions
map() and apply()
Output data type is specified
Series/Scalar operations only
Row-by-row access too slow
Non-vectorized ser/deser
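A minimal sketch of such a row-at-a-time UDF (assumes a DataFrame df with a temp_f column):
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# the output data type is specified; the function runs once per row, with ser/deser each time
to_celsius = udf(lambda f: (f - 32) * 5.0 / 9.0, DoubleType())
df = df.withColumn('temp_c', to_celsius(df.temp_f))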
52. PRIDE AND PRODUCTION
PANDAS UDFS (VECTORIZED UDFS)
Optimised Python functions
Supports Pandas & Scikit-learn
Apache Arrow based
Vectorized ser/deser
Output data type required
PandasUDFType required
DataFrame schema required
Scalar and Grouped Map
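The vectorized equivalent of the previous sketch, against the Spark 2.3/2.4-era Pandas UDF API:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def to_celsius_vec(f):
    # f arrives as a whole pandas Series, transferred via Apache Arrow
    return (f - 32) * 5.0 / 9.0

df = df.withColumn('temp_c', to_celsius_vec(df.temp_f))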
53. PRIDE AND PRODUCTION
SCALAR AND GROUPEDBY UDFS
DIFFERENCES
54. PRIDE AND PRODUCTION
SCALAR AND GROUPEDBY UDFS
PERFORMANCE
55. PRIDE AND PRODUCTION
DBSCAN ON SPARK
Density-Based Spatial Clustering
Stay-point detection
Telematics data from trucks
Spark MLlib DBSCAN - no
O(n^2) complexity
Implementations exist - non-performant and with bugs
Scikit-learn required
ELKI - O(n log n), Java
56. PRIDE AND PRODUCTION
DBSCAN USING PANDAS UDF
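A hedged sketch of the idea on this slide: running scikit-learn's DBSCAN per truck inside a grouped-map Pandas UDF (schema, column names and eps/min_samples are illustrative assumptions):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.cluster import DBSCAN

@pandas_udf('truck_id long, lat double, lon double, cluster long', PandasUDFType.GROUPED_MAP)
def cluster_stay_points(pdf):
    # each group (one truck) arrives as a plain pandas DataFrame
    pdf['cluster'] = DBSCAN(eps=0.001, min_samples=5).fit_predict(pdf[['lat', 'lon']])
    return pdf

result = df.groupby('truck_id').apply(cluster_stay_points)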
57. SENSE AND SCALABILITY
58. SENSE AND SCALABILITY
SCALA UDFS
Driver in native Python
Non-native JVM objects
2x ser/deser required
Scala UDFs - best approach (Spark v2.1)
JVM only scope
Unnecessary ser/deser avoided
59. SENSE AND SCALABILITY
SCALA UDFS
Create Scala UDF as a Scala project
Build JAR using SBT
Submit JAR to the PySpark session
Register the Scala UDF
JVM only scope
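A hedged sketch of those steps from the Python side (the JAR, class name and table are hypothetical; the class must implement Spark's Java UDF interface):
# started as: pyspark --jars my-udfs.jar
from pyspark.sql.types import DoubleType

# register a UDF implemented in Scala/Java for use from SQL
spark.udf.registerJavaFunction('to_celsius_scala', 'com.example.udf.ToCelsius', DoubleType())
spark.sql('SELECT to_celsius_scala(temp_f) FROM readings').show()  # 'readings' is a temp view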
60. SENSE AND SCALABILITY
Benchmark: Python UDF vs Pandas UDF vs Scala UDF
61. A SONG OF SCALA AND PYTHON
62. A SONG OF SCALA AND PYTHON
PATCH-22
Python expertise is high
Spark MLlib not mature enough
Pandas and Scikit-learn required
Blackbox behaviour of UDFs
High-level column-based usage
Objects conversion avoided
63. A SONG OF SCALA AND PYTHON
THE PY4J REDEMPTION
64. A SONG OF SCALA AND PYTHON
NO PYTHON FOR SPARK MAIN()
65. A SONG OF SCALA AND PYTHON