In this talk, I gave an overview of Dask and Kubernetes and how they can be used together to perform tasks such as data wrangling, Monte Carlo simulations, machine learning, and hyper-parameter optimisation.
Dask Tutorial at PyConDE / PyData Karlsruhe 2018. These are the introductory slides, which mainly contain the link to Matthew Rocklin's Dask workshop at PyData NYC 2018, on which this workshop was based.
Lens: Data exploration with Dask and Jupyter widgets - Víctor Zabalza
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will present Lens (https://github.com/asidatascience/lens), a Python package which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask - Víctor Zabalza
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk - Spark Summit
Many data scientists are already making heavy usage of the Jupyter ecosystem for analyzing data using interactive notebooks.
Apache Toree (incubating) is a Jupyter kernel designed to act as a gateway to Spark, enabling users to work with Spark from standard Jupyter notebooks. This allows users to easily integrate Spark into their existing Jupyter deployments and to move between languages and contexts without needing to switch to a different set of tools.
Apache Toree is designed expressly for interactive work. It supports interpreters in Scala, Python, and R.
In this talk, I will cover the design of Toree, how it interacts with the Jupyter ecosystem and various ways in which users can extend the functionality of Apache Toree via a powerful plugin system.
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J... - Spark Summit
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum... - Spark Summit
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
Apache Spark on K8S Best Practice and Performance in the Cloud - Databricks
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager for best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analyze the performance impact of queries across configuration sets.
Speakers: Junjie Chen, Junping Du
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While... - Databricks
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
Keeping Spark on Track: Productionizing Spark for ETL - Databricks
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the series of "Apache Spark Hands-on"
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
Mobility insights at Swisscom - Understanding collective mobility in Switzerland - François Garillot
Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-based insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning. In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data. Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come back from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
Deep Learning on Apache® Spark™: Workflows and Best Practices - Jen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi - Databricks
Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant families of algorithms – deep learning – remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and because of the inherent difficulty of tuning and configuration.
In this session, you’ll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You’ll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla - Spark Summit
Spark has deservedly been adopted as the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies, so combining the two is one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, with any of its cluster managers, without degrading Spark's outstanding performance.
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis etc. allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which helps illustrate good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Scaling Data Analytics Workloads on Databricks - Databricks
Imagine an organization with thousands of users who want to run data analytics workloads. These users shouldn’t have to worry about provisioning instances from a cloud provider, deploying a runtime processing engine, scaling resources based on utilization, or ensuring their data is secure. Nor should the organization’s system administrators.
In this talk we will highlight some of the exciting problems we’re working on at Databricks in order to meet the demands of organizations that are analyzing data at scale. In particular, data engineers attending this session will walk away having learned how we:
Manage a typical query lifetime through the Databricks software stack
Dynamically allocate resources to satisfy the elastic demands of a single cluster
Isolate the data and the generated state within a large organization with multiple clusters
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal) - Jakub Botwicz
Presentation about Cotopaxi toolkit from Black Hat Asia 2019 Arsenal session. Author: Jakub Botwicz
https://www.blackhat.com/asia-19/arsenal/schedule/index.html#cotopaxi-iot-protocols-security-testing-toolkit-14325
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter - Anne Nicolas
NDIV is a young, very simple, yet efficient network traffic diverter. Its purpose is to help build network applications that intercept packets at line rate with a very low processing overhead. A first example application is a stateless HTTP server reaching line rate on all packet sizes.
Willy Tarreau, HaproxyTech
CocoaConf: The Language of Mobile Software is APIs - Tim Burks
We’re all excited about using the same language to write our mobile apps and cloud services, but as we do, we’ll still need to work with a few things that aren’t written with Swift. Fortunately, there are some great patterns that we can use for doing that. In this session we’ll talk about two technologies that you can use to make your app speak with APIs written in any language: OpenAPI and Protocol Buffers, and then we’ll see how to use them from clients and servers that are written in Swift.
Presented Friday November 4, 2016 in San Jose.
This presentation introduces Data Plane Development Kit overview and basics. It is a part of a Network Programming Series.
First, the presentation focuses on the network performance challenges on the modern systems by comparing modern CPUs with modern 10 Gbps ethernet links. Then it touches memory hierarchy and kernel bottlenecks.
The following part explains the main DPDK techniques, like polling, bursts, hugepages and multicore processing.
The DPDK overview explains how a DPDK application is initialized and run, and touches on lockless queues (rte_ring), memory pools (rte_mempool), memory buffers (rte_mbuf), hashes (rte_hash), cuckoo hashing, the longest prefix match library (rte_lpm), poll mode drivers (PMDs), and the kernel NIC interface (KNI).
At the end, there are a few DPDK performance tips.
Tags: access time, burst, cache, dpdk, driver, ethernet, hub, hugepage, ip, kernel, lcore, linux, memory, pmd, polling, rss, softswitch, switch, userspace, xeon
Apache Pulsar Development 101 with Python - Timothy Spann
Apache Pulsar Development 101 with Python PS2022_Ecosystem_v0.0
There is always the fear that a speaker cannot make it, so since I was the MC for the ecosystem track, I put together a talk just in case.
Here it is. Never seen or presented.
The Raspberry Pi is a series of credit card–sized single-board computers developed in the UK by the Raspberry Pi Foundation with the intention of promoting the teaching of basic computer science in schools.
The original Raspberry Pi and Raspberry Pi 2 are manufactured in several board configurations through licensed manufacturing agreements with Newark element14 (Premier Farnell), RS Components and Egoman. These companies sell the Raspberry Pi online. Egoman produces a version for distribution solely in China and Taiwan, which can be distinguished from other Pis by their red colouring and lack of FCC/CE marks. The hardware is the same across all manufacturers.
The original Raspberry Pi is based on the Broadcom BCM2835 system on a chip (SoC), which includes an ARM1176JZF-S 700 MHz processor, VideoCore IV GPU, and was originally shipped with 256 megabytes of RAM, later upgraded (models B and B+) to 512 MB. The system has Secure Digital (SD) (models A and B) or MicroSD (models A+ and B+) sockets for boot media and persistent storage.
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar - Timothy Spann
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
In this session I will get you started with real-time cloud native streaming programming with Java, Golang, Python and Apache NiFi. If there’s a preferred language that the attendees pick, we will focus only on that one. I will start off with an introduction to Apache Pulsar and setting up your first easy standalone cluster in docker. We will then go into terms and architecture so you have an idea of what is going on with your events. I will then show you how to produce and consume messages to and from Pulsar topics. As well as using some of the command line and REST interfaces to monitor, manage and do CRUD on things like tenants, namespaces and topics. We will discuss Functions, Sinks, Sources, Pulsar SQL, Flink SQL and Spark SQL interfaces. We also discuss why you may want to add protocols such as MoP (MQTT), AoP (AMQP/RabbitMQ) or KoP (Kafka) to your cluster. We will also look at WebSockets as a producer and consumer. I will demonstrate a simple web page that sends and receives Pulsar messages with basic JavaScript. After this session you will be able to build simple real-time streaming and messaging applications with your chosen language or tool of your choice.
apache pulsar
tim spann developer advocate
streamnative
datainmotion.dev
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77) - Igalia
By Andy Wingo.
Snabb is an open-source toolkit for building fast, flexible network functions. Since its beginnings in 2012, Snabb has seen some modest deployment success ranging from simple one-off diagnosis tools to border routers that process all IPv4 traffic for entire countries. This talk will give an introduction to Snabb. After going over Snabb's fundamental components and how they combine, the talk will move on to examples of how network engineers are taking advantage of Snabb in practice, mentioning a few of the many open-source network functions built on Snabb.
(c) RIPE 77
15 - 19 October 2018
Amsterdam, Netherlands
https://ripe77.ripe.net
We hosted a live workshop on Sep 22, 2015. These are the slides requested by the audience. If you want to hear the replay AND see the live demonstration, follow the link below.
Open NX-OS: Maximize Programmability with NX-API REST
http://cs.co/9008BKHIi
Thanks for watching, Robb, @robbboyd
Helpful links:
Nexus Programmability Guide:
http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/7-x/programmability/guide/b_Cisco_Nexus_9000_Series_NX-OS_Programmability_Guide_7x.html
NX API REST Object Model Specification: http://developer.cisco.com
Chef Enterprise Agent (Supermarket): http://supermarket.chef.io
Cisco Chef Cookbook GitHub: https://github.com/cisco/cisco-network-chef-cookbook
Puppet Enterprise Agent (Puppetforge): http://forge.puppetlabs.com
Cisco Puppet Module GitHub: https://github.com/cisco/cisco-network-puppet-module
Cisco Software Repository: https://devhub.cisco.com/artifactory/open-nxos
Nexus 3/9K Datacenter GitHub Repository (NX-API, Ansible, NXAPI REST Scripting Examples, etc): http://github.com/datacenter/nexus9000
Custom Application Integration SDK: http://devhub.cisco.com/artifactory/open-nxos
Virtual vN9K, Please contact your Cisco AM to request access currently
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021 - StreamNative
Today we will span from edge to any and all clouds to support data collection, real-time streaming, sensor ingest, edge computing, IoT use cases and edge AI. Apache Pulsar allows us to build computing at the edge and produce and consume messages at scale in any IoT, hybrid or cloud environment. Apache Pulsar supports MoP which allows for MQTT protocol to be used for high speed messaging.
We will teach you to quickly build scalable open source streaming applications regardless of if you are running in containers, pods, edge devices, VMs, on-premise servers, moving vehicles and any cloud.
Similar to UCX-Python - A Flexible Communication Library for Python Applications (20)
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf - Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Welcome to ViralQR, your best QR code generator - ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since its inception, we have successfully served many clients by offering QR codes for marketing, service delivery, and feedback collection across various industries. Our platform has been recognized for its ease of use and the features that help businesses create QR codes.
Our Services
At ViralQR, here is a comprehensive suite of services that caters to your very needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an excellent opportunity for new users to get a feel for the platform before subscribing and experiencing the full range of dynamic QR codes. The subscription plans are priced very flexibly so that practically every business can afford to benefit from our service.
Why choose us?
ViralQR will provide services for marketing, advertising, catering, retail, and the like. The QR codes can be posted on fliers, packaging, merchandise, and banners, as well as to substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools in light of having a view of the core values of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we have an offer of nothing but the best in terms of QR code services to meet business diversity!
3. 3
WHY PYTHON-BASED GPU COMMUNICATION?
Python use growing
Extensive libraries
Python in Data science/HPC is growing
+ GPU usage and communication needs
4. 4
IMPACT ON DATA SCIENCE
RAPIDS uses dask-distributed for data distribution over python sockets
=> slows down all communication-bound components
Critical to enable dask with the ability to leverage IB, NVLINK
[Stack diagram: CUDA, Python, and Apache Arrow underpin Dask, deep learning frameworks (cuDNN), and RAPIDS (cuDF, cuML, cuGraph). Courtesy RAPIDS Team.]
5. 5
CURRENT COMMUNICATION DRAWBACKS
Existing Python communication modules primarily rely on sockets
Low latency / high bandwidth critical for better system utilization of GPUs (eg: NVLINK, IB)
Frameworks that transfer GPU data between sites make copies (sketched below)
But CUDA-aware data movement is largely solved in HPC!
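The copy cost mentioned above shows up in the pattern socket-based frameworks follow today: device memory is staged through host memory before it ever reaches the wire. A minimal sketch of that pattern (CuPy provides the device array; the length-prefixed framing is illustrative, not any particular framework's protocol):

import pickle
import socket

import cupy as cp

def send_gpu_array_over_socket(sock: socket.socket, arr: cp.ndarray) -> None:
    # The GPU array is first copied to host memory, then serialised, then
    # pushed through a TCP socket -- two extra traversals of the data that
    # a CUDA-aware transport avoids.
    host_copy = cp.asnumpy(arr)                    # device -> host copy
    payload = pickle.dumps(host_copy)              # host-side serialisation
    sock.sendall(len(payload).to_bytes(8, "big"))  # simple length-prefixed framing
    sock.sendall(payload)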
6. 6
REQUIREMENTS AND RESTRICTIONS
Dask – popular framework that facilitates scaling Python workloads to many nodes
Permits use of CUDA-based Python objects
Allows workers to be added and removed dynamically
Communication backend built around coroutines (more later; see the sketch after the mpi4py comparison below)
Why not use mpi4py then?
Dimension mpi4py
CUDA-Aware? No - Makes GPU<->CPU copies
Dynamic scaling? No - Imposes MPI restrictions
Coroutine support? No known support
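To make the coroutine requirement concrete, here is a minimal asyncio sketch of the shape such a backend takes: every transfer is a coroutine, so awaiting network I/O yields the event loop to other work, and peers can connect dynamically. Plain asyncio streams stand in for the UCX transport; nothing below is UCX-specific.

import asyncio

async def handle_peer(reader: asyncio.StreamReader,
                      writer: asyncio.StreamWriter) -> None:
    # One coroutine per connection: awaiting I/O yields control, so other
    # transfers and scheduler work keep progressing on the same event loop.
    size = int.from_bytes(await reader.readexactly(8), "big")
    payload = await reader.readexactly(size)
    writer.write(len(payload).to_bytes(8, "big"))
    writer.write(payload)                          # echo the message back
    await writer.drain()
    writer.close()

async def main() -> None:
    # Dynamic scaling: the listener accepts new peers at any time.
    server = await asyncio.start_server(handle_peer, "127.0.0.1", 13337)
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", 13337)
        msg = b"hello from a coroutine"
        writer.write(len(msg).to_bytes(8, "big"))
        writer.write(msg)
        await writer.drain()
        size = int.from_bytes(await reader.readexactly(8), "big")
        assert await reader.readexactly(size) == msg
        writer.close()

asyncio.run(main())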
7. 7
GOALS
Provide a flexible communication library that:
1. Supports CPU/GPU buffers over a range of message types
- raw bytes, host objects/memoryview, cupy objects, numba objects
2. Supports Dynamic connection capability
3. Supports Pythonic programming using futures, coroutines, etc. (if needed)
4. Provides close to native performance from python world
How? – Cython, UCX
9. 9
WHY UCX?
Popular unified communication library used for MPI/PGAS implementations such as OpenMPI, MPICH, OSHMEM, etc
Exposes API for:
Client-server based connection establishment
Point-to-point, RMA, atomics capabilities
Tag matching
Callbacks on communication events
Blocking/Polling progress
Cuda-Aware Point-to-point communication
C library!
10. 10
PYTHON BINDING APPROACHES
Three main considerations:
SWIG, CFFI, Cython
Problems with SWIG, CFFI
Works well for small examples but not for C libraries
UCX's structure definitions aren't consolidated
Tedious to populate interface file / python script by hand
11. 11
CYTHON
Call C functions and use C structures from Cython code (.pyx)
Expose classes and functions to Python that use C underneath
ucx_echo.py (user script):
    ...
    ep = ucp.get_endpoint()
    ep.send_obj(…)
    …

ucp_py.pyx (defined in the UCX-PY module):
    cdef class ucp_endpoint:
        def send_obj(…):
            ucp_py_send_nb(…)

ucp_py_ucp_fxns.c (defined in the UCX-PY module; ucp_tag_send_nb is defined in the UCX C library):
    struct ctx *ucp_py_send_nb()
    {
        ucp_tag_send_nb(…)
    }
17. 17
UCX-PY DATA MOVEMENT API
Send data (on endpoint)
.send_*() : raw bytes, host objects (numpy), cuda objects (cupy, numba)
Receive data (on endpoint)
.recv_obj() : pass an object as argument where data is received
.recv_future() ‘blind’ : no input; returns received object; low performance
async def accept_cb(ep, …):
    …
    await ep.send_obj(cupy.array([42]))
    …

async def talk_to_client():
    ep = ucp.get_endpoint(ip, port)
    …
    rr = await ep.recv_future()
    msg = ucp.get_obj_from_msg(rr)
    …
18. 18
UCX-PY DATA MOVEMENT SEMANTICS
Send/Recv operations are non-blocking by default
Issuing the operation returns a future (see the short sketch below)
Calling await on the future or calling future.result() blocks until completion
Caveat - Limited number of object types tested
memoryview, numpy, cupy, and numba
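A short sketch of the two completion styles described above, using the endpoint methods shown on the previous slide; the exact signatures of these early bindings are assumed from the slides rather than guaranteed:

import numpy as np

async def send_both_ways(ep):
    buf = np.arange(10)

    # Style 1: issue the non-blocking send, then await the returned future.
    fut = ep.send_obj(buf)
    await fut

    # Style 2: issue the send, then block the caller via the future's result().
    fut = ep.send_obj(buf)
    fut.result()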
19. 19
UNDER THE HOOD
UCX depends on event notification to keep the main thread from constantly polling.

Layer / UCX calls:
- Connection management: ucp_{listener/ep}_create
- Issuing data movement: ucp_tag_{send/recv/probe}_nb
- Request progress: ucp_worker_{arm/signal/progress}

For blocking progress, UCX-PY relies on the UCX event-notification mechanism: completion queue events from the IB event channel and read/write events from sockets (a sketch of this loop follows).
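Put together, the table and the notification path above amount to the classic UCX arm-and-wait loop: progress the worker until it goes idle, arm it, then sleep on its event file descriptor until an IB completion-channel or socket event fires. A hedged, Python-flavoured sketch of that loop; the worker object and its methods are hypothetical stand-ins for the ucp_worker_* calls listed above, not a real UCX-PY API.

import selectors

def blocking_progress_loop(worker, stop_requested) -> None:
    # Hypothetical sketch of blocking progress: drain, arm, sleep, repeat.
    sel = selectors.DefaultSelector()
    sel.register(worker.event_fd(), selectors.EVENT_READ)  # worker's event fd

    while not stop_requested():
        # Drain outstanding work first; arming only succeeds on an idle worker.
        while worker.progress():
            pass
        if not worker.arm():       # events raced in; go back and progress again
            continue
        sel.select()               # block until IB/socket events wake us up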
26. 26
NEXT STEPS
Performance-validate dask-distributed over UCX-PY with dask-cuda workloads
Objects that have mixed physical backing (CPU and GPU)
Adding blocking support to NVLINK-based UCT
Non-contiguous data transfers
Integration into dask-distributed underway
(https://github.com/TomAugspurger/distributed/commits/ucx+data-handling)
Current implementation (https://github.com/Akshay-Venkatesh/ucx/tree/topic/py-bind)
Push to UCX project underway (https://github.com/openucx/ucx/pull/3165)
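For readers following the dask-distributed integration linked above: in later releases that work surfaced as a selectable communication protocol, so a cluster can be asked to speak UCX instead of TCP. A sketch of what that looks like with a recent dask-cuda release (these parameter names reflect today's dask-cuda API, not the branches referenced on this slide):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Ask the scheduler and workers to communicate over UCX (InfiniBand / NVLink)
# instead of the default TCP sockets.
cluster = LocalCUDACluster(protocol="ucx",
                           enable_infiniband=True,
                           enable_nvlink=True)
client = Client(cluster)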
27. 27
SUMMARY
UCX-PY is a flexible communication library
Provides python developers a way to leverage high-speed interconnects like IB
Can support a Pythonic style of overlapping communication with other coroutines
Or can be non-overlapped like in traditional HPC
Can support data-movement of objects residing on CPU memory or on GPU memory
users needn’t explicitly copy GPU<->CPU
UCX-PY is close to native performance across the major range of use cases
28. 28
BIG PICTURE
UCX-PY will serve as a high-performance communication module for dask
[Stack diagram: the same stack as before with UCX-PY added: CUDA, Python, Apache Arrow, Dask, deep learning frameworks (cuDNN), RAPIDS (cuDF, cuML, cuGraph), and UCX-PY.]
Editor's Notes
Ease of use
Python was language of the year 2018
Data science is one of the most sought-after areas. Large investments