Dask Tutorial at PyConDE / PyData Karlsruhe 2018. These were the introductory slides, which mainly contain the link to Matthew Rocklin's Dask workshop at PyData NYC 2018, on which this workshop was based.
In this talk, I gave an overview of Dask and Kubernetes and how they can be used together to perform various tasks such as data wrangling, Monte Carlo simulations, machine learning, and hyper-parameter optimisation.
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Spark Summit
The talk will present an MPI-based extension of the Spark platform developed in the context of light source facilities. The background and rationale of this extension are described in the attached paper “Bringing the HPC reconstruction algorithms to Big Data platforms”[1], which was presented at the New York Scientific Data Summit (NYSDS), August 14-17, 2016 (talk: https://www.bnl.gov/nysds16/files/pdf/talks/NYSDS16%20Malitsky.pdf). Specifically, the paper highlighted a gap between two modern driving forces of the scientific discovery process: HPC and Big Data technologies. As a result, it proposed to extend the Spark platform with inter-worker communication for supporting scientific-oriented parallel applications. The approach was illustrated in the context of the Spark-based deployment of the SHARP MPI/GPU ptychographic solver. Aside from its practical value, this application represents a reference use case that captures the major technical aspects of other reconstruction tasks. In the NYSDS’16 paper, the implemented approach followed the CaffeOnSpark RDMA peer-to-peer model and augmented it with an RDMA address exchange server. By the Spark Summit, we plan to further advance this direction with the Spark-MPI generic solution based on the Hydra process management framework for supporting two major MPI implementations, MPICH and MVAPICH.
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkSpark Summit
Many data scientists are already making heavy usage of the Jupyter ecosystem for analyzing data using interactive notebooks.
Apache Toree (incubating) is a Jupyter kernel designed to act as a gateway to Spark by enabling users to use Spark from standard Jupyter notebooks. This allows users to easily integrate Spark into their existing Jupyter deployments and to move between languages and contexts without needing to switch to a different set of tools.
Apache Toree is designed expressly for interactive work. It supports interpreters in Scala, Python, and R.
In this talk, I will cover the design of Toree, how it interacts with the Jupyter ecosystem and various ways in which users can extend the functionality of Apache Toree via a powerful plugin system.
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will present Lens (https://github.com/asidatascience/lens), a Python package which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
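For illustration, here is a minimal sketch (not Lens' actual API; the CSV glob and columns are placeholders) of how such per-column summaries can be computed lazily and in parallel with dask.dataframe:

```python
# Minimal sketch of parallel column summaries with dask.dataframe.
# The CSV glob is a placeholder; Lens' real implementation differs.
import dask
import dask.dataframe as dd

# Lazily read the dataset in partitions; this only builds a task graph.
df = dd.read_csv("measurements-*.csv")

# Describe numeric columns and count missing values per column.
stats = df.describe()
null_counts = df.isnull().sum()

# Execute both computations in a single pass over the partitions.
stats, null_counts = dask.compute(stats, null_counts)
print(stats)
print(null_counts)
```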
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Databricks
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option to productionize Data Science applications has become available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a YARN cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours in a single-server, single-threaded setup. With moderate effort we have been able to reduce that number to 15 minutes with SparkR. And we will show how we plan to further reduce this to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
The presentation consists of an amazing bundle of pro tips and tricks for building an insanely scalable Apache Spark and Spark Streaming based data pipeline.
The presentation consists of 4 parts:
* Quick intro to Spark
* N-billion rows/day system architecture
* Data Warehouse and Messaging
* How to deploy Spark so it does not backfire
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold talks about the underlying techniques used to achieve high-performance sorting using Spark and Scala, among which are sun.misc.Unsafe, exploiting cache locality, and high-level resource pipelining.
Deep Learning to Production with MLflow & RedisAIDatabricks
Taking deep learning models to production and doing so reliably is one of the next frontiers of MLOps. With the advent of Redis modules and the availability of C APIs for the major deep learning frameworks, it is now possible to turn Redis into a reliable runtime for deep learning workloads, providing a simple solution for a model serving microservice. RedisAI ships with several cool features such as support for multiple frameworks, CPU and GPU backends, auto-batching, and DAG execution, and will soon include automatic monitoring abilities. In this talk, we'll explore some of these features of RedisAI and see how easy it is to integrate MLflow and RedisAI to build an efficient productionization pipeline.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
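As a rough illustration of the "add structure as early as possible" idea (not GoPro's actual code; the topic, broker and paths are made up), one could infer a schema from a sample of events and apply it to the stream with PySpark Structured Streaming:

```python
# Hedged sketch: infer a schema from sampled JSON events, then apply it to a
# Kafka stream so downstream consumers see structured columns immediately.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.appName("dynamic-ddl-sketch").getOrCreate()

# Let the data providers dictate the structure: infer it from a recent sample.
sample_schema = spark.read.json("s3://bucket/events/sample/").schema

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                       # placeholder topic
       .load())

# Parse the JSON payload with the inferred schema into proper columns.
events = (raw.select(from_json(col("value").cast("string"), sample_schema).alias("e"))
             .select("e.*"))

# Persist as queryable files so data scientists can reach the data via SQL.
query = (events.writeStream.format("parquet")
         .option("path", "s3://bucket/events/structured/")
         .option("checkpointLocation", "s3://bucket/checkpoints/events/")
         .start())
```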
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
Apache Spark Performance is too hard. Let's make it easierDatabricks
Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.
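As a small, self-contained illustration (not taken from the talk), the PySpark snippet below shows how user code maps onto that model: the column derivation is a narrow transformation executed as tasks within one stage, while the aggregation forces a shuffle and therefore a new stage, which you can verify in the query plan and the Spark UI.

```python
# Narrow vs. wide transformations: withColumn stays in one stage,
# groupBy/agg introduces an Exchange (shuffle) and a new stage.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("stages-example").getOrCreate()

df = spark.range(0, 10_000_000)                      # partitioned source
enriched = df.withColumn("bucket", col("id") % 100)  # narrow: same stage
result = enriched.groupBy("bucket").agg(avg("id"))   # wide: shuffle boundary

result.explain()  # the Exchange operator marks the shuffle
result.show(5)
```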
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
Learn about the Big Data Processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures that are used at Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus, I'll touch on some unconventional use cases, contrary to typical warehousing/analytics solutions, that are being served by Apache Spark.
Today, most any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today and some that may be available tomorrow, including:
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
Speaker
Thomas Phelan, Chief Architect, Blue Data, Inc
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed; having one huge server won’t do the job, and the ability to scale out will be your savior. Apache Spark is a fast and general engine for big data processing, with streaming, SQL, machine learning and graph processing. This talk shows the basics of Apache Spark and distributed computing.
Demi is a Software engineer, Entrepreneur and an International Tech Speaker.
Demi has over 10 years of experience in building various systems, from near-real-time applications to Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Big Data Expert, but interested in all kinds of technologies, from front-end to backend, whatever moves data around.
Lessons Learned from Dockerizing Spark WorkloadsBlueData, Inc.
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark.
Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers – with a specific focus on Spark applications. They’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker.
This session at Spark Summit in February 2017 (by Thomas Phelan, co-founder and chief architect at BlueData) described lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.
In this session, Tom described how to network Docker containers across multiple hosts securely. He discussed ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So Tom discussed some of the storage options that BlueData explored and implemented to achieve near bare-metal I/O performance for Spark using Docker.
https://spark-summit.org/east-2017/events/lessons-learned-from-dockerizing-spark-workloads
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
Scala and Spark are each great tools for data processing and they work well together. They can process data via small simple interactive queries as well as in very large highly-available and scalable production systems. They provide an integrated framework for an ever-growing range of data processing capabilities. We examine the reasons for this and also look at a couple of simple data processing examples written in Scala. Presented by John Nestor, Sr. Architect at 47 Degrees.
Integrating Deep Learning Libraries with Apache SparkDatabricks
The combination of deep learning with Apache Spark has the potential to make a huge impact. Joseph Bradley and Xiangrui Meng share best practices for integrating popular deep learning libraries with Apache Spark. Rather than comparing deep learning systems or specific optimizations, Joseph and Xiangrui focus on issues that are common to many deep learning frameworks when running on a Spark cluster, such as optimizing cluster setup and data ingest (clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker), configuring the cluster (setting up pipelines for efficient data ingest improves job throughput), and monitoring long-running jobs (interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs). Joseph and Xiangrui then demonstrate the techniques using Google’s popular TensorFlow library.
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Spark Summit
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark.
Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. At BlueData, we’re “all in” on Docker containers – with a specific focus on Spark applications. We’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker.
In this session, you’ll learn about networking Docker containers across multiple hosts securely. We’ll discuss ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So we’ll discuss some of the storage options we explored and implemented at BlueData to achieve near bare-metal I/O performance for Spark using Docker. We’ll share our lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark, which makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources including HDFS, Hive, JSON, and S3.
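A minimal taste of that Python API (the bucket path is a placeholder), showing that the same few lines work interactively in a notebook or as a batch job on a cluster:

```python
# A first taste of the Spark Python API: read JSON from object storage,
# register a view and query it with SQL. The bucket path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

clicks = spark.read.json("s3a://my-bucket/clicks/*.json")
clicks.createOrReplaceTempView("clicks")

top_countries = spark.sql("""
    SELECT country, COUNT(*) AS n
    FROM clicks
    GROUP BY country
    ORDER BY n DESC
    LIMIT 10
""")
top_countries.show()
```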
Jeremy Nixon, Machine Learning Engineer, Spark Technology Center at MLconf AT...MLconf
Convolutional Neural Networks at scale in Spark MLlib:
Jeremy Nixon will focus on the engineering and applications of a new algorithm built on top of MLlib. The presentation will focus on the methods the algorithm uses to automatically generate features to capture nonlinear structure in data, as well as the process by which it’s trained. Major aspects of that include compositional transformations over the data, convolution, and distributed backpropagation via SGD with adaptive gradients and an adaptive learning rate. Applications will look into how to use convolutional neural networks to model data in computer vision, natural language and signal processing. Details around optimal preprocessing, the type of structure that can be learned, and managing its ability to generalize will inform developers looking to apply nonlinear modeling tools to problems that they face.
In the last decade, Apache Parquet has become the standard format to store tabular data on disk regardless of the technology stack used. This is due to its read/write performance, efficient compression technology, interoperability and especially outstanding performance with the default settings.
While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance, smaller files, and utilise Parquet's newer partial reading features to read even smaller subsets of a file for a given query.
This talk aims to provide insight into the Parquet format and its recent developments that are useful for end users' daily workflows. The only prior knowledge needed is knowing what a DataFrame (tabular data) is.
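As a hedged sketch of the partial-reading features mentioned above (the file, column and filter names are placeholders), pyarrow.parquet lets you read only the columns you need and push filters down so that irrelevant parts of the file are skipped:

```python
# Column projection and predicate push-down with pyarrow.parquet.
import pyarrow.parquet as pq

# Read only two columns instead of the whole table.
subset = pq.read_table("events.parquet", columns=["user_id", "amount"])

# Push a predicate down so non-matching row groups / partitions are skipped.
recent = pq.read_table(
    "events.parquet",
    columns=["user_id", "amount"],
    filters=[("year", ">=", 2020)],
)
print(subset.num_rows, recent.num_rows)
```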
pandas.(to/from)_sql is simple but not fastUwe Korn
Pandas provides convenience methods to read from and write to databases using to_sql and read_sql. They provide great usability and a uniform interface for all databases that support an SQLAlchemy connection. Sadly, this layer of convenience also introduces a performance loss. Luckily, for a lot of databases, a performant access layer is available.
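The convenience interface in question looks roughly like this (shown here against SQLite; any SQLAlchemy engine URL works the same way), which is exactly where the generic row-by-row path can become the bottleneck for larger tables:

```python
# pandas' uniform database interface via SQLAlchemy: easy, but not the
# fastest path for bulk loads on many databases.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df.to_sql("measurements", engine, if_exists="replace", index=False)

round_trip = pd.read_sql("SELECT * FROM measurements WHERE value > 0.1", engine)
print(round_trip)
```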
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsUwe Korn
As Data Scientists/Engineers in Python, we focus our work on solving problems with large amounts of data while staying in Python. This is where we are most effective and feel comfortable. Libraries like Pandas and NumPy provide us with efficient interfaces to deal with this data while still getting optimal performance. The main problem appears when we have to deal with systems outside of our comfortable ecosystem. We need to write cumbersome and mostly slow conversion code to ingest data from there into our pipeline before we can work efficiently. Using Apache Arrow and Parquet as base technologies, we get a set of tools that eases this interaction and also brings us a huge performance improvement. As part of the talk we will show a basic example where we take data coming from a Java application into Python using these tools.
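A tiny illustration of that exchange pattern (the file name is a placeholder and the JVM producer is assumed, not shown): the Java side writes Parquet, and the Python side loads it through Arrow into pandas without any hand-written conversion code:

```python
# Read a Parquet file (assumed to have been written by a Java application)
# into an Arrow table and convert it to a pandas DataFrame.
import pyarrow.parquet as pq

table = pq.read_table("exported_from_java.parquet")
df = table.to_pandas()
print(df.dtypes)
print(df.head())
```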
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Uwe Korn
In the space of building products with data, whether by dealing with huge amounts of data or by applying machine learning, many different ecosystems meet, and large volumes of data have to be passed between them. Handling the data is not only about the divide between systems written in Java and the machine learning models in Python: once you take into account that you want to integrate with the existing business infrastructure, you also need to cater for legacy systems, and you need to bring large volumes of data to the user via UIs.
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...Uwe Korn
How Apache Arrow and Apache Parquet are helpful technologies to connect the Python Data ecosystem to other landscapes such as the Java/Scala based Big Data ecosystem.
ApacheCon Europe Big Data 2016 – Parquet in practice & detailUwe Korn
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it has the capabilities to push-down analytical queries on the data to the I/O layer to avoid the loading of nonrelevant data chunks. With various Java and a C++ implementation, Parquet is also the perfect choice to exchange data between different technology stacks.
As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyUwe Korn
Apache Arrow's promise was to reduce the (serialization & copy) overhead of working with columnar data between different systems. Using the latest Pandas release and Arrow's ability to share memory between the JVM and Python as ingredients, we demonstrate that Arrow can fulfill this bold statement. The performance benefits of this will be shown using a typical data engineering use-case that produces data in the JVM and then passes it on to a Python-based machine learning model.
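The JVM side of that setup is beyond a short snippet, but the Python half of the idea can be sketched as follows (an assumption-laden illustration, not the talk's code): once data lives in Arrow memory, NumPy/pandas can view it without copying, and the zero_copy_only flag makes the conversion fail loudly if a copy would be required.

```python
# Sketch of the Python side only: view Arrow memory from NumPy/pandas
# without a copy. This does not reproduce the JVM sharing from the talk.
import pandas as pd
import pyarrow as pa

arr = pa.array([1.0, 2.0, 3.0, 4.0])

# Raises if a copy would be needed (e.g. when nulls are present).
values = arr.to_numpy(zero_copy_only=True)

series = pd.Series(values)  # pandas now operates directly on the Arrow buffer
print(series.sum())
```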
Extending Pandas using Apache Arrow and NumbaUwe Korn
With the latest release of Pandas the ability to extend it with custom dtypes was introduced. Using Apache Arrow as the in-memory storage and Numba for fast, vectorized computations on these memory regions, it is possible to extend Pandas in pure Python while achieving the same performance of the built-in types. In the talk we implement a native string type as an example.
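Implementing a full ExtensionArray is beyond a short example, but the core performance idea can be sketched (assumptions: numeric data without nulls, and a hypothetical count_greater kernel): take a zero-copy NumPy view of the Arrow memory and run a Numba-compiled loop over it.

```python
# Sketch of the core idea: Numba-compiled computation over Arrow memory.
import numba
import numpy as np
import pyarrow as pa

@numba.njit
def count_greater(values, threshold):
    # Compiled loop over the buffer, comparable in speed to built-in dtypes.
    n = 0
    for v in values:
        if v > threshold:
            n += 1
    return n

arr = pa.array(np.random.rand(1_000_000))       # Arrow-backed column
values = arr.to_numpy(zero_copy_only=True)      # zero-copy NumPy view
print(count_greater(values, 0.5))
```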
PyData Amsterdam 2018 – Building customer-visible data science dashboards wit...Uwe Korn
There are several tools to build ML dashboards and visualisations. Their focus is often on making it as simple as possible for a (Python) data scientist. Shipping them as part of our product means that other roles, like frontend developers, get involved. Aspects that ease development for one role create pains for others. We want to show how to balance this using Altair, Vega and Vue.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand and the evolution of supply to change, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Scalable Scientific Computing with Dask
1. 1
PyCon.DE / PyData Karlsruhe 2018
Uwe L. Korn
Scalable Scientific Computing with
Dask
2. 2
• Senior Data Scientist at Blue Yonder
(@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy
focus around Pandas
About me
xhochy
mail@uwekorn.com
3. 3
• Execution and definition of task graphs
• a parallel computing library that scales the existing Python ecosystem.
• scales down to your laptop
• scales up to a cluster
What is Dask?
4. 4
• multi-core and distributed parallel execution
• low-level: task schedulers for computation graphs
• high-level: Array, Bag and DataFrame
More than a single CPU
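For reference, a tiny example (not part of the slides) of that low-level interface: dask.delayed builds a task graph from ordinary Python functions, which a scheduler then executes on threads, processes or a cluster.

```python
# Build a small task graph with dask.delayed and execute it in parallel.
from dask import delayed

@delayed
def load(i):
    return list(range(i))

@delayed
def clean(data):
    return [x for x in data if x % 2 == 0]

@delayed
def summarize(parts):
    return sum(len(p) for p in parts)

parts = [clean(load(i)) for i in range(4)]  # graph is only described here
total = summarize(parts)
print(total.compute())                      # the scheduler runs the graph
```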
5. 5
Dask is
• More light-weight
• In Python, operates well with C/C++/Fortran/LLVM or other natively
compiled code
• Part of the Python ecosystem
What about Spark?
6. 6
Spark is
• Written in Scala and works well within the JVM
• Python support is very limited
• Brings its own ecosystem
• Able to provide more high-level optimizations
What about Spark?