- A brief introduction to Spark Core
- Introduction to Spark Streaming
- A Demo of Streaming by evaluation top hashtags being used
- Introduction to Spark MLlib
- A Demo of MLlib by building a simple movie recommendation engine
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
In machine learning projects, the preparation of large datasets is a key phase which can be complex and expensive. It was traditionally done by data engineers before the handover to data scientists or ML engineers. They operated in different environments due to the differences in the tools, frameworks and runtimes required in each phase. Spark's support for different types of workloads brought data engineering closer to the downstream activities like machine learning that depended on the data. Unifying data acquisition, preprocessing, training models and batch inferencing under a single platform enabled by Spark not only provided seamless experience between different phases and helped accelerate the end-to-end ML lifecycle but also lowered the TCO in the building, managing the infrastructure to cover different phases. With that, the needs of a shared infrastructure expanded to include specialized hardware like GPUs and support deep learning workloads as well. Spark can effectively make use of such infrastructure as it integrates with popular deep learning frameworks and supports acceleration of deep learning jobs using GPUs. In this talk, we share learnings and experiences in supporting different types of workloads in shared clusters equipped for doing deep learning as well as data engineering. We will cover the following topics: * Considerations for sharing the infrastructure for big data and deep learning in Spark * Deep learning in Spark in clusters with and without GPUs * Differences between distributed data processing and distributed machine learning * Multitenancy and isolation in shared infrastructure.
https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=97
End-to-End Data Pipelines with Apache SparkBurak Yavuz
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event on showcasing real world implementations of working with Spark within the context of your Big Data Infrastructure.
Session are demo heavy and slide light focusing on getting your development environments up and running including getting up and running, configuration issues, SparkSQL vs. Hive, etc.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications should move from the science lab to the production environment of large enterprises a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR an exciting new option to productionize Data Science applications has been made available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R applications that took over 20 hours on a single server/single-threaded setup. With moderate effort we have been able to reduce that number to 15 minutes with SparkR. And we will show how we plan to further reduces this to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
Spark Summit EU talk by Ahsan Javed AwanSpark Summit
The document discusses performance characterization of in-memory data analytics using Apache Spark on a scale-up server. It identifies problems like poor multicore scalability, thread load imbalance, I/O wait times, and GC overhead. Solutions proposed include NUMA awareness, hyperthreading, disabling next-line prefetchers, using parallel scavenge GC, multiple small executors, and a future node architecture based on a hybrid in-storage processing and 2D processing-in-memory design. The work aims to improve node-level performance through architecture support for emerging big data workloads.
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
Zeus is an efficient, highly scalable and distributed shuffle as a service which is powering all Data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in industry which leads to many issues such as hardware failures (Burn out Disks), reliability and scalability challenges.
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
We present the Azure Cognitive Services on Spark, a simple and easy to use extension of the SparkML Library to all Azure Cognitive Services. This integration allows Spark Users to embed cloud intelligence directly into their spark computations, enabling a new generation of intelligent applications on Spark. Furthermore, we show that with our new Containerized Cognitive Services, one can embed cloud intelligence directly into the Spark cluster for ultra-low latency, on-prem, and offline applications. We show how using our Integration, one can compose these cognitive services with other services, SQL computations, and Deep Networks to create sophisticated and intelligent heterogenous applications. Moreover, we show how to redeploy these compositions as Restful Services with Spark Serving. We will also explore the architecture of these contributions which leverage HTTP on Spark, a novel integration between Spark with the widely used Hypertext Transfer Protocol (HTTP). This library can integrate any framework into the Spark ecosystem that is capable of communicating through HTTP. Finally, we demonstrate how to use these services to create a large class of intelligent applications such as custom search engines, realtime facial recognition systems, and unsupervised object detectors.
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Databricks
This document provides an overview of Spark: Data Science as a Service by Sridhar Alla and Kiran Muglurmath of Comcast. It discusses Comcast's data science challenges due to massive data size and lack of scalable architecture. It introduces Roadrunner, Comcast's solution built on Spark, which provides a centralized processing system with SQL and machine learning capabilities to enable data ingestion, quality checks, feature engineering, modeling and workflow management. Roadrunner is accessed through REST APIs and helps multiple teams work with the same large datasets. Examples of transformations, joins, aggregations and anomaly detection algorithms demonstrated in Roadrunner are also included.
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
In machine learning projects, the preparation of large datasets is a key phase which can be complex and expensive. It was traditionally done by data engineers before the handover to data scientists or ML engineers. They operated in different environments due to the differences in the tools, frameworks and runtimes required in each phase. Spark's support for different types of workloads brought data engineering closer to the downstream activities like machine learning that depended on the data. Unifying data acquisition, preprocessing, training models and batch inferencing under a single platform enabled by Spark not only provided seamless experience between different phases and helped accelerate the end-to-end ML lifecycle but also lowered the TCO in the building, managing the infrastructure to cover different phases. With that, the needs of a shared infrastructure expanded to include specialized hardware like GPUs and support deep learning workloads as well. Spark can effectively make use of such infrastructure as it integrates with popular deep learning frameworks and supports acceleration of deep learning jobs using GPUs. In this talk, we share learnings and experiences in supporting different types of workloads in shared clusters equipped for doing deep learning as well as data engineering. We will cover the following topics: * Considerations for sharing the infrastructure for big data and deep learning in Spark * Deep learning in Spark in clusters with and without GPUs * Differences between distributed data processing and distributed machine learning * Multitenancy and isolation in shared infrastructure.
https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=97
End-to-End Data Pipelines with Apache SparkBurak Yavuz
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event on showcasing real world implementations of working with Spark within the context of your Big Data Infrastructure.
Session are demo heavy and slide light focusing on getting your development environments up and running including getting up and running, configuration issues, SparkSQL vs. Hive, etc.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications should move from the science lab to the production environment of large enterprises a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR an exciting new option to productionize Data Science applications has been made available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R applications that took over 20 hours on a single server/single-threaded setup. With moderate effort we have been able to reduce that number to 15 minutes with SparkR. And we will show how we plan to further reduces this to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
Spark Summit EU talk by Ahsan Javed AwanSpark Summit
The document discusses performance characterization of in-memory data analytics using Apache Spark on a scale-up server. It identifies problems like poor multicore scalability, thread load imbalance, I/O wait times, and GC overhead. Solutions proposed include NUMA awareness, hyperthreading, disabling next-line prefetchers, using parallel scavenge GC, multiple small executors, and a future node architecture based on a hybrid in-storage processing and 2D processing-in-memory design. The work aims to improve node-level performance through architecture support for emerging big data workloads.
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
Zeus is an efficient, highly scalable and distributed shuffle as a service which is powering all Data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in industry which leads to many issues such as hardware failures (Burn out Disks), reliability and scalability challenges.
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
We present the Azure Cognitive Services on Spark, a simple and easy to use extension of the SparkML Library to all Azure Cognitive Services. This integration allows Spark Users to embed cloud intelligence directly into their spark computations, enabling a new generation of intelligent applications on Spark. Furthermore, we show that with our new Containerized Cognitive Services, one can embed cloud intelligence directly into the Spark cluster for ultra-low latency, on-prem, and offline applications. We show how using our Integration, one can compose these cognitive services with other services, SQL computations, and Deep Networks to create sophisticated and intelligent heterogenous applications. Moreover, we show how to redeploy these compositions as Restful Services with Spark Serving. We will also explore the architecture of these contributions which leverage HTTP on Spark, a novel integration between Spark with the widely used Hypertext Transfer Protocol (HTTP). This library can integrate any framework into the Spark ecosystem that is capable of communicating through HTTP. Finally, we demonstrate how to use these services to create a large class of intelligent applications such as custom search engines, realtime facial recognition systems, and unsupervised object detectors.
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Databricks
This document provides an overview of Spark: Data Science as a Service by Sridhar Alla and Kiran Muglurmath of Comcast. It discusses Comcast's data science challenges due to massive data size and lack of scalable architecture. It introduces Roadrunner, Comcast's solution built on Spark, which provides a centralized processing system with SQL and machine learning capabilities to enable data ingestion, quality checks, feature engineering, modeling and workflow management. Roadrunner is accessed through REST APIs and helps multiple teams work with the same large datasets. Examples of transformations, joins, aggregations and anomaly detection algorithms demonstrated in Roadrunner are also included.
1. The document discusses using Lambda architecture principles to perform fast data analytics on streaming and batch data from IoT sources using Spark Streaming and MLlib.
2. A proposed smart parking use case would recommend the best parking garage by combining streaming GPS data from cars with batch updates on garage capacity, scoring each garage using machine learning models.
3. The Lambda architecture is implemented using Kafka to ingest streaming GPS updates and batch capacity updates, Spark Streaming and Spark SQL to prepare, transform, and join the data, and MLlib to score and rank the garages in real-time.
Introduction to Apache Spark Workshop at Lambda World 2015 on October 23th and 24th, 2015, celebrated in Cádiz. Speakers: @fperezp and @juanpedromoreno
Github Repo: https://github.com/47deg/spark-workshop
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodel, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. Spark ships with a Python interface, aka PySpark, however, because Spark’s runtime is implemented on top of JVM, using PySpark with native Python library sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem – Vectorized UDF. Vectorized UDF is built on top of Apache Arrow and bring you the best of both worlds – the ability to define easy to use, high performance UDFs and scale up your analysis with Spark.
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducability for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
Gurpreet Singh from Microsoft gave a talk on scaling Python for data analysis and machine learning using DASK and Apache Spark. He discussed the challenges of scaling the Python data stack and compared options like DASK, Spark, and Spark MLlib. He provided examples of using DASK and PySpark DataFrames for parallel processing and showed how DASK-ML can be used to parallelize Scikit-Learn models. Distributed deep learning with tools like Project Hydrogen was also covered.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
Performance Troubleshooting Using Apache Spark MetricsDatabricks
Luca Canali, a data engineer at CERN, presented on performance troubleshooting using Apache Spark metrics at the UnifiedDataAnalytics #SparkAISummit. CERN runs large Hadoop and Spark clusters to process over 300 PB of data from the Large Hadron Collider experiments. Luca discussed how to gather, analyze, and visualize Spark metrics to identify bottlenecks and improve performance.
Spark Summit EU talk by Christos ErotocritouSpark Summit
This document discusses Apache Ignite and how it can be used with Apache Spark for fast data applications. It provides an overview of Ignite's in-memory data fabric capabilities, how it compares to Spark, and how Ignite can be integrated with Spark to provide shared resilient storage and distributed computing. Examples are given of reading and writing data between Ignite and Spark and using Ignite's in-memory file system and SQL support from Spark.
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
Spark on YARN provides resource management and security features through YARN, but still has areas for improvement. Dynamic allocation in YARN allows Spark applications to grow and shrink executors based on task demand, though latency and data locality could be enhanced. Security supports Kerberos authentication and delegation tokens, but long-lived applications face token expiration issues and encryption needs improvement for control plane, shuffle files, and user interfaces. Overall, usability, security, and performance remain areas of focus.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsDatabricks
Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged providing cost, performance and power consumption advantages. Field programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well-known for high-performance dense-matrix, highly regular operations such as graphics processing and matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to inform decisions about if and how to accelerate.
This talk will characterize the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache SparkSQL, via services on Amazon Web Services (AWS) cloud. These solutions’ goal is providing Spark users high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of SparkSQL operations they accelerate well, how they accelerate them, and the implications for the user’s application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is a Open Source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well written code and good model design. In this talk I will walk through some of those problems and will present some best practices to prevent them, coming from lessons learned when putting things in production.
Apache Spark is a fast, general engine for large-scale data processing. It provides unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
Spark Summit 2016: Connecting Python to the Spark EcosystemDaniel Rodriguez
This document discusses connecting Python to the Spark ecosystem. It covers the PyData and Spark ecosystems, package management for Python libraries in a cluster, leveraging Spark with tools like Sparkonda and Conda, and using Python with Spark including with NLTK, scikit-learn, TensorFlow, and Dask. Future directions include Apache Arrow for efficient data structures and leveraging alternative computing frameworks like TensorFlow.
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
Spark Summit EU talk by Berni SchieferSpark Summit
This document summarizes experiences using the TPC-DS benchmark with Spark SQL 2.0 and 2.1 on a large cluster designed for Spark. It describes the configuration of the "F1" cluster including its hardware, operating system, Spark, and network settings. Initial results show that Spark SQL 2.0 provides significant improvements over earlier versions. While most queries completed successfully, some queries failed or ran very slowly, indicating areas for further optimization.
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
Scalable Deep Learning Platform On Spark In BaiduJen Aman
This document summarizes Baidu's work on scaling deep learning using Spark. It discusses:
1) Baidu's goals of implementing Spark ML abstractions to train deep learning models with minimal code changes and leveraging Paddle's distributed training capabilities.
2) The system architecture which uses Spark and YARN for resource management and Paddle for distributed training.
3) Experiments showing Paddle can efficiently train large models like image recognition on ImageNet using many machines and GPUs, as well as sparse models.
This document discusses ARIMA (autoregressive integrated moving average) models for time series forecasting. It covers the basic steps for identifying and fitting ARIMA models, including plotting the data, identifying possible AR or MA components using the autocorrelation function (ACF) and partial autocorrelation function (PACF), estimating model parameters, checking the residuals to validate the model fit, and choosing the best model. An example analyzes quarterly US GNP data to demonstrate these steps.
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Chris Fregly
This document summarizes a presentation on advanced analytics using Apache Spark. It discusses Spark's capabilities for real-time streaming, machine learning, graph analytics, text processing and recommendations. It then covers various Spark features and techniques for parallelism, performance, streaming, approximations, similarities and personalized recommendations. Code examples and demos are provided to illustrate Spark's capabilities.
1. The document discusses using Lambda architecture principles to perform fast data analytics on streaming and batch data from IoT sources using Spark Streaming and MLlib.
2. A proposed smart parking use case would recommend the best parking garage by combining streaming GPS data from cars with batch updates on garage capacity, scoring each garage using machine learning models.
3. The Lambda architecture is implemented using Kafka to ingest streaming GPS updates and batch capacity updates, Spark Streaming and Spark SQL to prepare, transform, and join the data, and MLlib to score and rank the garages in real-time.
Introduction to Apache Spark Workshop at Lambda World 2015 on October 23th and 24th, 2015, celebrated in Cádiz. Speakers: @fperezp and @juanpedromoreno
Github Repo: https://github.com/47deg/spark-workshop
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodel, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. Spark ships with a Python interface, aka PySpark, however, because Spark’s runtime is implemented on top of JVM, using PySpark with native Python library sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem – Vectorized UDF. Vectorized UDF is built on top of Apache Arrow and bring you the best of both worlds – the ability to define easy to use, high performance UDFs and scale up your analysis with Spark.
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducability for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
Gurpreet Singh from Microsoft gave a talk on scaling Python for data analysis and machine learning using DASK and Apache Spark. He discussed the challenges of scaling the Python data stack and compared options like DASK, Spark, and Spark MLlib. He provided examples of using DASK and PySpark DataFrames for parallel processing and showed how DASK-ML can be used to parallelize Scikit-Learn models. Distributed deep learning with tools like Project Hydrogen was also covered.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
Performance Troubleshooting Using Apache Spark MetricsDatabricks
Luca Canali, a data engineer at CERN, presented on performance troubleshooting using Apache Spark metrics at the UnifiedDataAnalytics #SparkAISummit. CERN runs large Hadoop and Spark clusters to process over 300 PB of data from the Large Hadron Collider experiments. Luca discussed how to gather, analyze, and visualize Spark metrics to identify bottlenecks and improve performance.
Spark Summit EU talk by Christos ErotocritouSpark Summit
This document discusses Apache Ignite and how it can be used with Apache Spark for fast data applications. It provides an overview of Ignite's in-memory data fabric capabilities, how it compares to Spark, and how Ignite can be integrated with Spark to provide shared resilient storage and distributed computing. Examples are given of reading and writing data between Ignite and Spark and using Ignite's in-memory file system and SQL support from Spark.
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
Spark on YARN provides resource management and security features through YARN, but still has areas for improvement. Dynamic allocation in YARN allows Spark applications to grow and shrink executors based on task demand, though latency and data locality could be enhanced. Security supports Kerberos authentication and delegation tokens, but long-lived applications face token expiration issues and encryption needs improvement for control plane, shuffle files, and user interfaces. Overall, usability, security, and performance remain areas of focus.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsDatabricks
Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged providing cost, performance and power consumption advantages. Field programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well-known for high-performance dense-matrix, highly regular operations such as graphics processing and matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to inform decisions about if and how to accelerate.
This talk will characterize the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache SparkSQL, via services on Amazon Web Services (AWS) cloud. These solutions’ goal is providing Spark users high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of SparkSQL operations they accelerate well, how they accelerate them, and the implications for the user’s application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is a Open Source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well written code and good model design. In this talk I will walk through some of those problems and will present some best practices to prevent them, coming from lessons learned when putting things in production.
Apache Spark is a fast, general engine for large-scale data processing. It provides unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
Spark Summit 2016: Connecting Python to the Spark EcosystemDaniel Rodriguez
This document discusses connecting Python to the Spark ecosystem. It covers the PyData and Spark ecosystems, package management for Python libraries in a cluster, leveraging Spark with tools like Sparkonda and Conda, and using Python with Spark including with NLTK, scikit-learn, TensorFlow, and Dask. Future directions include Apache Arrow for efficient data structures and leveraging alternative computing frameworks like TensorFlow.
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
Spark Summit EU talk by Berni SchieferSpark Summit
This document summarizes experiences using the TPC-DS benchmark with Spark SQL 2.0 and 2.1 on a large cluster designed for Spark. It describes the configuration of the "F1" cluster including its hardware, operating system, Spark, and network settings. Initial results show that Spark SQL 2.0 provides significant improvements over earlier versions. While most queries completed successfully, some queries failed or ran very slowly, indicating areas for further optimization.
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
Scalable Deep Learning Platform On Spark In BaiduJen Aman
This document summarizes Baidu's work on scaling deep learning using Spark. It discusses:
1) Baidu's goals of implementing Spark ML abstractions to train deep learning models with minimal code changes and leveraging Paddle's distributed training capabilities.
2) The system architecture which uses Spark and YARN for resource management and Paddle for distributed training.
3) Experiments showing Paddle can efficiently train large models like image recognition on ImageNet using many machines and GPUs, as well as sparse models.
This document discusses ARIMA (autoregressive integrated moving average) models for time series forecasting. It covers the basic steps for identifying and fitting ARIMA models, including plotting the data, identifying possible AR or MA components using the autocorrelation function (ACF) and partial autocorrelation function (PACF), estimating model parameters, checking the residuals to validate the model fit, and choosing the best model. An example analyzes quarterly US GNP data to demonstrate these steps.
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Chris Fregly
This document summarizes a presentation on advanced analytics using Apache Spark. It discusses Spark's capabilities for real-time streaming, machine learning, graph analytics, text processing and recommendations. It then covers various Spark features and techniques for parallelism, performance, streaming, approximations, similarities and personalized recommendations. Code examples and demos are provided to illustrate Spark's capabilities.
Spark Summit EU 2015: Reynold Xin KeynoteDatabricks
This document summarizes Spark's development over the past 12 months and provides a look ahead. It discusses improvements to both the frontend, such as DataFrames and machine learning pipelines, and the backend through projects like Tungsten for performance optimizations. Going forward, it mentions new features like the Dataset API, streaming DataFrames, and potential hardware improvements from technologies like 3D XPoint memory. The overall goal is to provide a unified engine and APIs that can automatically optimize analytics workloads across languages and domains.
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
1) Reynold Xin presented on using sketches like Bloom filters, HyperLogLog, count-min sketches, and stratified sampling to summarize and analyze large datasets in Spark.
2) Sketches allow analyzing data in small space and in one pass to identify frequent items, estimate cardinality, and sample data.
3) Spark incorporates sketches to speed up exploration, feature engineering, and building faster exact algorithms for processing large datasets.
A Deep Dive into Structured Streaming in Apache Spark Anyscale
This document provides an overview of Structured Streaming in Apache Spark. It begins with a brief history of streaming in Spark and outlines some of the limitations of the previous DStream API. It then introduces the new Structured Streaming API, which allows for continuous queries to be expressed as standard Spark SQL queries against continuously arriving data. It describes the new processing model and how queries are executed incrementally. It also covers features like event-time processing, windows, joins, and fault-tolerance guarantees through checkpointing and write-ahead logging. Overall, the document presents Structured Streaming as providing a simpler way to perform streaming analytics by allowing streaming queries to be expressed using the same APIs as batch queries.
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-or-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.
Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming"
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.
Follow T.D. on -
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas
Making Structured Streaming Ready for ProductionDatabricks
In mid-2016, we introduced Structured Steaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing application without having to reason about having to reason about streaming. It allows the user to express their streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
This document compares Apache Spark and Apache Flink. Both are open-source platforms for distributed data processing. Spark was created in 2009 at UC Berkeley and donated to the Apache Foundation in 2013. It uses resilient distributed datasets (RDDs) and lazy evaluation. Flink was started in 2010 as a collaboration between universities in Germany and became an Apache project in 2014. It uses cyclic data flows and supports both batch and stream processing. While Spark is currently more mature with more components and community support, Flink claims to be faster for stream and batch processing. Overall, both platforms continue to evolve and improve.
This presentation on Spark Architecture will give an idea of what is Apache Spark, the essential features in Spark, the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Graphx. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark Architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
An Engine to process big data in faster(than MR), easy and extremely scalable way. An Open Source, parallel, in-memory processing, cluster computing framework. Solution for loading, processing and end to end analyzing large scale data. Iterative and Interactive : Scala, Java, Python, R and with Command line interface.
Getting Started with Apache Spark (Scala)Knoldus Inc.
In this session, we are going to cover Apache Spark, the architecture of Apache Spark, Data Lineage, Direct Acyclic Graph(DAG), and many more concepts. Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
This document provides an overview and introduction to Apache Flink, a stream-based big data processing engine. It discusses the evolution of big data frameworks to platforms and the shortcomings of Spark's RDD abstraction for streaming workloads. The document then introduces Flink, covering its history, key differences from Spark like its use of streaming as the core abstraction, and examples of using Flink for batch and stream processing.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly resulting in businesses to either make it or be left behind
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This document provides an overview of analyzing large datasets using Apache Spark. It discusses:
- The evolution of big data systems to address high volume, velocity, and variety of data including Hadoop, Pig, Hive, and Spark.
- Common data processing models like batch, lambda architecture, and kappa architecture.
- Running Spark on OpenShift for scalable data analysis across clusters of machines.
- A demonstration of using Spark on OpenShift with Oshinko to process streaming data from Kafka on OpenShift.
The exponential growth of data is not a problem but processing/managing the huge diversity of data is a concern. In this session, we are gonna discuss about one of the most popular big data distributed processing framework "Spark" where we'll be developing API using "Scala", a gained prominence amid big data developers.
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
Spark is a fast general-purpose engine for large-scale data processing. It has advantages over MapReduce like speed, ease of use, and running everywhere. Spark supports SQL querying, streaming, machine learning, and graph processing. It can run on Scala, Java, Python. Spark applications have drivers, executors, tasks and run RDDs and shared variables. The Spark shell provides an interactive way to learn the API and analyze data.
This document discusses applying Apache Spark to data science challenges in media and entertainment. It introduces Spark as a unifying framework for content personalization using recommendation systems and streaming data, as well as social media analytics using GraphFrames. Specific use cases discussed include content personalization with recommendations, churn analysis, analyzing social networks with GraphFrames, sentiment analysis, and viewership prediction using topic modeling. The document also discusses continuous applications with Spark Streaming, and how Spark ML can be used for machine learning workflows and optimization.
Spark is an open-source cluster computing framework that provides high performance for both batch and streaming data processing. It addresses limitations of other distributed processing systems like MapReduce by providing in-memory computing capabilities and supporting a more general programming model. Spark core provides basic functionalities and serves as the foundation for higher-level modules like Spark SQL, MLlib, GraphX, and Spark Streaming. RDDs are Spark's basic abstraction for distributed datasets, allowing immutable distributed collections to be operated on in parallel. Key benefits of Spark include speed through in-memory computing, ease of use through its APIs, and a unified engine supporting multiple workloads.
This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operated on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple uses cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings along with its unique vision and philosophy, it is considered the 4 G (4th Generation) of Big Data Analytics frameworks providing the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is Apache Flink stack and how it fits into the Big Data ecosystem?
2. How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment?
3. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark.
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
This document provides an overview of Apache Spark, including:
- Spark allows for fast iterative processing by keeping data in memory across parallel jobs for faster sharing than MapReduce.
- The core of Spark is the resilient distributed dataset (RDD) which allows parallel operations on distributed data.
- Spark comes with libraries for SQL queries, streaming, machine learning, and graph processing.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
This document outlines an agenda for a talk on building deep learning applications on big data platforms using Analytics Zoo. The agenda covers motivations around trends in big data, deep learning frameworks on Apache Spark like BigDL and TensorFlowOnSpark, an introduction to Analytics Zoo and its high-level pipeline APIs, built-in models, and reference use cases. It also covers distributed training in BigDL, advanced applications, and real-world use cases of deep learning on big data at companies like JD.com and World Bank. The talk concludes with a question and answer session.
Similar to Spark Streaming and MLlib - Hyderabad Spark Group (20)
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
1. Spark Streaming and MLlib
The stack for distributed,
massively scalable, (near) real-time
data processing and machine learning
present
Phaneendra Chiruvella
http://twitter.com/pcx66
Hyderabad Spark Group & Zemoso Technologies
2. Agenda
● Brief intro to Spark Core
● Introduction to Spark Streaming
● What is the world talking about?: A demo of Spark
Streaming with Twitter
● Introduction to Spark MLlib
● Let’s see what movies you might like: A demo of Spark
MLlib by building a Movie Recommendation Engine
3. Spark: Lightning-fast cluster computing
● Data processing engine
● Distributed
● Massively scalable: Known largest cluster size is 8,000
machines with PBs of data processed
● Programmable in Scala, Java, Python and R
● Interactive shell
● Both Batch & Stream processing
● Stable and robust: being used in production at many
companies
● Known to work well with other “Big data” tools like Kafka,
Cassandra, HDFS, HBase, etc.
Image source: http://spark.apache.org/docs/latest/cluster-overview.html
4. Spark: How it works?
● Every application has it’s own
SparkContext
● Cluster Managers available are:
Spark Standalone, YARN, Mesos
Image source: http://spark.apache.org/docs/latest/cluster-overview.html
5. Spark: Resilient Distributed Datasets
RDD is the fundamental abstraction of Spark,
providing a rich, fault-tolerant layer over a cluster of
machines
Executors
SparkContext
RDD
7. Spark Streaming:
batch processing not
enuf!
● Extension to Core API
● Micro-batches processed in
realtime
● Minimize latency to seconds
8. Spark Streaming: How it works?
● DStreams - Just a chain of RDDs
● Batch Interval, Input DStreams and Receivers
● Some Input Sources: Sockets, File systems, Kafka, Twitter
Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
9. Spark Streaming: How it works?
● Windowed operations
● DStream Transformations are translated to RDD Transformations
● Direct access to RDDs underneath
Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
11. Spark MLlib: Just analytics not enuf!
● Practical, scalable ML library with
implementations of several common
algorithms and more being added
● Alternative spark.ml high-level API based
on spark.sql.DataFrame. Out of scope
for our current talk.
12. Spark MLlib: Demo
● Let’s see what movies you might like: A demo of Spark MLlib by building a
Movie Recommendation Engine
13. Spark: Streaming and
MLlib, match made in
heaven!
● MLlib provides algorithms that can
learn on streaming data and
simultaneously apply on the
streaming data!
● Also, a large set of algorithms that
can learn offline and be applied on
the streaming data
14. Spark: What next?
● Spark SQL - A SQL-like layer over RDDs
● spark.ml
● Spark GraphX - A graph-processing abstraction over RDDs
● Apache Storm and Apache Flink - Modern streaming-first systems
Q&A
15. Thank you!
Slide deck will be made available at:
http://blog.zemosolabs.com/
Spark Docs are a great place to get started
http://spark.apache.org/docs/latest/programming-
guide.html
Acknowledgements:
Code demos are from Databricks Training
Memes generated from ImgFlip.com