Spark is a general-purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data-parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into its Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general-purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. To automatically identify correctness issues and performance regressions, we have built a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims to extend the coverage of typical unit-test suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss the various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours thanks to our automated benchmarking tools.
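To make the random-query-generation idea concrete, here is a minimal sketch (not the actual Databricks harness): generate random filter predicates over a small table and cross-check the results of each query under two engine configurations, with whole-stage code generation enabled and disabled. The table, predicate shapes, and configuration pair are illustrative assumptions.

import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("random-query-fuzzing").getOrCreate()
spark.range(0, 1000).createOrReplaceTempView("t")

def random_predicate():
    # Random comparison over the single integer column of the test table.
    op = random.choice(["<", "<=", ">", ">=", "=", "!="])
    return "id {} {}".format(op, random.randint(0, 1000))

for _ in range(100):
    query = "SELECT id FROM t WHERE {} ORDER BY id".format(random_predicate())

    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    with_codegen = [row.id for row in spark.sql(query).collect()]

    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    without_codegen = [row.id for row in spark.sql(query).collect()]

    # Any mismatch points at a correctness bug in one of the code paths.
    assert with_codegen == without_codegen, "result mismatch for: " + query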
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic... (Databricks)
Physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics with increased interactivity. The physics data itself is stored in CERN’s mass storage system, EOS, and CERN’s IT department runs an on-premises private cloud based on OpenStack to provide on-demand compute resources to physicists. This presents both an opportunity and a challenge for the Big Data team at CERN to provide elastic, scalable, reliable Spark-as-a-service on OpenStack.
The talk focuses on the design choices made and challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis and accounting of Spark applications, and supporting stateful Spark Streaming applications. We will also share results from running large-scale sustained workloads over terabytes of physics data.
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
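One common integration route, sketched below and not necessarily the one Bloomberg chose, is to wrap the existing shared library with ctypes and call it from a PySpark UDF. The library name "libsentiment.so" and its score() function are hypothetical placeholders; in a cluster the .so would need to be shipped to executors (for example via --files) or preinstalled.

import ctypes

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("cpp-udf").getOrCreate()

def score_story(text):
    # Load the library inside the function so each executor process loads its own copy.
    lib = ctypes.CDLL("libsentiment.so")            # hypothetical C++ library
    lib.score.argtypes = [ctypes.c_char_p]          # assumed C signature: double score(const char*)
    lib.score.restype = ctypes.c_double
    return float(lib.score(text.encode("utf-8")))

score_udf = udf(score_story, DoubleType())

stories = spark.createDataFrame([("Shares rally on strong earnings",)], ["headline"])
stories.withColumn("sentiment", score_udf("headline")).show()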
700 Updatable Queries Per Second: Spark as a Real-Time Web Service (Evan Chan)
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDB for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
Mobility insights at Swisscom - Understanding collective mobility in Switzerland (François Garillot)
Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-driven insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning.
In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data.
Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark Streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come away from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters (DataWorks Summit)
In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. Unfortunately, they support only synchronous distributed learning and don’t allow TensorFlow servers to communicate with each other directly.
In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.
Apache Spark on K8S Best Practice and Performance in the Cloud (Databricks)
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager for best performance. To evaluate performance, the TPC-DS benchmark will be used to analyze the performance impact of queries across configuration sets.
Speakers: Junjie Chen, Junping Du
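For orientation, here is a minimal configuration sketch for pointing a Spark application at a Kubernetes cluster (Spark 2.3+). The API server URL, container image, namespace, and service account are placeholders, not values from the talk, and the TPC-DS runs themselves are driven separately.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")    # Kubernetes API server (placeholder)
    .appName("tpcds-on-k8s")
    .config("spark.kubernetes.container.image", "example/spark:2.4.0")    # placeholder image
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "8")                # executors created as pods on the cluster
    .getOrCreate()
)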
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning (DataWorks Summit)
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can tell Spark whether a stage of the pipeline should run in MapReduce mode or barrier mode, which makes it easy to embed distributed deep learning training as a Spark stage and simplify the training workflow. In this talk, I will demonstrate step by step how to build a real-world pipeline that combines data processing with Spark and deep learning training with TensorFlow. I will also share best practices and hands-on experience to show the power of this new feature, and open up more discussion on this topic.
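A minimal sketch of the barrier primitive itself; the training body is a placeholder (a real job would launch a distributed TensorFlow worker in each task), and the cluster must have at least as many free slots as barrier tasks, since they are scheduled all-or-nothing.

from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-demo").getOrCreate()

def train_partition(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()                                    # all tasks in the stage reach this point together
    workers = [info.address for info in ctx.getTaskInfos()]
    # ... launch one distributed-training worker per task across `workers` here ...
    yield (ctx.partitionId(), len(workers))

# Unlike a regular MapReduce-style stage, barrier mode launches all 4 tasks at once (or none).
rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.barrier().mapPartitions(train_partition).collect())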
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika (Databricks)
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach offers better performance and energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programming interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow, and in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
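For context, this is where compression plugs into Spark: shuffle, broadcast, and spill compression are governed by spark.io.compression.codec, and a hardware-backed codec would be supplied as a custom implementation of that codec interface. A hedged sketch follows; the FPGA codec class name is hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("compression-demo")
    # Built-in codecs: lz4 (default), lzf, snappy, zstd.
    .config("spark.io.compression.codec", "zstd")
    # An FPGA-backed codec would be plugged in as a custom CompressionCodec implementation;
    # the class name below is hypothetical.
    # .config("spark.io.compression.codec", "com.example.FpgaGzipCodec")
    .config("spark.shuffle.compress", "true")        # shuffle data compression (on by default)
    .getOrCreate()
)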
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr... (Databricks)
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimizing the usage and cost of compute and storage resources. A novel prototype of an event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL, and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
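To make the pipeline shape concrete, an illustrative sketch (not CERN's actual code): feature preparation in PySpark, then training a small Keras classifier on the collected features. The file path, column names, and the collect-to-driver step are simplifying assumptions; the talk itself trains in a distributed fashion with BigDL and Analytics Zoo.

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hep-pipeline-sketch").getOrCreate()

# Feature preparation in Spark; the path and column names are placeholders.
events = spark.read.parquet("/data/filtered_events.parquet")
features = (events
            .withColumn("label", (F.col("event_type") == "signal").cast("double"))
            .select("feature_vector", "label"))

# For a small sketch we collect to the driver; at scale the training itself is distributed.
pdf = features.toPandas()
X = np.stack(pdf["feature_vector"].values)
y = pdf["label"].values

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=256)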
Elastify Cloud-Native Spark Application with Persistent Memory (Databricks)
Cloud-native deployment has become one of the major trends for large-scale Big Data analytics. Compared to an on-premises data center, the cloud offers much stronger scalability and higher elasticity to Big Data applications. However, the cloud is also considered less performant than on-premises alternatives due to virtualization and cluster resource disaggregation. We present a new cloud-native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3D XPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology and how such new technology could be leveraged in a cloud data processing architecture.
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas’ new functionalities since its initial release, discuss its roadmaps, and how we think Koalas could become the standard API for large scale data science.
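As a taste of what the hands-on portion covers, a minimal sketch of the pandas-to-Koalas transition (the DataFrame contents and path are illustrative):

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"city": ["SF", "NY", "SF"], "amount": [10.0, 20.0, 5.0]})

# Same pandas-style API, but backed by Spark: operations run distributed rather than on one machine.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("city")["amount"].sum().sort_index())

# Koalas can also read data straight into a distributed frame (path is a placeholder).
# kdf = ks.read_csv("/data/transactions.csv")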
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Pre-register for Databricks Community Edition
Read koalas docs
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop (MapR Technologies)
http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such, there’s been plenty of hype about it in recent months. But how much of the discussion is marketing spin? And what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, will cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling (Databricks)
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with TensorFlow, from Horovod to TensorFlowOnSpark to Databricks’ Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use the Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... (Sumeet Singh)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
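A minimal D-Stream word count using the Python API the talk previews (it landed in Spark 1.2); the socket source and port are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batches (D-Streams)

lines = ssc.socketTextStream("localhost", 9999)    # placeholder source: text over a socket
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print the counts of each micro-batch

ssc.start()
ssc.awaitTermination()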
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ... (DataStax Academy)
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over the standard solutions today.
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft (Chester Chen)
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of its mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
- Key traits of Apache Spark on Kubernetes.
- Deep dive into Lyft's multi-cluster setup and operability to handle petabytes of production data.
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling.
- Dynamic job scale estimation and runtime dynamic job configuration.
- How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015 (Mike Broberg)
Use Apache Spark Streaming with IBM Watson on Bluemix to perform sentiment analysis and track how a conversation is trending on Twitter.
By David Taieb: https://twitter.com/DTAIEB55
Video: https://youtu.be/KLc_wazud3s
Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met... (Databricks)
This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators and performance practitioners. You will find examples illustrating the importance of using the right tools and right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant for performance analysis. This includes tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metrics and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, tools for profiling CPU usage and for Flame Graph visualization.
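As a pointer to what the instrumentation looks like in practice, a short sparkMeasure sketch; the workload is a toy query, and the package coordinates are indicative and depend on your Spark/Scala versions.

from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

# Requires the sparkmeasure Python package plus the matching Scala package on the
# classpath, e.g. --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13.
spark = SparkSession.builder.appName("perf-troubleshooting").getOrCreate()
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.range(0, 10000000).selectExpr("sum(id)").show()
stagemetrics.end()
stagemetrics.print_report()      # aggregated task metrics: elapsed time, CPU time, shuffle, GC, ...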
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metrics output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization through Pandas UDFs, giving Python developers the ability to run native Python code fast (see the sketch after this list).
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
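A minimal sketch of the vectorized (Pandas) UDFs behind the PySpark item above; the temperature conversion is just an illustrative function.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# The function receives and returns pandas Series, so it operates on whole
# column batches (via Arrow) instead of one row at a time. Requires pyarrow.
@pandas_udf("double", PandasUDFType.SCALAR)
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()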
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (Yao Yao)
Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
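To give a flavor of two of those additions, a small sketch of the higher-order array functions and the built-in Avro source (the data and path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-2.4-features").getOrCreate()

df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5])], ["id", "xs"])
df.selectExpr(
    "id",
    "transform(xs, x -> x + 1) AS xs_plus_one",          # higher-order function over an array
    "aggregate(xs, 0, (acc, x) -> acc + x) AS total"     # fold over an array
).show()

# The native Avro source ships as the built-in spark-avro module in 2.4 (path is a placeholder):
# spark.read.format("avro").load("/data/events.avro")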
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai (Databricks)
At LinkedIn, we have thousands of Hadoop and Spark users ranging from amateurs to experts who run a variety of jobs on our huge 2000-plus node clusters. In just a few years, the number of Hadoop and Spark jobs has grown from hundreds to thousands. With this ever-increasing number of users and jobs, it becomes crucial to have an efficient way to find answers to frequently asked questions like:
1) Why is my job running slow?
2) Why did my job get killed?
3) Can you send me an alert when my job is about to fail or miss SLA?
4) Do we have enough resources on the Hadoop cluster?
Having this information available will help us debug faster, alert on anomalies, perform root cause analysis (RCA), identify workload patterns, and plan capacity. To address this problem, we at LinkedIn have built a Unified Grid Metrics Platform that captures and stores current and historical job metrics. In our experience debugging and tuning jobs and interacting with our users, we have learnt a lot of lessons and have been integrating ideas and solutions into this system. For example, we have learned that capturing and storing the complete set of metrics and its history, though fascinating, is actually rarely useful, just like the verbose logs in Spark. We have come up with derived metrics and a curated list of metrics that we track very closely at LinkedIn.
In this talk, we will discuss the architecture of how we built this platform for both Hadoop and Spark along with the huge challenges in collecting all the standard, derived and custom user metrics in real-time. We will see how this system allows users to build reporting dashboards, perform trend analysis, dimension analysis and view correlated metrics together.
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark" (IT Event)
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
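For readers new to the programming model mentioned above, a minimal RDD sketch showing lazy transformations and a single action:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# Transformations (parallelize, filter, map) build the lineage lazily;
# the action (reduce) is what actually triggers the distributed job.
nums = sc.parallelize(range(1, 1001))
total = (nums.filter(lambda x: x % 2 == 0)
             .map(lambda x: x * x)
             .reduce(lambda a, b: a + b))
print(total)

sc.stop()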
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
Track A-2: Spark-Based Data Analytics
1. 1
Spark Drives Big Data Analytics Application
Spark-Based Data Analytics
James Chen
Etu CTO
June 16, 2015
2. 2
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
3. 3
Key Advances by MapReduce:
• Data Locality: Automatic split computation and appropriate launching of mappers
• Fault Tolerance: Writing out intermediate results and restartable mappers meant the ability to run on commodity hardware
• Linear Scalability: The combination of locality and a programming model that forces developers to write generally scalable solutions to problems
A Brief Review of MapReduce
[Slide diagram: many parallel Map tasks feeding a smaller set of Reduce tasks]
4. 4
MapReduce: Good
The Good:
• Built in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API
5. 5
MapReduce: Bad
The Bad:
• Optimized for disk IO
  – Doesn't leverage memory
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Developers have to build on a very simple abstraction
  – Key/value in, key/value out
  – Even basic things like join require extensive code (see the Spark join sketch below)
• The result is often many files that need to be combined appropriately
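To make the join point concrete, here is a minimal sketch of the same kind of join expressed as a single Spark transformation. The dataset names and the local-mode SparkContext are illustrative assumptions, not part of the original deck.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "join-sketch")   // local-mode context purely for illustration

// Two small key/value datasets; in raw MapReduce the equivalent join
// needs custom mappers, a tagged shuffle, and a custom reducer.
val orders    = sc.parallelize(Seq((1, "order-42"), (2, "order-43")))
val customers = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

// In Spark the join is one transformation on pair RDDs.
orders.join(customers).collect().foreach(println)      // (1,(order-42,Alice)), (2,(order-43,Bob))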
6. 6
Spark is a general purpose computational framework with more flexibility than MapReduce.
Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet retains: linear scalability, fault tolerance, and data-locality-based computations
Reference:
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
What is Spark?
7. 7
Easy to Develop
– Highly productive language support
– Clean and expressive APIs
– Interactive shell
– Out-of-the-box functionality
Spark: Easy and Fast Big Data
Fast to Run
– General execution graphs
– In-memory storage
2-5× less code
Up to 10× faster on disk, 100× in memory
8. 8
Easy: Example – Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
9. 9
Hadoop Integration
• Works with Hadoop Data
• Runs With YARN
Libraries
• MLlib
• Spark Streaming
• GraphX (alpha)
Out-of-the-Box Functionality
Language support:
• Improved Python support
• SparkR
• Java 8
• Schema support in Spark's APIs
10. 10
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print "Final w: %s" % w
Example: Logistic Regression
11. 11
• A 100-node Hadoop cluster contains 10+ TB of RAM today, and that will double next year
• 1 GB of RAM ~ $10–$20
• Trends:
  • ½ the price every 18 months
  • 2× the bandwidth every 3 years
Memory Management Leads to Greater Performance
[Slide diagram: a single node with 64–128 GB RAM, 16 cores, and ~50 GB/sec memory bandwidth]
Memory can be an enabler for high-performance big data applications
12. 12
In-memory Caching
• Data partitions read from RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
Fast: Using RAM, Operator Graphs
[Slide diagram: an RDD operator graph – map, join, filter, groupBy, take – over partitioned datasets A–F, with cached partitions highlighted]
13. 13
Expressiveness of Programming Model
[Slide diagram: chains of Map and Reduce stages contrasting pipelined MapReduce jobs (efficient group-by aggregations and other analytics) with iterative jobs (machine learning)]
14. 14
Logistic Regression Performance (Data Fits in Memory)
[Chart: running time in seconds vs. number of iterations (1, 5, 10, 20, 30). Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for further iterations]
15. 15
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
16. 16
Spark Engineering in Cloudera
• Cloudera embraced Spark in early 2014
• Engineering with Intel to broaden Spark ecosystem
– Hive-on-Spark
– Pig-on-Spark
– Spark-over-YARN
– Spark Streaming Reliability
– General Spark Optimization
17. 17
Hive on Spark
• Technology
– Hive: “standard” SQL tool in Hadoop
– Spark: next-gen distributed processing framework
– Hive + Spark
• Performance
• Minimum feature gap
• Industry
– Many customers have invested heavily in Hive
– Want to leverage the Spark engine
18. 18
Design Principles
• No or limited impact on Hive’s existing code path
• Maximize code reuse
• Minimum feature customization
• Low future maintenance cost
20. 20
Work – Metadata for Task
• MapReduceWork contains one MapWork and a possible ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks
Query: select name, sum(value) as v from dec group by name order by v;
[Slide diagram: the query compiles to two MR jobs (MapWork1 → ReduceWork1, then MapWork2 → ReduceWork2) but to a single Spark job (MapWork1 → ReduceWork1 → ReduceWork2)]
21. 21
Data Processing via Spark
• Treat Table as HadoopRDD (input RDD)
• Apply the function that wraps MR’s map-side processing
• Shuffle map output using Spark's transformations (groupByKey, sortByKey, etc.)
• Apply the function that wraps MR's reduce-side processing (a rough sketch of this pattern follows)
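As a rough illustration of this pattern (not Hive on Spark's actual internal classes), the sketch below reads a table as an input RDD, applies a map-side function, shuffles with a Spark transformation, and applies a reduce-side function, mirroring the sample query from the previous slide. The paths, the field delimiter, and the parsing logic are assumptions made for the example.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "hive-on-spark-pattern")

// Table read as an input RDD (the slide's HadoopRDD).
val table = sc.textFile("hdfs://.../warehouse/dec")

// Map-side processing: parse each row into (name, value).
val mapOut = table.map { row =>
  val cols = row.split('\u0001')          // assumed field delimiter, for illustration only
  (cols(0), cols(1).toLong)
}

// Shuffle via a Spark transformation, then reduce-side processing.
val reduced = mapOut.groupByKey().map { case (name, values) => (name, values.sum) }

// order by v, then write out (output path is a hypothetical example).
reduced.sortBy(_._2).saveAsTextFile("hdfs://.../dec-sums-by-name")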
22. 22
Spark Plan
• MapInput – encapsulate a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing
Query: Select name, sum(value) as v from dec group by name order by v;
23. 23
Current Status
• All functionality in Hive is implemented
• First round of optimization is completed
– Map join, SMB
– Split generation and grouping
– CBO, vectorization
• More optimization and benchmarking coming
• Beta in CDH
– http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/
– http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf
24. 24
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
Agenda
25. 25
User / Use Case / Spark's Value

Conviva – Optimize end users' online video experience through real-time analysis of traffic rules and finer-grained traffic control.
• Rapid prototyping
• Shared business logic for offline and online computation
• Open-source machine learning algorithms

Yahoo! – Speed up the model-training cycle for ad serving (feature extraction improved ~3×); use collaborative filtering for content recommendation.
• Lower data-pipeline latency
• Iterative machine learning
• Efficient P2P broadcast

Anonymous (large tech company) – Near-real-time log aggregation and analysis for monitoring and alerting.
• Low-latency, high-frequency "mini" batch jobs to process the latest data

Technicolor – Real-time analytics for (telecom) customers; stream processing and real-time query capability.
• Simple deployment, requiring only Spark and Spark Streaming
• Ad-hoc queries over online data
Sample Use Cases
26. 26
Large tech company – Spark is used for new machine learning investigations for search personalization
Financial services – Process millions of stock positions and future scenarios in 4 hrs with Spark (compared with 1 week using MapReduce)
University – Genomics research using Spark pipelines
Video – Spark and Spark Streaming for video streaming and analysis
Hospital – Spark for predictive modeling of disease conditions
Cloudera Use Cases in Verticals
27. 27
• Run ETL on Spark using Pig
  – To achieve very tight SLAs
  – Accenture Smart Water Application
• Spark analytics over HBase
  – Patients' physiological data, experiment and user data
  – Serving researchers
• Traffic analysis using MLlib/clustering at Dylan
• Annotated variants analysis on Spark
  – Using the Spark/Java framework at Duke
• Sepsis detection with Spark Streaming
Cloudera Use Cases with Different Components
28. 28
• A car shopping website where people from all across the nation come to read reviews, compare prices, and in general get help in all matters car related.
• The goal was to build a near real-time dashboard that would provide both unique visitor and page view counts per make and make/model, and that could be engineered in a couple of weeks.
• In the past, these updates had been restricted to hourly granularities with an additional hour of delay.
• Furthermore, as this data was not available in an easy-to-use dashboard, manual processing was needed to visualize the data.
Near real-time dashboard by Edmunds.com
33. 33
Case Study in Etu Insight
• Problem domain:
  − Analyze user behavior from website interaction logs
  − Analyze user behavior from existing offline data
  − Aggregate the data, grouped by time and user
• Approach:
  − ETL the web logs into Hive structured data
  − Import the existing database data
  − Define and implement the aggregation function in Spark (with Scala)
  − Output the calculation results to HBase (see the sketch below)
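A minimal sketch of the aggregation step just described, assuming a hypothetical cleaned log layout (user, URL, hour) and a plain-text output in place of the real HBase sink; it is not Etu Insight's actual code.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "etu-insight-sketch")

// Hypothetical cleaned web log produced by the ETL step: user \t url \t hour
val logs = sc.textFile("hdfs://.../insight/weblog")

// Aggregate page views grouped by (hour, user), as described on the slide.
val counts = logs.map { line =>
  val Array(user, _, hour) = line.split('\t')
  ((hour, user), 1L)
}.reduceByKey(_ + _)

// The real pipeline writes the result to HBase; saving to text keeps the sketch self-contained.
counts.saveAsTextFile("hdfs://.../insight/pageviews-by-hour-user")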
36. 36
Advanced Analytics with Spark
• Written by Cloudera's data science team
  – The first book bridging ML with the Hadoop ecosystem
  – Focused on use cases and examples rather than being a manual
  – Targeted at data scientists solving real-world analysis problems
  – Generally available in May 2015
37. 37
Analyzing Big Data
• Building a model to detect credit card fraud using thousands
of features and billions of transactions
• Intelligently recommend millions of products to millions of
users
• Estimate financial risk through simulations of portfolios
including millions of instruments
• Easily manipulate data from thousands of human genomes to
detect genetic associations with disease
38. 38
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
39. 39
Spark is a fully integrated and supported part of Cloudera's enterprise data hub
• First vendor to ship and support Spark
  – Invested early to make it a cohesive part of the platform
  – Complemented by Intel's early investment
  – Developed and supported in collaboration with Databricks to ensure success
• Only vendor with Spark committers on staff
• Several Spark use cases in production
• Well-trained support staff and external training courses
Cloudera's Investment in Spark
40. 40
Hadoop in the Spark World
[Slide diagram: the Hadoop stack in the Spark world – YARN over HDFS and HBase, alongside MapReduce2, Hive, Pig, Impala, and Search, with Spark, Spark Streaming, Spark SQL, GraphX, and MLlib; components are marked as core Hadoop, supported Spark components, or unsupported add-ons]
41. 41
Focusing on Open Standards, not just Open Source
Open Standards are just as important as Open Source. Why does it matter?
• Diverse engineering is more sustainable.
• Broad support ensures vendor portability.
• Project utility depends on ecosystem compatibility, which depends on standards.
Cloudera leads in defining the de facto open standards adopted by the market.
Vendor Support
Component (Founder)         | Cloudera | Pivotal | MapR | Amazon | IBM | Hortonworks
Spark (UC Berkeley)         |    ✔     |    ✔    |  ✔   |   ✔    |  ✔  |     ✔
Impala (Cloudera)           |    ✔     |    ✖    |  ✔   |   ✔    |  ✖  |     ✖
Hue (Cloudera)              |    ✔     |    ✖    |  ✔   |   ✔    |  ✖  |     ✔
Sentry (Cloudera)           |    ✔     |    ✔    |  ✔   |   ✖    |  ✔  |     ✖
Flume (Cloudera)            |    ✔     |    ✔    |  ✔   |   ✖    |  ✔  |     ✔
Parquet (Cloudera/Twitter)  |    ✔     |    ✔    |  ✔   |   ✔    |  ✔  |     ✖
Sqoop (Cloudera)            |    ✔     |    ✔    |  ✔   |   ✔    |  ✔  |     ✔
Falcon                      |    ✖     |    ✖    |  ✖   |   ✖    |  ✖  |     ✔
Knox                        |    ✖     |    ✖    |  ✖   |   ✖    |  ✖  |     ✔
Tez                         |    ✖     |    ✖    |  ✔   |   ✖    |  ✖  |     ✔
Ranger                      |    ✖     |    ✖    |  ✖   |   ✖    |  ✖  |     ✔
42. 42
Cloudera is a member of, and aligned with, the broader Spark community
Spark:
• Will replace MapReduce as the general purpose Hadoop framework
– Broad community and vendor adoption
– Hadoop ecosystem integration (native & 3rd party)
• Goes beyond data science/machine learning
– Cloudera working on Spark Core, Streaming, Security, YARN, and MLlib
• Does not replace special purpose frameworks
– One size does not fit all for SQL, Search, Graph, Stream
Cloudera’s Position on Spark
43. 43
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
46. 46
Etu Big Data software platform and services
[Slide diagram: Etu Manager (with Cloudera Manager inside) plus Etu services – Etu Support, Etu Professional Service, Etu Consulting, Etu Training – alongside Cloudera Support]
53. 53
• Read-only partitioned collection of records
• Created through:
– Transformation of data in storage
– Transformation of RDDs
• Contains lineage to compute from storage
• Lazy materialization
• Users control persistence and partitioning
RDD – Resilient Distributed Dataset
55. 55
• Transformations create a new RDD from an existing one
• Actions run a computation on an RDD and return a value
• Transformations are lazy
• Actions materialize RDDs by computing their transformations (see the sketch below)
• RDDs can be cached to avoid re-computation
Operations
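A small sketch of the transformation/action distinction, assuming a SparkContext named sc (as in the earlier word-count example) and a placeholder HDFS path: the filter and map calls below return new RDDs immediately without touching data, and nothing is computed until the count action runs.

val lines  = sc.textFile("hdfs://...")            // source RDD: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))    // transformation: lazy
val fields = errors.map(_.split("\t")(2))         // transformation: still lazy
val n      = fields.count()                       // action: triggers the whole computation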
56. 56
• RDDs contain lineage
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
Fault-Tolerance
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
[Slide diagram: lineage – HDFS file → filter(func = startswith(...)) → filtered RDD → map(func = split(...)) → mapped RDD]
57. 57
• persist() and cache() mark data for caching
• An RDD is cached after the first action
• Fault-tolerant – lost partitions will be re-computed
• If there is not enough memory, some partitions will not be cached
• Future actions are performed on the cached partitions, so they are much faster
Use caching for iterative algorithms (see the sketch below)
Caching
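A minimal caching sketch, again assuming a SparkContext named sc and a placeholder path: cache() only marks the RDD, the first action materializes it in memory, and later actions reuse the cached partitions (recomputing any lost or evicted ones from lineage).

val data = sc.textFile("hdfs://...").map(_.length.toLong).cache()  // marked for caching; nothing computed yet

val firstPass  = data.reduce(_ + _)   // first action: reads from HDFS and populates the cache
val secondPass = data.reduce(_ + _)   // later actions reuse the cached partitions and run much faster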