The document discusses productionizing Apache Spark and using the Spark REST Job Server. It provides an overview of Spark deployment options like YARN, Mesos, and Spark Standalone mode. It also covers Spark configuration topics like jars management, classpath configuration, and tuning garbage collection. The document then discusses running Spark applications in a cluster using tools like spark-submit and the Spark Job Server. It highlights features of the Spark Job Server like enabling low-latency Spark queries and sharing cached RDDs across jobs. Finally, it provides examples of using the Spark Job Server in production environments.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Tuning and Monitoring Deep Learning on Apache Spark (Databricks)
Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library.
More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs.
Speaker: Tim Hunter
This talk was originally presented at Spark Summit East 2017.
This document discusses Spark Job Server, an open source project that allows Spark jobs to be submitted and run via a REST API. It provides features like job monitoring, context sharing between jobs to reuse cached data, and asynchronous APIs. The document outlines motivations for the project, how to use it including submitting and monitoring jobs, and future plans like high availability and hot failover support.
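As a concrete flavor of that REST workflow, here is a minimal sketch that only builds the request URLs for spark-jobserver's documented endpoints (the base URL and port are assumptions; actually sending the requests requires a running server):

```python
from urllib.parse import urlencode

BASE = "http://localhost:8090"  # spark-jobserver's usual default port (assumption)

def upload_jar_url(app_name):
    # POST the application's fat jar bytes to this URL to register it
    return f"{BASE}/jars/{app_name}"

def submit_job_url(app_name, class_path, context=None, sync=False):
    # POST to this URL to run <class_path> from the uploaded jar;
    # naming an existing context reuses its cached RDDs across jobs
    params = {"appName": app_name, "classPath": class_path}
    if context:
        params["context"] = context
    if sync:
        params["sync"] = "true"
    return f"{BASE}/jobs?{urlencode(params)}"

def job_status_url(job_id):
    # GET this URL to poll an asynchronous job's status and result
    return f"{BASE}/jobs/{job_id}"
```

In practice a client would pair these URL builders with an HTTP library and poll `job_status_url` until the job reports completion.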
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin (Spark Summit)
This document discusses securing Spark applications. It covers encryption, authentication, and authorization. Encryption protects data in transit using SASL or SSL. Authentication uses Kerberos to identify users. Authorization controls data access using Apache Sentry and the Sentry HDFS plugin, which synchronizes HDFS permissions with higher-level abstractions like tables. A future RecordService aims to provide a unified authorization system at the record level for Spark SQL.
This document summarizes a presentation about Tachyon, an open source memory-centric distributed storage system. It introduces Tachyon and how it can be used with Spark to resolve issues around slow data sharing, in-memory data loss during crashes, and data duplication. The presentation outlines new features in Tachyon 0.8.0 like tiered storage, pluggable data management policies, and a unified namespace across storage systems. It concludes by inviting users and collaborators to try, develop, and get involved with the Tachyon community.
The Pushdown of Everything by Stephan Kessler and Santiago Mola (Spark Summit)
Stephan Kessler and Santiago Mola presented SAP HANA Vora, which extends Spark SQL's data sources API to allow "pushing down" more of a SQL query's logical plan to the data source for execution. This "Pushdown of Everything" approach leverages data sources' capabilities to process less data and optimize query execution. They described how data sources can implement interfaces like TableScan, PrunedScan, and the new CatalystSource interface to support pushing down projections, filters, and more complex queries respectively. While this approach has advantages in performance, challenges include the complexity of implementing CatalystSource and ensuring compatibility across Spark versions. Future work aims to improve the API and provide utilities to simplify implementation.
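To make the pushdown idea concrete, here is a plain-Python illustration (not the actual Spark data sources API, which is a set of Scala traits) of why pushing projection and filtering into the source moves less data across the boundary:

```python
# Toy "data source" illustrating pushdown: without pushdown the engine
# scans every row and filters afterwards; with pushdown the source
# applies the predicate and column pruning during the scan itself.
ROWS = [{"id": i, "country": "DE" if i % 2 else "US"} for i in range(1000)]

def table_scan():
    # TableScan-style: return everything, let the engine filter later
    return list(ROWS)

def pruned_filtered_scan(columns, predicate):
    # PrunedScan/filter-pushdown style: prune columns and filter inside
    # the source, so far fewer (and narrower) rows cross the boundary
    return [{c: row[c] for c in columns} for row in ROWS if predicate(row)]

full = [r for r in table_scan() if r["country"] == "DE"]
pushed = pruned_filtered_scan(["id"], lambda r: r["country"] == "DE")
assert len(full) == len(pushed) == 500
```

The `CatalystSource` interface described in the talk generalizes this further, pushing whole logical-plan fragments (joins, aggregates) down rather than just projections and filters.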
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati... (Spark Summit)
This document summarizes Uber's use of Spark as a data platform to support multi-tenancy and various data applications. Key points include:
- Uber uses Spark on YARN for resource management and isolation between teams/jobs. Parquet is used as the columnar file format for performance and schema support.
- Challenges include sharing infrastructure between many teams with different backgrounds and use cases. Spark provides a common platform.
- An Uber Development Kit (UDK) is used to help users get Spark jobs running quickly on Uber's infrastructure, with templates, defaults, and APIs for common tasks.
The document discusses lessons learned from building a real-time data processing platform using Spark and microservices. Key aspects include:
- A microservices-inspired architecture was used with Spark Streaming jobs processing data in parallel and communicating via Kafka.
- This modular approach allowed for independent development and deployment of new features without disrupting existing jobs.
- While Spark provided batch and streaming capabilities, managing resources across jobs and achieving low latency proved challenging.
- Alternative technologies like Kafka Streams and Confluent's Schema Registry were identified to improve resilience, schemas, and processing latency.
- Overall the platform demonstrated strengths in modularity, A/B testing, and empowering data scientists, but faced challenges around
Productionizing Spark and the Spark Job Server (Evan Chan)
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
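As a flavor of the config tuning the talk covers, a `spark-defaults.conf` sketch (the property names are real Spark settings; the values are illustrative placeholders, not recommendations):

```properties
# Serialization: Kryo is usually much faster than Java serialization
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Executor sizing -- tune to your cluster; these are placeholders
spark.executor.memory            4g
spark.executor.cores             4
# Shuffle parallelism for Spark SQL; adjust to cluster size
spark.sql.shuffle.partitions     200
# GC tuning example: use G1 and log GC activity for diagnosis
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc
```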
Jaws - Data Warehouse with Spark SQL by Ema Orhian (Spark Summit)
1) Jaws is a highly scalable and resilient data warehouse explorer that allows submitting Spark SQL queries concurrently and asynchronously through a RESTful API.
2) It provides features like persisted query logs, results pagination, and pluggable storage layers. Queries can be run on Spark SQL contexts configured to use data from HDFS, Cassandra, Parquet files on HDFS or Tachyon.
3) The architecture allows Jaws to scale on standalone, Mesos, or YARN clusters by distributing queries across multiple worker nodes, and supports canceling running queries.
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark (Evan Chan)
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi (Databricks)
This document discusses speeding up OLAP cube building in Apache Kylin using Spark. Cubing with MapReduce can be slow due to serialization overhead and repeated job submissions. Spark allows caching data in memory across cuboid layers in one job, significantly reducing build times compared to MapReduce as shown in a benchmark on a 160 million row dataset. Spark simplifies Kylin development and brings capabilities for real-time OLAP and cloud integration.
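The by-layer cubing idea can be illustrated in plain Python (a sketch of the concept, not Kylin's implementation): each cuboid is aggregated from its parent cuboid, which Spark can keep cached in memory, rather than from the raw rows:

```python
from collections import defaultdict

# rows: (year, country, measure)
raw = [("2024", "DE", 3), ("2024", "US", 5), ("2025", "DE", 2), ("2024", "DE", 4)]

def rollup(rows, dims):
    # sum the measure (last field) over the selected dimension indices
    acc = defaultdict(int)
    for row in rows:
        acc[tuple(row[i] for i in dims)] += row[-1]
    return [key + (total,) for key, total in sorted(acc.items())]

# layer 0: full (year, country) cuboid built from the raw rows
year_country = rollup(raw, (0, 1))
# layer 1: derive the (year) cuboid from the cached parent, not from raw
year_only = rollup(year_country, (0,))

assert year_only == [("2024", 12), ("2025", 2)]
```

With MapReduce each layer re-reads and re-serializes its input from disk; Spark can keep `year_country` in memory while computing every child cuboid, which is where the benchmark's speedup comes from.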
This presentation will be useful to those who would like to get acquainted with the Apache Spark architecture and its top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
It was presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda (Databricks)
It’s not enough to build a mesh of sensors or embedded devices to get more insights about the surrounding environment and optimize your production. Usually, your IoT solution needs to be capable of transferring enormous amounts of data to storage or a cloud where the data can be processed further. Quite often, the processing of the endless streams of data has to happen almost in real time so that you can react to the IoT subsystem’s state accordingly, and in time.
During this session, see how to build a Fast Data solution that will receive endless streams from the IoT side and will be capable of processing the streams in real-time using Apache Ignite’s cluster resources. In particular, learn about data streaming to an Apache Ignite cluster from embedded devices and real-time data processing with Apache Spark.
Big Data visualization with Apache Spark and Zeppelin (prajods)
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open source tool for data discovery, exploration and visualization. It supports REPLs for shell, Spark SQL, Spark (Scala), Python and Angular. This presentation was given on Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J... (Spark Summit)
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can either be deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications run within a project on a YARN cluster, with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview on our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin (Alex Zeltov)
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark, the Spark components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for the labs.
This document outlines a project to capture user location data and send it to a database for real-time analysis using Kafka and Spark streaming. It describes starting Zookeeper and Kafka servers, creating Kafka topics, producing and consuming messages with Java producers and consumers, using the Spark CLI, integrating Kafka and Spark for streaming, creating DataFrames and SQL queries, and saving data to PostgreSQL tables for further processing and analysis. The goal is to demonstrate real-time data streaming and analytics on user location data.
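A stdlib-only sketch of that pipeline's shape (names and data are illustrative; the real project uses Kafka producers/consumers, Spark Streaming, and PostgreSQL):

```python
import json
from collections import Counter
from queue import Queue

topic = Queue()  # stands in for the Kafka topic

# "producer": emit user-location events as JSON messages
for user, city in [("u1", "Berlin"), ("u2", "Pune"), ("u1", "Berlin")]:
    topic.put(json.dumps({"user": user, "city": city}))

# "streaming job": drain the current micro-batch and count visits per city
batch = []
while not topic.empty():
    batch.append(json.loads(topic.get()))
counts = Counter(event["city"] for event in batch)

# rows as they would be written to the database table for analysis
rows = sorted(counts.items())
assert rows == [("Berlin", 2), ("Pune", 1)]
```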
An over-ambitious introduction to Spark programming, testing and deployment. This slide deck tries to cover most of the core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to cause the transparent background color to render improperly. This has been fixed in a recent upload.
Sqoop on Spark for Data Ingestion - Veena Basavaraj and Vinoth Chandar, Uber (Spark Summit)
This document discusses running Sqoop jobs on Apache Spark for faster data ingestion into Hadoop. The authors describe how Sqoop jobs can be executed as Spark jobs by leveraging Spark's faster execution engine compared to MapReduce. They demonstrate running a Sqoop job to ingest data from MySQL to HDFS using Spark and show it is faster than using MapReduce. Some challenges encountered are managing dependencies and job submission, but overall it allows leveraging Sqoop's connectors within Spark's distributed processing framework. Next steps include exploring alternative job submission methods in Spark and adding transformation capabilities to Sqoop connectors.
Spark is a general-purpose cluster computing framework that provides high-level APIs and is faster than Hadoop for iterative jobs and interactive queries. It leverages cached data in cluster memory across nodes for faster performance. Spark supports various higher-level tools including SQL, machine learning, graph processing, and streaming.
This document discusses Apache Spark, a fast and general engine for large-scale data processing. It introduces Spark's Resilient Distributed Datasets (RDDs) and its programming model using transformations and actions. It provides instructions for installing Spark and launching it on Amazon EC2. It includes an example word count program in Spark and compares its performance to MapReduce. Finally, it briefly describes MLlib, Spark's machine learning library, and provides an example of the k-means clustering algorithm.
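The word-count example's transformation/action style can be mimicked in plain Python (in real Spark this would be `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into a stream of words
words = list(chain.from_iterable(line.split() for line in lines))
# map + reduceByKey: pair each word with 1, then sum per key
counts = Counter(words)

assert counts["to"] == 4 and counts["be"] == 2 and counts["see"] == 2
```

The key difference in Spark is that the transformations are lazy and distributed: nothing executes until an action (like `collect` or `count`) forces evaluation across the cluster.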
Real time data viz with Spark Streaming, Kafka and D3.js (Ben Laird)
This document discusses building a dynamic visualization of large streaming transaction data. It proposes using Apache Kafka to handle the transaction stream, Apache Spark Streaming to process and aggregate the data, MongoDB for intermediate storage, a Node.js server, and Socket.io for real-time updates. Visualization would use Crossfilter, DC.js and D3.js to enable interactive exploration of billions of records in the browser.
Continuous Processing in Structured Streaming with Jose Torres (Databricks)
This talk will cover the details of Continuous Processing in Structured Streaming and my work implementing the initial version in Spark 2.3, as well as the updates for 2.4. DStreams was Spark's first attempt at streaming, and through DStreams Spark became the first framework to provide both batch and streaming functionality in one unified execution engine.
Streaming execution happens through this "micro-batch" model, in which the underlying execution engine simply runs on batches of data over and over again. DStreams' design tightly couples the user-facing APIs with the execution model, and as a result it was very difficult to accomplish certain tasks important in streaming, e.g. using event time and handling late data, without breaking the user-facing APIs. Structured Streaming is the second (and latest) major streaming effort in Spark. Its design decouples the frontend (user-facing APIs) from the backend (execution), and allows the execution model to change without any user API change.
However, the (historical) minimum possible latency for any record in DStreams or Structured Streaming was bounded by the amount of time it takes to launch a task. This limitation results from the fact that the engine must know both the starting and ending offsets before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time plus the task-launch time. Continuous Processing removes this constraint and allows users to achieve sub-millisecond end-to-end latencies with the new execution engine.
This talk will take a technical deep dive into its capabilities, what it took to implement, and discuss the future developments.
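The latency arithmetic above can be put in a back-of-envelope model (all numbers are illustrative assumptions, not measurements):

```python
# Micro-batch model: a record's end-to-end latency is roughly the
# average batch time plus the task-launch overhead, since tasks are
# only scheduled once the batch's start and end offsets are known.
batch_time_ms = 1000   # average micro-batch duration (assumption)
task_launch_ms = 100   # per-batch scheduling overhead (assumption)
micro_batch_latency_ms = batch_time_ms + task_launch_ms

# Continuous processing keeps long-running tasks, so a record pays
# only its own processing cost -- sub-millisecond in the best case.
continuous_latency_ms = 0.5  # per-record processing time (assumption)

assert micro_batch_latency_ms == 1100
assert continuous_latency_ms < micro_batch_latency_ms
```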
Intellipaat (www.intellipaat.com) is a young, dynamic online training provider driving education for employability and career advancement across the globe, known as a "one-stop training shop" for high-end technical training. Learn niche Business Intelligence, Database, Big Data, and cloud computing technologies:
Business Intelligence/Database
Tableau Server, Business Objects, Spotfire, DataStage, OBIEE, QlikView, Hyperion, MicroStrategy, Pentaho, Cognos, Informatica, Talend, Oracle Developer, Oracle DBA, Data Modeling, SAP Business Objects, SAP HANA, etc.
Big Data/Cloud Computing
Spark, Storm, Scala, Mahout (machine learning), Hadoop, Cassandra, HBase, Solr, Splunk, OpenStack, etc.
Since we started our journey, we have trained over 120,000 professionals with 50 corporate clients across the globe. Intellipaat has offices in India (Jaipur, Bangalore), the US, the UK, and Canada.
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark (Evan Chan)
You want to ingest event, time-series, streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk of combining Apache Cassandra and Apache Spark, using a new open-source database, FiloDB.
Building a REST Job Server for Interactive Spark as a Service (Cloudera, Inc.)
This document describes building a REST job server for interactive Spark as a service using Livy. It discusses the history and challenges of running Spark jobs in Hue, introduces Livy as a Spark server, and details its local and YARN-cluster modes as well as session creation, execution flows, and interpreter support for Scala, Python, R and more. Magic commands are also described for JSON, table, and plot output.
Livy is an open source REST service for interacting with and managing Spark contexts and jobs. It allows clients to submit Spark jobs via REST, monitor their status, and retrieve results. Livy manages long-running Spark contexts in a cluster and supports running multiple independent contexts simultaneously from different clients. It provides client APIs in Java, Scala, and soon Python to interface with the Livy REST endpoints for submitting, monitoring, and retrieving results of Spark jobs.
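A minimal sketch of how a client might build Livy requests (the endpoint paths follow Livy's REST API and the port is Livy's default, but treat the details as assumptions; actually issuing the requests needs a running Livy server):

```python
import json

BASE = "http://localhost:8998"  # Livy's default port

def create_session_payload(kind="pyspark"):
    # body for POST /sessions -- starts a long-running interactive context
    return json.dumps({"kind": kind})

def statement_url(session_id):
    # POST code snippets here; each returns a statement id whose result
    # can then be fetched with a GET on /statements/<id>
    return f"{BASE}/sessions/{session_id}/statements"

def statement_payload(code):
    return json.dumps({"code": code})
```

A typical flow is: create a session, poll it until it reaches the `idle` state, then POST statements and poll each one for its output.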
Productionizing Spark and the Spark Job ServerEvan Chan
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
Jaws - Data Warehouse with Spark SQL by Ema OrhianSpark Summit
1) Jaws is a highly scalable and resilient data warehouse explorer that allows submitting Spark SQL queries concurrently and asynchronously through a RESTful API.
2) It provides features like persisted query logs, results pagination, and pluggable storage layers. Queries can be run on Spark SQL contexts configured to use data from HDFS, Cassandra, Parquet files on HDFS or Tachyon.
3) The architecture allows Jaws to scale on standalone, Mesos, or YARN clusters by distributing queries across multiple worker nodes, and supports canceling running queries.
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
This document discusses speeding up OLAP cube building in Apache Kylin using Spark. Cubing with MapReduce can be slow due to serialization overhead and repeated job submissions. Spark allows caching data in memory across cuboid layers in one job, significantly reducing build times compared to MapReduce as shown in a benchmark on a 160 million row dataset. Spark simplifies Kylin development and brings capabilities for real-time OLAP and cloud integration.
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture, top features and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. Also it covers real life use cases related to one of ours commercial projects and recall roadmap how we’ve integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaDatabricks
It’s not enough to build a mesh of sensors or embedded devices to get more insights about the surrounding environment and optimize your production. Usually, your IoT solution needs to be capable of transferring enormous amounts of data to a storage or cloud where the data has to be processed further. Quite often, the processing of the endless streams of data has to be done almost in real-time so that you can react on the IoT subsystem’s state accordingly, and in time.
During this session, see how to build a Fast Data solution that will receive endless streams from the IoT side and will be capable of processing the streams in real-time using Apache Ignite’s cluster resources. In particular, learn about data streaming to an Apache Ignite cluster from embedded devices and real-time data processing with Apache Spark.
Big Data visualization with Apache Spark and Zeppelinprajods
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin(incubator). Zeppelin is the open source tool for data discovery, exploration and visualization. It supports REPLs for shell, SparkSQL, Spark(scala), python and angular. This presentation was made on the Big Data Day, at the Great Indian Developer Summit, Bangalore, April 2015
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy-to-debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Graphana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview on our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark, the Spark components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create Hive tables, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for labs.
This document outlines a project to capture user location data and send it to a database for real-time analysis using Kafka and Spark streaming. It describes starting Zookeeper and Kafka servers, creating Kafka topics, producing and consuming messages with Java producers and consumers, using the Spark CLI, integrating Kafka and Spark for streaming, creating DataFrames and SQL queries, and saving data to PostgreSQL tables for further processing and analysis. The goal is to demonstrate real-time data streaming and analytics on user location data.
An over-ambitious introduction to Spark programming, testing, and deployment. This slide deck tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to cause the transparent background color to render improperly; this has been fixed in a recent upload.
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
This document discusses running Sqoop jobs on Apache Spark for faster data ingestion into Hadoop. The authors describe how Sqoop jobs can be executed as Spark jobs by leveraging Spark's faster execution engine compared to MapReduce. They demonstrate running a Sqoop job to ingest data from MySQL to HDFS using Spark and show it is faster than using MapReduce. Some challenges encountered are managing dependencies and job submission, but overall it allows leveraging Sqoop's connectors within Spark's distributed processing framework. Next steps include exploring alternative job submission methods in Spark and adding transformation capabilities to Sqoop connectors.
Spark is a general-purpose cluster computing framework that provides high-level APIs and is faster than Hadoop for iterative jobs and interactive queries. It leverages cached data in cluster memory across nodes for faster performance. Spark supports various higher-level tools including SQL, machine learning, graph processing, and streaming.
This document discusses Apache Spark, a fast and general engine for large-scale data processing. It introduces Spark's Resilient Distributed Datasets (RDDs) and its programming model using transformations and actions. It provides instructions for installing Spark and launching it on Amazon EC2. It includes an example word count program in Spark and compares its performance to MapReduce. Finally, it briefly describes MLlib, Spark's machine learning library, and provides an example of the k-means clustering algorithm.
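The word count the deck walks through can be mimicked in plain Python to show the shape of Spark's flatMap → map → reduceByKey pipeline (an illustrative sketch, not actual Spark API calls):

```python
from collections import defaultdict

def word_count(lines):
    """Mimic Spark's word count stages on plain Python lists."""
    # flatMap: split each line into words
    words = [w for line in lines for w in line.split()]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts = defaultdict(int)
    for w, n in pairs:
        counts[w] += n
    return dict(counts)

print(word_count(["to be or not", "to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark the same three stages run partition-by-partition across the cluster, with a shuffle before the reduce step.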
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
This document discusses building a dynamic visualization of large streaming transaction data. It proposes using Apache Kafka to handle the transaction stream, Apache Spark Streaming to process and aggregate the data, MongoDB for intermediate storage, a Node.js server, and Socket.io for real-time updates. Visualization would use Crossfilter, DC.js and D3.js to enable interactive exploration of billions of records in the browser.
Continuous Processing in Structured Streaming with Jose TorresDatabricks
This talk will cover the details of Continuous Processing in Structured Streaming and my work implementing the initial version in Spark 2.3 as well as the updates for 2.4. DStreams was Spark’s first attempt at streaming, and through DStreams Spark became the first framework to provide both batch and streaming functionality in one unified execution engine.
Streaming execution happens through a “micro-batch” model, in which the underlying execution engine simply runs on batches of data over and over again. DStreams’ design tightly couples the user-facing APIs with the execution model, and as a result it was very difficult to accomplish certain tasks important in streaming, e.g. using event time and working with late data, without breaking the user-facing APIs. Structured Streaming was the second (and latest) major streaming effort in Spark. Its design decouples the frontend (user-facing APIs) from the backend (execution), and allows the execution model to change without any user-API change.
However, the (historical) minimum possible latency for any record in DStreams or Structured Streaming was bounded by the amount of time it takes to launch a task. This limitation results from the fact that the engine must know both the starting and the ending offset before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time plus the task-launching time. Continuous Processing removes this constraint and allows users to achieve sub-millisecond end-to-end latencies with the new execution engine.
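The latency argument above can be sketched with a little arithmetic (an illustrative model with made-up numbers, not measurements):

```python
def microbatch_worst_case_latency(batch_interval_s, task_launch_s):
    """A record arriving just after a batch boundary waits a full interval
    before its batch even forms, then pays the per-batch task-launch cost."""
    return batch_interval_s + task_launch_s

def continuous_latency(per_record_overhead_s):
    """Continuous Processing launches long-running tasks once, so a record's
    latency is just the per-record processing overhead."""
    return per_record_overhead_s

# With a 1s trigger and ~100ms task launch, the micro-batch worst case
# is ~1.1s, while continuous execution is bounded only by record handling.
print(microbatch_worst_case_latency(1.0, 0.1))
print(continuous_latency(0.001))
```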
This talk will take a technical deep dive into its capabilities, what it took to implement, and discuss the future developments.
INTELLIPAAT (www.intellipaat.com) is a young, dynamic online training provider driving education for employability and career advancement across the globe, known as a "one-stop training shop" for high-end technical training. Learn any niche Business Intelligence, database, Big Data, or cloud computing technology:
Business Intelligence/Database
Tableau Server, Business Objects, Spotfire, DataStage, OBIEE, QlikView, Hyperion, MicroStrategy, Pentaho, Cognos, Informatica, Talend, Oracle Developer, Oracle DBA, Data Modeling, SAP BusinessObjects, SAP HANA, etc.
Big Data/Cloud Computing
Spark, Storm, Scala, Mahout (machine learning), Hadoop, Cassandra, HBase, Solr, Splunk, OpenStack, etc.
Since we started our journey, we have trained over 120,000 professionals across 50 corporate clients around the globe. Intellipaat has offices in India (Jaipur and Bangalore), the US, the UK, and Canada.
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
You want to ingest event, time-series, streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk of combining Apache Cassandra and Apache Spark, using a new open-source database, FiloDB.
Building a REST Job Server for Interactive Spark as a ServiceCloudera, Inc.
This document describes building a REST job server for interactive Spark as a service using Livy. It discusses the history and challenges of running Spark jobs in Hue, introduces Livy as a Spark server, and details its local and YARN-cluster modes as well as session creation, execution flows, and interpreter support for Scala, Python, R and more. Magic commands are also described for JSON, table, and plot output.
Livy is an open source REST service for interacting with and managing Spark contexts and jobs. It allows clients to submit Spark jobs via REST, monitor their status, and retrieve results. Livy manages long-running Spark contexts in a cluster and supports running multiple independent contexts simultaneously from different clients. It provides client APIs in Java, Scala, and soon Python to interface with the Livy REST endpoints for submitting, monitoring, and retrieving results of Spark jobs.
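Livy's documented REST endpoints (`POST /sessions` to create a Spark context, `POST /sessions/{id}/statements` to submit code to it) can be sketched by building the requests as data; the host name and the submitted snippet are placeholders of my own:

```python
import json

LIVY = "http://livy-server:8998"  # hypothetical Livy host

# 1. Create an interactive PySpark session (Livy's POST /sessions endpoint).
create_session = {
    "method": "POST",
    "url": f"{LIVY}/sessions",
    "body": {"kind": "pyspark"},
}

# 2. Submit code to a running session (POST /sessions/{id}/statements).
def submit_statement(session_id, code):
    return {
        "method": "POST",
        "url": f"{LIVY}/sessions/{session_id}/statements",
        "body": {"code": code},
    }

req = submit_statement(0, "sc.parallelize(range(100)).sum()")
print(json.dumps(req["body"]))
```

A real client would POST these with an HTTP library and then poll the statement URL until the result JSON arrives.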
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies.
We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts.
We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.
This document provides an overview of the Internet of Things (IoT) ecosystem and business models. It discusses how IoT connects everyday physical objects to the internet to collect and share data. Examples mentioned include wearable health devices, smart homes, connected cars, and tracking tools for cows and sports equipment. The document also outlines common IoT technology stacks involving hardware platforms, programming languages, and GUI tools. It emphasizes the importance of prototyping, understanding user needs, mobility, analytics, and algorithms for developing successful IoT products and business models.
This document discusses opportunities in data analytics. It notes that industries like finance, marketing, telecommunications, education, research, and healthcare are pursuing opportunities in data analytics. Specific opportunities for India's IT outsourcing firms in healthcare reform in the US are also mentioned. The document outlines the data analytics life cycle and what metrics a CEO may want to understand about user engagement and retention for a mobile game. It promotes hands-on examples to learn data analytics concepts.
Spatial Analysis On Histological Images Using SparkJen Aman
This document describes using Spark for spatial analysis of histological images to characterize the tumor microenvironment. The goal is to provide actionable data on the location and density of immune cells and blood vessels. Over 100,000 objects are annotated in each whole slide image. Spark is used to efficiently calculate over 5 trillion pairwise distances between objects within a neighborhood window. This enables profiling of co-localization and spatial clustering of objects. Initial results show the runtime scales linearly with the number of objects. Future work includes integrating clinical and genomic data to characterize variation between tumor types and patients.
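The windowed pairwise-distance idea can be sketched in plain Python with grid binning, so only nearby objects are ever compared (a toy single-machine version of what the talk shards across Spark executors; the function and parameter names are my own):

```python
from math import hypot
from collections import defaultdict

def neighborhood_distances(points, window):
    """Compute pairwise distances only between points in adjacent grid
    cells of side `window`, so distant pairs are never examined."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // window), int(p[1] // window))].append(p)
    dists = []
    for (cx, cy), cell in grid.items():
        # gather candidates from this cell and its 8 neighbors
        cand = [q for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                for q in grid.get((cx + dx, cy + dy), [])]
        for p in cell:
            for q in cand:
                if p < q:  # count each unordered pair exactly once
                    d = hypot(p[0] - q[0], p[1] - q[1])
                    if d <= window:
                        dists.append((p, q, d))
    return dists

print(neighborhood_distances([(0, 0), (1, 0), (10, 10)], window=2))
```

Distributing this amounts to keying objects by grid cell and joining each cell with its neighbors, which is how the trillions of candidate pairs stay tractable.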
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
This document summarizes the agenda and content covered in the Data Science Bootcamp Day 2. The topics covered include using Git and GitHub for version control, an introduction to Apache Maven for building Java projects, basic Hadoop commands for administration and running the WordCount example program on a Hadoop cluster using Maven. Key steps like cloning repositories, configuring Git, adding/committing/pushing changes, understanding the Maven lifecycle and running Hadoop jobs were demonstrated.
This document lists the titles of 15 horror films released between 2011-2014, including Silent Hill: Revelation 3D, Oculus, Mama, Sinister, and SMILEY. The list includes sequels, remakes, and original horror movies covering themes of supernatural entities, isolated families, and psychological terror.
Robert Montgomery provides his contact information and summarizes his experience and skills in information technology project management, technical support, and security administration. He has over 20 years of experience in various IT roles, including positions as IT Manager, Director of IT, and Security Associate. His technical skills include Windows, Exchange, hardware, networking, security systems, and certifications in A+, Microsoft, and Cisco technologies.
This document summarizes Kendall Moore's background and a project investigating spot weld modeling. It includes:
1) An introduction to Kendall Moore, including education and experience.
2) A description of the project to compare spot weld modeling using CBAR and CWELD elements to physical coupon testing under different parameters like element quality and connection configurations.
3) Results showing the CBAR model predicted lower stiffness but closer fatigue life to testing than CWELD, and stiffness and fatigue life were not sensitive to element size.
The magazine will be called "Rift" and will be published monthly, focusing on indie music. It will cost £3.60 and be distributed in stores like newsagents, supermarkets, HMV and music shops. The magazine will have an informal chatty style to involve readers. Regular content will include album reviews, interviews, song recommendations, top 10 lists, news, posters, and gig guides. Feature articles will cover topics like music festivals, concerts, band interviews and profiles, best songs lists, videos, and more. The magazine will use a combination of fonts like Berlin Sans FB for coverlines, Calibri for headlines, and Franklin Gothic for standfirsts.
This document discusses continuous delivery using containers. It describes how the company uses microservices architecture, scaling, full automation, and continuous delivery. Continuous delivery is defined as keeping software in short cycles so it can be reliably released at any time. The deployment pipeline includes committing code, running unit tests, integration tests, end-to-end tests, building Docker images, and pushing to a registry. CoreOS is also discussed as a lightweight OS that runs all services in containers and uses a distributed key-value store.
This document summarizes the agenda and key topics from Day 3 of a Data Science Bootcamp.
The agenda included:
- An introduction to Apache Spark and its configuration in single node and cluster modes.
- An introduction to Apache Kafka and its single node configuration including creating topics and pushing messages.
The document reviewed Spark and SQL contexts, Resilient Distributed Datasets (RDDs), and DataFrames - the primary Spark abstraction. It discussed DataFrame transformations and actions, caching data in memory/disk, and when to use DataFrames over RDDs. Finally, it provided an example of implementing MapReduce in Spark.
Numpy is the Python foundation for number crunching. It provides a high-level API, interactivity, visualization, and performance for numerical computing while also allowing low-level access. It uses a simple but powerful memory model and array data structure. Numpy powers many scientific computing libraries in Python and is demonstrated through examples of its API, memory management capabilities using strides, and extensions that build on it like TA-Lib for financial analysis.
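The strides-based memory model the summary mentions can be seen directly in NumPy: `sliding_window_view` builds a zero-copy moving window, the kind of trick rolling computations in TA-Lib-style libraries rely on (a minimal sketch):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(6, dtype=np.int64)   # [0 1 2 3 4 5], 8 bytes per element
print(a.strides)                   # (8,): step 8 bytes to the next element

# A zero-copy moving window of length 3: every row is a view into the
# same buffer, so no data is duplicated.
w = sliding_window_view(a, 3)
print(w.shape)                     # (4, 3)
print(w.strides)                   # rows advance by just one element
```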
This is an introductory tutorial to Apache Spark from the Lagos Scala Meetup II. We discussed the basics of the processing engine, Spark, and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Spark is an open-source cluster computing framework that uses in-memory processing to share data across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
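The lineage-based fault tolerance mentioned above can be illustrated with a toy class: each derived dataset records only its parent and the transformation used to derive it, so lost results are recomputed rather than replicated (not real Spark, just the idea):

```python
class TinyRDD:
    """Toy lineage tracking: a 'lost' result can always be rebuilt by
    walking back to a source dataset and replaying the transformations."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data, self._parent, self._fn = data, parent, fn

    def map(self, fn):
        # No computation happens here: we only record the lineage edge.
        return TinyRDD(parent=self, fn=fn)

    def compute(self):
        if self._data is not None:          # source dataset
            return list(self._data)
        # recompute from lineage: materialize the parent, apply this step
        return [self._fn(x) for x in self._parent.compute()]

base = TinyRDD(data=[1, 2, 3])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)
print(derived.compute())   # → [11, 21, 31], rebuilt on demand from lineage
```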
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. The session covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
Since 2014, Typesafe has been actively contributing to the Apache Spark project, and has become a certified development support partner of Databricks, the company started by the creators of Spark. Typesafe and Mesosphere have forged a partnership in which Typesafe is the official commercial support provider of Spark on Apache Mesos, along with Mesosphere’s Datacenter Operating Systems (DCOS).
In this webinar, Iulian Dragos, Spark team lead at Typesafe Inc., reveals how Typesafe supports running Spark in various deployment modes, along with the improvements made to Spark to help integrate backpressure signals into the underlying technologies, making it a better fit for Reactive Streams. He also shows the functionality at work, and how Typesafe makes it simple to deploy Spark on Mesos.
We will introduce:
Various deployment modes for Spark: Standalone, Spark on Mesos, and Spark with Mesosphere DCOS
Overview of Mesos and how it relates to Mesosphere DCOS
Deeper look at how Spark runs on Mesos
How to manage coarse-grained and fine-grained scheduling modes on Mesos
What to know about a client vs. cluster deployment
A demo running Spark on Mesos
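Assuming the standard `spark-submit` flags, switching among these Mesos deployment modes mostly comes down to the master URL, the deploy mode, and the `spark.mesos.coarse` setting that toggles coarse- vs fine-grained scheduling in Spark 1.x/2.x (a sketch; the application file and master URL are placeholders):

```python
def spark_submit_args(master, deploy_mode, coarse=True):
    """Assemble a spark-submit invocation for Mesos as an argument list."""
    return [
        "spark-submit",
        "--master", master,                # e.g. mesos://zk://host:2181/mesos
        "--deploy-mode", deploy_mode,      # "client" or "cluster"
        "--conf", f"spark.mesos.coarse={str(coarse).lower()}",
        "my_app.py",                       # hypothetical application
    ]

print(" ".join(spark_submit_args("mesos://zk://host:2181/mesos", "cluster")))
```

In client mode the driver runs where you invoke this command; in cluster mode the driver itself is scheduled on Mesos, which is what lets jobs survive a disconnecting client.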
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark, which makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources including HDFS, Hive, JSON, and S3.
Lessons Learned From Running Spark On DockerSpark Summit
Running Spark on Docker containers provides flexibility for data scientists and control for IT. Some key lessons learned include optimizing CPU and memory resources to avoid noisy neighbor problems, managing Docker images efficiently, using network plugins for multi-host connectivity, and addressing storage and security considerations. Performance testing showed Spark on Docker containers can achieve comparable performance to bare metal deployments for large-scale data processing workloads.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine LearningChris Fregly
This document provides an overview and summary of Spark Streaming. It discusses Spark Streaming's architecture and APIs. Spark Streaming receives live input data streams and divides them into micro-batches, which it processes using Spark's execution engine to perform operations like transformations and actions. This allows for low-latency, high-throughput stream processing with fault tolerance. The document also covers Spark Streaming deployment and integrating it with sources like Kinesis, as well as monitoring and tuning Spark Streaming applications.
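The micro-batch model described above can be sketched in a few lines: the stream is cut into small batches, and each batch is processed with ordinary batch logic (batching here is by record count for simplicity; Spark Streaming buckets by time interval):

```python
def micro_batches(stream, batch_size):
    """Divide an already-collected event stream into fixed-size
    micro-batches, the way Spark Streaming buckets records by interval."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

# Apply a per-batch transformation + action, like DStream.foreachRDD:
totals = [sum(batch) for batch in micro_batches([1, 2, 3, 4, 5], 2)]
print(totals)   # → [3, 7, 5]
```

Because each batch is just a small dataset, the same fault-tolerance and execution machinery as batch Spark applies unchanged.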
Luca Canali presented on using flame graphs to investigate performance improvements in Spark 2.0 over Spark 1.6 for a CPU-intensive workload. Flame graphs of the Spark 1.6 and 2.0 executions showed Spark 2.0 spending less time in core Spark functions and more time in whole stage code generation functions, indicating improved optimizations. Additional tools like Linux perf confirmed Spark 2.0 utilized CPU and memory throughput better. The presentation demonstrated how flame graphs and other profiling tools can help pinpoint performance bottlenecks and understand the impact of changes like Spark 2.0's code generation optimizations.
1. Apache Spark is an open source cluster computing framework for large-scale data processing. It is compatible with Hadoop and provides APIs for SQL, streaming, machine learning, and graph processing.
2. Over 3000 companies use Spark, including Microsoft, Uber, Pinterest, and Amazon. It can run on standalone clusters, EC2, YARN, and Mesos.
3. Spark SQL, Streaming, and MLlib allow for SQL queries, streaming analytics, and machine learning at scale using Spark's APIs which are inspired by Python/R data frames and scikit-learn.
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
The document summarizes Pinterest's migration of ETL workflows from Cascading and Scalding to Spark. Key points:
- Pinterest runs Spark on AWS but manages its own clusters to avoid vendor lock-in. They have multiple Spark clusters with hundreds to thousands of nodes.
- The migration plan is to move remaining workloads from Hive, Cascading/Scalding, and Hadoop streaming to SparkSQL, PySpark, and native Spark over time. An automatic migration service helps with the process.
- Technical challenges included secondary sorting, accumulators behaving differently between frameworks, and output committer issues. Performance profiling and tuning was also important.
- Results of migrating so far include
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent
This document introduces Kafka Streams and provides an example of using it to process streaming data from Apache Kafka. It summarizes some key limitations of using Apache Spark for streaming use cases with Kafka before demonstrating how to build a simple text processing pipeline with Kafka Streams. The document also discusses parallelism, state stores, aggregations, joins and deployment considerations when using Kafka Streams. It provides an example of how Kafka Streams was used to aggregate metrics from multiple instances of an application into a single stream.
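The metrics-aggregation use case can be sketched as a toy version of a Kafka Streams `groupByKey().reduce()` over a state store (real Kafka Streams is Java and keeps its stores in RocksDB; this is an illustrative Python analogue):

```python
from collections import defaultdict

def aggregate_metrics(*instance_streams):
    """Fold (metric, value) records from several application instances
    into one aggregated view, via a running sum per key."""
    store = defaultdict(int)           # the "state store"
    for stream in instance_streams:
        for metric, value in stream:
            store[metric] += value     # reduce step: running sum per key
    return dict(store)

merged = aggregate_metrics(
    [("requests", 3), ("errors", 1)],  # instance A
    [("requests", 5)],                 # instance B
)
print(merged)   # → {'requests': 8, 'errors': 1}
```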
This document summarizes Lightning-Fast Cluster Computing with Spark and Shark, a presentation about the Spark and Shark frameworks. Spark is an open-source cluster computing system that aims to provide fast, fault-tolerant processing of large datasets. It uses resilient distributed datasets (RDDs) and supports diverse workloads with sub-second latency. Shark is a system built on Spark that exposes the HiveQL query language and compiles queries down to Spark programs for faster, interactive analysis of large datasets.
What no one tells you about writing a streaming apphadooparchbook
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
So you know you want to write a streaming app, but the developer of any non-trivial streaming app has to think about these questions:
How do I manage offsets?
How do I manage state?
How do I make my spark streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shutdown my streaming job?
How do I monitor and manage my streaming job (e.g., retry logic)?
How can I better manage the DAG in my streaming job?
When to use checkpointing and for what? When not to use checkpointing?
Do I need a WAL when using streaming data source? Why? When don’t I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
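One common answer to the offset-management and data-loss questions above is a commit-after-process loop, sketched here framework-agnostically (at-least-once delivery; an idempotent sink makes it effectively exactly-once):

```python
def run_stream(fetch, process, load_offset, save_offset):
    """At-least-once loop: commit an offset only after its batch is fully
    processed, so a crash replays the batch instead of losing it."""
    offset = load_offset()
    while True:
        batch = fetch(offset)
        if not batch:
            break
        process(batch)
        offset += len(batch)
        save_offset(offset)   # commit AFTER processing, never before

# Toy run over an in-memory "topic":
topic, out, state = list(range(10)), [], {"off": 0}
run_stream(
    fetch=lambda o: topic[o:o + 4],
    process=out.extend,
    load_offset=lambda: state["off"],
    save_offset=lambda o: state.update(off=o),
)
print(state["off"], out == topic)   # → 10 True
```

Committing before processing would invert the failure mode: a crash between commit and process silently drops a batch.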
The document discusses tools and techniques used by Uber's Hadoop team to make their Spark and Hadoop platforms more user-friendly and efficient. It introduces tools like SCBuilder to simplify Spark context creation, Kafka dispersal to distribute RDD results, and SparkPlug to provide templates for common jobs. It also describes a distributed log debugger called SparkChamber to help debug Spark jobs and techniques like building a spatial index to optimize geo-spatial joins. The goal is to abstract out infrastructure complexities and enforce best practices to make the platforms more self-service for users.
Similar to Productionizing Spark and the REST Job Server- Evan Chan (20)
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It targets leveraging FPGAs' highly parallel computing capability to accelerate Spark SQL queries, and, because FPGAs have higher power efficiency than CPUs, it lowers power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic computations: substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, each of which is fed into an Engine Unit according to its pattern. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries run on the FPGA-based Spark SQL acceleration architecture (XEON E5 plus FPGA) against Spark SQL queries on XEON E5 alone, showing 10X to 100X improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
In this talk, we’ll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems, which are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing Matplotlib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line has a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records, etc. There are two main data scenarios: 1) picture and video data, low in frequency but large in volume; 2) continuous high-frequency data, such as the vibration data used to detect equipment quality, where each record is small but the total amount is very large. These data have the characteristics of streaming data: real-time, volatile, bursty, disordered and unbounded. Making effective real-time decisions to retrieve value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput and reliable operation system covering data acquisition, transmission, analysis and storage. An actual user case proved that the system meets the needs of real-time decision-making. The system greatly improves predictive fault repair in the production process and the efficiency of production-line material tracking, and can reduce the labor force required for the production lines by about half.
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, helping them find the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark used to build prediction models from weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark for analyzing geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work underway to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based Graph Analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this Deep-Dive by example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif api as we show examples of both simultaneously. No need to be familiar with Graphs or Spark for this presentation as we’ll be explaining everything from the ground up!
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper-parameter tuning, feature selection and so on. Recently, efforts to break through these "black arts" have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF'17 (#SFds5) but from a more technical perspective.
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
In Sweden, from the RISE ICE Data Center at www.hops.site, we are providing to researchers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks' Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin KimSpark Summit
In Between (a mobile app for couples, downloaded 20M times globally), Spark is used for everything from daily batch jobs that extract metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and extensibility of Spark, data operations have become extremely efficient. The entire team, including Biz Dev, Global Operations and Designers, works with the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and Data Team leader of Between, will present how things are going at Between. Listeners will learn how a small and agile team lives with data (how they build the organization, culture and technical base).
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying projects and the fluency of communication between the different profiles of people involved in those projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim SimeonovSpark Summit
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and provides notifications describing the problem detected, so that users can act quickly to avoid serious problems that directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code, and you will be able to adapt and implement it in your company. This solution already helped prevent about US$3M in losses last year.
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by: – The application of Databricks and AWS makes this a scalable implementation; compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. – We create a fingerprint for each content item with deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence it can offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine. – Traditional author-disambiguation or record-deduplication algorithms are batch-processing with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback, so it is crucial to maintain historical profiles; we have therefore developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw output of the pairwise similarity function into final clusters.
Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL, and optimize the matrix execution plan based on the Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
Global Situational Awareness of A.I. and where it's headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
2. Who am I
• Distinguished Engineer, Tuplejump
• @evanfchan
• http://github.com/velvia
• User and contributor to Spark since 0.9
• Co-creator and maintainer of Spark Job Server
3. TupleJump
✤ Tuplejump is a big data technology leader providing solutions and development partnership.
✤ FiloDB - Spark-based analytics database for time series and event data (github.com/tuplejump/FiloDB)
✤ Calliope - the first Spark-Cassandra integration
✤ Stargate - an open source Lucene indexer for Cassandra
✤ SnackFS - open source HDFS for Cassandra
6. Choices, choices, choices
• YARN, Mesos, Standalone?
• With a distribution?
• What environment?
• How should I deploy?
• Hosted options?
• What about dependencies?
8. What all the clusters have in common
• YARN, Mesos, and Standalone all support the following features:
–Running the Spark driver app in cluster mode
–Restarts of the driver app upon failure
–UI to examine state of workers and apps
9. Spark Standalone Mode
• The easiest clustering mode to deploy**
–Use make-distribution.sh to package, copy to all nodes
–sbin/start-master.sh on master node, then start slaves
–Test with spark-shell
• HA Master through ZooKeeper election
• Must dedicate the whole cluster to Spark
• In the latest survey, used by almost half of Spark users
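The steps above can be sketched as a handful of commands (a sketch assuming the same Spark distribution is unpacked at the same path on every node; `master-host` is a placeholder, and the exact start-slave.sh arguments vary slightly between Spark releases):

```bash
# On the master node: start the standalone Master (web UI on port 8080 by default)
./sbin/start-master.sh

# On each worker node: register a Worker with the Master's spark:// URL
./sbin/start-slave.sh spark://master-host:7077

# Smoke-test from any machine with the distribution: run spark-shell against the cluster
./bin/spark-shell --master spark://master-host:7077
```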
10. Apache Mesos
• Was started by Matei in 2007, before he worked on Spark!
• Can run your entire company on Mesos, not just big data
–Great support for microservices - Docker, Marathon
–Can run non-JVM workloads like MPI
• Commercial backing from Mesosphere
• Heavily used at Twitter and Airbnb
• The Mesosphere DCOS will revolutionize Spark et al. deployment - “dcos package install spark” !!
11. Mesos vs YARN
• Mesos is a two-level resource manager, with pluggable schedulers
–You can run YARN on Mesos, with YARN delegating resource offers to Mesos (Project Myriad)
–You can run multiple schedulers within Mesos, and write your own
• If you’re already a Hadoop / Cloudera etc. shop, YARN is the easy choice
• If you’re starting out, go 100% Mesos
12. Mesos Coarse vs Fine-Grained
• Spark offers two modes to run Mesos Spark apps in (and you can choose per driver app):
–coarse-grained: Spark allocates a fixed number of workers for the duration of the driver app
–fine-grained (default): dynamic executor allocation per task, but higher overhead per task
• Use coarse-grained if you run low-latency jobs
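Since the mode is chosen per driver app, it is just a submit-time flag; a hedged sketch (the master URL, class and jar names are placeholders):

```bash
# Coarse-grained mode: hold a fixed set of workers for the lifetime of the driver app
spark-submit --master mesos://mesos-master:5050 \
  --conf spark.mesos.coarse=true \
  --class com.example.MyApp my-app.jar
```

Leaving spark.mesos.coarse unset keeps the fine-grained default described above.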
13. What about Datastax DSE?
• Cassandra, Hadoop, Spark all bundled in one distribution, collocated
• Custom cluster manager and HA/failover logic for Spark Master, using Cassandra gossip
• Can use CFS (Cassandra-based HDFS), SnackFS, or plain Cassandra tables for storage
–or use Tachyon to cache, then no need to collocate (use Mesosphere DCOS)
14. Hosted Apache Spark
• Spark on Amazon EMR - first class citizen now
–Direct S3 access!
• Google Compute Engine - “Click to Deploy” Hadoop+Spark
• Databricks Cloud
• Many more coming
• What you notice about the different environments:
–Everybody has their own way of starting: spark-submit vs dse spark vs aws emr … vs dcos spark …
15. Mesosphere DCOS
• Automates deployment to AWS, Google, etc.
• Common API and UI, better cost and control, cloud
• Load balancing and routing, Mesos for resource sharing
• dcos package install spark
17. Building Spark
• Make sure you build for the right Hadoop version
• eg mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
• Make sure you build for the right Scala version - Spark supports both 2.10 and 2.11
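A build for Scala 2.11 can be sketched as follows (a sketch only: the helper script name and profile flag shown here are from Spark 1.5-era source trees and vary by release):

```bash
# Rewrite the POMs for Scala 2.11, then build for the chosen Hadoop version
./dev/change-scala-version.sh 2.11
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Dscala-2.11 -DskipTests clean package
```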
18. Jars schmars
• Dependency conflicts are the worst part of Spark dev
• Every distro has slightly different jars - eg CDH < 5.4 packaged a different version of Akka
• Leave out Hive if you don’t need it
• Use the Spark UI “Environment” tab to check jars and how they got there
• spark-submit --jars / --packages forwards jars to every executor (unless it’s an HDFS / HTTP path)
• spark-env.sh SPARK_CLASSPATH - include dep jars you’ve deployed to every node
19. Jars schmars
• You don’t need to package every dependency with your Spark application!
• spark-streaming is included in the distribution
• spark-streaming includes some Kafka jars already
• etc.
20. ClassPath Configuration Options
• spark.driver.userClassPathFirst, spark.executor.userClassPathFirst
• One way to solve dependency conflicts - make sure YOUR jars are loaded first, ahead of Spark’s jars
• Client mode: use spark-submit options
• --driver-class-path, --driver-library-path
• Spark Classloader order of resolution
• Classes specified via --jars, --packages first (if above flag is set)
• Everything else in SPARK_CLASSPATH
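Putting these options together, a client-mode submission that prefers your jars over Spark's might look like the following sketch (jar paths and the application class are placeholders):

```bash
spark-submit --master spark://master-host:7077 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --driver-class-path /opt/deps/my-deps.jar \
  --jars /opt/deps/my-deps.jar \
  --class com.example.MyApp my-app.jar
```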
21. Some useful config options
21
• spark.serializer: org.apache.spark.serializer.KryoSerializer
• spark.default.parallelism: default # of partitions for shuffle/reduce tasks (or pass the partition count as the second arg to those ops)
• spark.scheduler.mode: FAIR - enable parallelism within apps (multi-tenant or low-latency apps like SQL server)
• spark.shuffle.memoryFraction, spark.storage.memoryFraction: fraction of Java heap to allocate for shuffle and RDD caching, respectively, before spilling to disk
• spark.cleaner.ttl: enables periodic cleanup of cached RDDs, good for long-lived jobs
• spark.akka.frameSize: increase the default of 10 (MB) to send back very large results to the driver app (code smell)
• spark.task.maxFailures: # of retries for a failed task is this value - 1
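These options are typically set together in conf/spark-defaults.conf; a sketch with illustrative (not recommended) values:

```
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.default.parallelism      200
spark.scheduler.mode           FAIR
spark.storage.memoryFraction   0.5
spark.shuffle.memoryFraction   0.3
spark.cleaner.ttl              3600
spark.akka.frameSize           64
spark.task.maxFailures         8
```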
22. Control Spark SQL Shuffles
• By default, Spark SQL / DataFrames will use 200
partitions when doing any groupBy / distinct operations
• sqlContext.setConf("spark.sql.shuffle.partitions", "16")
22
26. Run your apps in the cluster
• spark-submit: --deploy-mode cluster
• Spark Job Server: deploy SJS to the cluster
• Drivers and executors are very chatty - want to reduce latency and decrease
chance of networking timeouts
• Want to avoid running jobs on your local machine
26
27. Automatic Driver Restarts
• Standalone: --deploy-mode cluster --supervise
• YARN: --deploy-mode cluster
• Mesos: use Marathon to restart dead slaves
• Periodic checkpointing: important for recovering data
• RDD checkpointing helps reduce long RDD lineages
27
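On Standalone mode, the two flags combine as in the sketch below (master URL, class, and jar are placeholders); inside the streaming app itself, periodic checkpointing is enabled with a call like ssc.checkpoint("hdfs:///checkpoints/myapp"):

```shell
spark-submit --master spark://master:7077 \
  --deploy-mode cluster --supervise \
  --class com.example.StreamApp streamapp.jar
```

With --supervise, the Standalone master restarts the driver if it dies; the checkpoint lets the restarted driver recover streaming state.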
28. Speeding up application startup
• spark-submit’s --packages option is super convenient for downloading dependencies, but avoid it in production
• Downloads tons of jars from Maven when driver starts up, then executors
copy all the jars from driver
• Deploy frequently used dependencies to worker nodes yourself
• For really fast Spark jobs, use the Spark Job Server and share a SparkContext
amongst jobs!
28
29. Spark(Context) Metrics
• Spark’s built in MetricsSystem has sources (Spark info, JVM, etc.) and sinks
(Graphite, etc.)
• Configure metrics.properties (template in spark conf/ dir) and use these
params to spark-submit
--files=/path/to/metrics.properties
--conf spark.metrics.conf=metrics.properties
• See http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
29
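A minimal metrics.properties that sends every source to Graphite might look like this (host and port are placeholders):

```
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```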
30. Application Metrics
• Missing Hadoop counters? Use Spark Accumulators
• https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23
• Above registers accumulators as a source to Spark’s
MetricsSystem
30
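A minimal sketch of the accumulator pattern (the input path and three-field record format are made up; the gist above goes further and registers such accumulators with the MetricsSystem):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("acc-example"))

    // Accumulators only advance inside actions, never lazily
    val badRecords = sc.accumulator(0L, "badRecords")

    def parse(line: String): Option[Array[String]] = {
      val fields = line.split("\t")
      if (fields.length == 3) Some(fields) else None
    }

    val parsed = sc.textFile("hdfs:///logs/input").flatMap { line =>
      val rec = parse(line)
      if (rec.isEmpty) badRecords += 1
      rec
    }

    parsed.count()               // run an action before reading the value
    println(badRecords.value)    // also visible in the driver UI
    sc.stop()
  }
}
```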
31. Watch how RDDs are cached
• RDDs cached to disk could slow down computation
31
32. Are your jobs stuck?
• First check cluster resources - does a job have enough CPU/mem?
• Take a thread dump of executors:
32
33. The Worst Killer - Classpath
• Classpath / jar versioning issues may cause Spark to hang silently. Debug
using the Environment tab of the UI:
33
36. Spark Job Server - What
• REST Interface for your Spark jobs
• Streaming, SQL, extendable
• Job history and configuration logged to a database
• Enable interactive low-latency queries (SQL/DataFrames work too) of cached RDDs and tables
36
37. Spark Job Server - Where
37
[Architecture diagram: Kafka feeds Spark Streaming, which writes to a datastore read by Spark; internal users and Internet clients reach Spark through the Spark Job Server over HTTP/HTTPS]
38. Spark Job Server - Why
• Spark as a service
• Share Spark across the Enterprise
• HTTPS and LDAP Authentication
• Enterprises - easy integration with other teams, any language
• Share in-memory RDDs across logical jobs
• Low-latency queries
38
40. Used in Production
• As of last month, officially included in
Datastax Enterprise 4.8!
40
41. Active Community
• Large number of contributions from community
• HTTPS/LDAP contributed by team at KNIME
• Multiple committers
• Gitter IM channel, active Google group
41
42. Platform Independent
• Spark Standalone
• Mesos
• Yarn
• Docker
• Example: platform-independent LDAP auth, HTTPS, can be
used as a portal
42
44. Creating a Job Server Project
• sbt assembly -> fat jar -> upload to job server
• "provided" is used. Don’t want SBT assembly to include
the whole job server jar.
• Java projects should be possible too
44
• In your build.sbt, add this:
resolvers += "Job Server Bintray" at "https://dl.bintray.com/spark-jobserver/maven"
libraryDependencies += "spark.jobserver" % "job-server-api" % "0.5.0" % "provided"
45. /**
* A super-simple Spark job example that implements the SparkJob trait and
* can be submitted to the job server.
*/
object WordCountExample extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    Try(config.getString("input.string"))
      .map(x => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string"))
  }
  override def runJob(sc: SparkContext, config: Config): Any = {
    val dd = sc.parallelize(config.getString("input.string").split(" ").toSeq)
    dd.map((_, 1)).reduceByKey(_ + _).collect().toMap
  }
}
Example Job Server Job
45
46. What’s Different?
• Job does not create Context, Job Server does
• Decide when I run the job: in own context, or in pre-created context
• Allows for very modular Spark development
• Break up a giant Spark app into multiple logical jobs
• Example:
• One job to load DataFrames tables
• One job to query them
• One job to run diagnostics and report debugging information
46
47. Submitting and Running a Job
47
✦ curl --data-binary @../target/mydemo.jar
localhost:8090/jars/demo
OK
✦ curl -d "input.string = A lazy dog jumped mean dog"
'localhost:8090/jobs?appName=demo&classPath=WordCountExample
&sync=true'
{
"status": "OK",
"RESULT": {
"lazy": 1,
"jumped": 1,
"A": 1,
"mean": 1,
"dog": 2
}
}
50. Spark as a Query Engine
• Goal: Spark jobs that run in under a second and answer queries on shared RDD data
• Query params passed in as job config
• Need to minimize context creation overhead
–Thus many jobs sharing the same SparkContext
• On-heap RDD caching means no serialization loss
• Need to consider concurrent jobs (fair scheduling)
50
52. Sharing Data Between Jobs
• RDD Caching
–Benefit: no need to serialize data. Especially useful for indexes etc.
–Job server provides a NamedRdds trait for thread-safe CRUD of cached
RDDs by name
• (Compare to SparkContext’s API which uses an integer ID and is not
thread safe)
• For example, at Ooyala a number of fields are multiplexed into the RDD
name: timestamp:customerID:granularity
52
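A sketch of caching and fetching an RDD by name, assuming the job server's NamedRddSupport mixin (the key and input path are illustrative):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CachedEventsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // timestamp:customerID:granularity, per the Ooyala convention
    val key = "20150601:cust42:daily"
    val events = namedRdds.getOrElseCreate(key) {
      sc.textFile("hdfs:///events/cust42").cache()  // built once, reused by later jobs
    }
    events.count()
  }
}
```

A later job in the same context that calls namedRdds.get on the same key receives the already-cached RDD with no rebuild.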
53. Data Concurrency
• With fair scheduler, multiple Job Server jobs can run simultaneously on one
SparkContext
• Managing multiple updates to RDDs
–Cache keeps track of which RDDs are being updated
–Example: thread A spark job creates RDD “A” at t0
–thread B fetches RDD “A” at t1 > t0
–Both threads A and B, using NamedRdds, will get the RDD at time t2 when
thread A finishes creating the RDD “A”
53
54. Spark SQL/Hive Query Server
✤ Start a context based on SQLContext:
curl -d "" '127.0.0.1:8090/contexts/sql-context?context-factory=spark.jobserver.context.SQLContextFactory'
✤ Run a job for loading and caching tables in DataFrames
curl -d "" '127.0.0.1:8090/jobs?appName=test&classPath=spark.jobserver.SqlLoaderJob&context=sql-context&sync=true'
✤ Supply a query to a Query Job. All queries are logged in database by Spark Job Server.
curl -d 'sql="SELECT count(*) FROM footable"' '127.0.0.1:8090/jobs?appName=test&classPath=spark.jobserver.SqlQueryJob&context=sql-context&sync=true'
54
55. Example: Combining Streaming And Spark SQL
55
[Diagram: inside the Spark Job Server, a shared SparkSQLStreamingContext hosts both a Kafka Streaming Job and a SQL Query Job operating on common DataFrames; SQL queries are submitted to the server]
56. SparkSQLStreamingJob
56
trait SparkSqlStreamingJob extends SparkJobBase {
type C = SQLStreamingContext
}
class SQLStreamingContext(c: SparkContext) {
val streamingContext = new StreamingContext(c, ...)
val sqlContext = new SQLContext(c)
}
Now you have access to both StreamingContext and
SQLContext, and it can be shared across jobs!
57. SparkSQLStreamingContext
57
To start this context:
curl -d "" "localhost:8090/contexts/stream_sqltest?context-factory=com.abc.SQLStreamingContextFactory"
class SQLStreamingContextFactory extends SparkContextFactory {
import SparkJobUtils._
type C = SQLStreamingContext with ContextLike
def makeContext(config: Config, contextConfig: Config, contextName: String): C = {
val batchInterval = contextConfig.getInt("batch_interval")
val conf = configToSparkConf(config, contextConfig, contextName)
new SQLStreamingContext(new SparkContext(conf), Seconds(batchInterval)) with ContextLike {
def sparkContext: SparkContext = this.streamingContext.sparkContext
def isValidJob(job: SparkJobBase): Boolean = job.isInstanceOf[SparkSqlStreamingJob]
// Stop the streaming context, but not the SparkContext so that it can be re-used
// to create another streaming context if required:
def stop() { this.streamingContext.stop(false) }
}
}
}
59. Future Plans
• PR: Forked JVMs for supporting many concurrent
contexts
• True HA operation
• Swagger API documentation
59
60. HA for Job Server
[Diagram: a load balancer routes GET /jobs/<id> requests to Job Server 1 and Job Server 2; the servers share a database and coordinate via gossip, with one hosting the active job context]
60
61. HA and Hot Failover for Jobs
[Diagram: Job Server 1 hosts the active job context and Job Server 2 a standby job context; the servers coordinate via gossip and checkpoint to HDFS for hot failover]
61
62. Thanks for your contributions!
• All of these were community contributed:
–HTTPS and Auth
–saving and retrieving job configuration
–forked JVM per context
• Your contributions are very welcome on Github!
62
64. Why We Needed a Job Server
• Our vision for Spark is as a multi-team big data service
• What gets repeated by every team:
• Bastion box for running Hadoop/Spark jobs
• Deploys and process monitoring
• Tracking and serializing job status, progress, and job results
• Job validation
• No easy way to kill jobs
• Polyglot technology stack - Ruby scripts run jobs, Go services
66. Completely Async Design
✤ http://spray.io - probably the fastest JVM HTTP
microframework
✤ Akka Actor based, non blocking
✤ Futures used to manage individual jobs. (Note that
Spark is using Scala futures to manage job stages now)
✤ Single JVM for now, but easy to distribute later via
remote Actors / Akka Cluster
67. Async Actor Flow
[Diagram: the Spray web API hands requests to a Request actor, which reaches the Job Manager via a Local Supervisor; the Job Manager runs jobs as futures (Job 1, Job 2) and reports to the Job Status and Job Result actors]
70. Metadata Store
✤ JarInfo, JobInfo, ConfigInfo
✤ JobSqlDAO. Store metadata to SQL database by JDBC interface.
✤ Easily configured by spark.sqldao.jdbc.url
✤ jdbc:mysql://dbserver:3306/jobserverdb
✤ Multiple Job Servers can share the same MySQL.
✤ Jars uploaded once but accessible by all servers.
✤ The default will be JobSqlDAO and H2.
✤ Single H2 DB file. Serialization and deserialization are handled by H2.
71. Deployment and Metrics
✤ spark-jobserver repo comes with a full suite of tests and
deploy scripts:
✤ server_deploy.sh for regular server pushes
✤ server_package.sh for Mesos and Chronos .tar.gz
✤ /metricz route for codahale-metrics monitoring
✤ /healthz route for health check
72. Challenges and Lessons
• Spark is based around contexts - we need a Job Server oriented around logical
jobs
• Running multiple SparkContexts in the same process
• Better long term solution is forked JVM per SparkContext
• Workaround: spark.driver.allowMultipleContexts = true
• Dynamic jar and class loading is tricky
• Manage threads carefully - each context uses lots of threads
Editor's Notes
There are many ways to run Spark. There is a standalone cluster mode, and it can run under Mesos, or YARN.
We will have 3 presentations at the upcoming Spark Summit, please be sure to watch for us.
Assuming it’s a Scala project…..
Every job has a Config object passed in.
Have the option to choose at job start time, whether to run in own context or shared context.
Pretty cool, I can submit a diagnostic jar to diagnose a problem.
Complex queries GROUP BY, filters, etc.
Here we have the workflow for low-latency jobs which share a single SparkContext.
Unlike the ad-hoc job, here we explicitly create a context. With standalone / coarse mode, guarantee/fix CPU and memory resources.
We run a job to load the data and populate the SparkContext RDDs.
An API server submits query jobs against the job server, which executes them and returns JSON results in the same request.
Jobs can be separate jars, written by separate teams, independently loaded, run, tested.