This document discusses Airbnb's data infrastructure and use of AirStream. It describes how AirStream provides a unified platform for both streaming and batch data processing using Spark SQL and a shared state store in HBase. Case studies show how AirStream is used for real-time data ingestion from Kafka to HBase, streaming exports from databases to HBase, and point-in-time queries. The document also covers how AirStream scales jobs using YARN, provides fault tolerance through checkpointing and job restarts, and monitors jobs with AirStream listeners.
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (Databricks)
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.
Mobius is a C# binding for Apache Spark that allows .NET developers to build Spark applications using C#. It enables reusing existing .NET code and libraries in Spark and makes C# a first-class language for Spark. Mobius integrates with the Spark runtime by launching C# worker processes that communicate with the Java Virtual Machine to execute C# transformations and actions on RDDs in a pipelined fashion for better performance.
This document summarizes a presentation given at Spark Summit 2016 about tools and techniques used at Uber for Spark development and jobs. It introduces SCBuilder for encapsulating cluster environments, Kafka dispersal for publishing RDD results to Kafka, and SparkPlug for kickstarting job development with templates. It also discusses SparkChamber for distributed log debugging and future work including analytics, machine learning, and resource usage auditing.
Re-Architecting Spark For Performance Understandability (Jen Aman)
The document discusses a new architecture called "monotasks" that is designed to make it easier to reason about the performance of Apache Spark jobs. The key aspects are:
1) Each task in a Spark job is dedicated to using a single resource (e.g. network, CPU, disk).
2) Dedicated schedulers control contention between tasks for resources.
3) The timing of individual "monotasks" can be used to model an ideal performance and understand bottlenecks.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi... (Databricks)
This document summarizes key aspects of structuring computation and data in Apache Spark using SQL, DataFrames, and Datasets. It discusses how structuring computation and data through these APIs enables optimizations like predicate pushdown and efficient joins. It also describes how data is encoded efficiently in Spark's internal format and how encoders translate between domain objects and Spark's internal representations. Finally, it introduces structured streaming as a high-level streaming API built on top of Spark SQL that allows running the same queries continuously on streaming data.
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das (Databricks)
“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-of-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.
Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming"
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.
Follow T.D. on -
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas
Huawei Advanced Data Science With Spark Streaming (Jen Aman)
This document discusses streamDM, an open source machine learning library for stream mining in Spark Streaming. It summarizes streamDM's capabilities for incremental learning on data streams using algorithms like SGD, Naive Bayes, clustering and decision trees. Examples of using streamDM in Huawei's network alarm analysis and fault localization systems are provided, demonstrating improvements in efficiency, accuracy and ability to handle large volumes of streaming data. The document encourages researchers to apply for Huawei's Innovation Research Program grants to further collaborative work on stream mining algorithms and applications.
Huohua: A Distributed Time Series Analysis Framework For Spark (Jen Aman)
This document summarizes Wenbo Zhao's presentation on Huohua, a distributed time series analysis framework for Spark. Huohua addresses issues with existing time series solutions by introducing the TimeSeriesRDD data structure that preserves temporal ordering across operations like grouping and temporal join. The group function groups time series locally without shuffling to maintain order, and temporal join uses partitioning to perform localized stream joins across partitions.
The document discusses the Datastax Spark Cassandra Connector. It provides an overview of how the connector allows Spark to interact with Cassandra data, including performing full table scans, pushing down filters and projections to Cassandra, distributed joins using Cassandra's partitioning, and writing data back to Cassandra in a distributed way. It also highlights some recent features of the connector like support for Cassandra 3.0, materialized views, and performance improvements from the Java Wildcard Cassandra Tester project.
This document describes Drizzle, a low latency execution engine for Apache Spark. It addresses the high overheads of Spark's centralized scheduling model by decoupling execution from scheduling through batch scheduling and pre-scheduling of shuffles. Microbenchmarks show Drizzle achieves milliseconds latency for iterative workloads compared to hundreds of milliseconds for Spark. End-to-end experiments show Drizzle improves latency for streaming and machine learning workloads like logistic regression. The authors are working on automatic batch tuning and an open source release of Drizzle.
Big Data in Production: Lessons from Running in the CloudJen Aman
This document discusses best practices for running big data workloads in production on the cloud. It emphasizes that production systems require scalability, high availability, maintainability, and evolvability. The document also discusses challenges such as security, automation, and efficiency. It recommends leveraging cloud services like AWS to handle tasks like provisioning, monitoring, billing and cost optimization in order to focus on the core data and analytics workloads.
Bolt is a distributed ndarray built on PySpark that conforms to NumPy's API. It can handle large multidimensional datasets exceeding 1 TB in size with up to 10^11 elements. By distributing arrays across a cluster, Bolt enables operations on large neuroscience, astronomy, geospatial and climate science datasets that would be impossible on a single machine due to limits of memory and processing power. Key features include indexing, slicing, transpose, and applying functions along axes.
Livy is an open source REST service for interacting with and managing Spark contexts and jobs. It allows clients to submit Spark jobs via REST, monitor their status, and retrieve results. Livy manages long-running Spark contexts in a cluster and supports running multiple independent contexts simultaneously from different clients. It provides client APIs in Java, Scala, and soon Python to interface with the Livy REST endpoints for submitting, monitoring, and retrieving results of Spark jobs.
A Journey into Databricks' Pipelines: Journey and Lessons Learned (Databricks)
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming (Jen Aman)
This document discusses building real-time data pipelines with Kafka Connect and Spark Streaming. It introduces Kafka Connect as a tool for large-scale streaming data import and export for Kafka. Kafka Connect uses connectors to move data between Kafka and other data systems in a scalable, parallel, and fault-tolerant manner. It then discusses how Kafka Connect can be used together with Spark Streaming to provide real-time data integration capabilities.
700 Queries Per Second with Updates: Spark As A Real-Time Web Service (Spark Summit)
This document discusses using Apache Spark to enable low-latency web queries through a persistent Spark context. It introduces FiloDB, a distributed, versioned, columnar analytics database built on Spark that allows for fast, updatable queries through efficient in-memory columnar storage and filtering. The document demonstrates running over 700 SQL queries per second on a dataset of 15 million NYC taxi records loaded into FiloDB through caching of SQL parsing and use of Spark's collectAsync to enable asynchronous query execution.
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale (Jen Aman)
This document discusses Netflix's use of Spark on YARN for ETL workloads. Some key points:
- Netflix runs Spark on YARN across 3000 EC2 nodes to process large amounts of streaming data from over 125 million hours watched per day.
- Technical challenges included optimizing Parquet filtering, output committers, broadcast joins, and the Spark History Server. These improvements yielded speedups of 8-18x.
- Production Spark applications at Netflix include recommendation engines that analyze user profiles and personalize content, processing petabytes of data in hours rather than days.
Interactive Visualization of Streaming Data Powered by Spark (Spark Summit)
This document discusses interactive data visualization of streaming data powered by Spark. It addresses the challenges of streaming data including time, frequency, retention, synchronization and order of data. Zoomdata uses Spark Streaming to receive and manipulate streaming data in memory on a single JVM. The data can then be held in a buffer like MongoDB and interacted with via custom code. This allows for contextual expressiveness with streaming data and independent scalability. Future work may include cross stream synchronization, on-demand scaling on Mesos, and more extensible landing strategies.
Large Scale Deep Learning with TensorFlow (Jen Aman)
Large-scale deep learning with TensorFlow allows storing and performing computation on large datasets to develop computer systems that can understand data. Deep learning models like neural networks are loosely based on what is known about the brain and become more powerful with more data, larger models, and more computation. At Google, deep learning is being applied across many products and areas, from speech recognition to image understanding to machine translation. TensorFlow provides an open-source software library for machine learning that has been widely adopted both internally at Google and externally.
Vskills Certified HTML5 Developer notes cover the following topics.
HTML5
Introduction
History
HTML Versions
HTML5 Enhancements
Elements, Tags and Attributes
Head and body tags
HTML Editor
Create a web page
Viewing the Source
White Space and Flow
HTML Comments
HTML Meta Tags
HTML Attributes
XHTML First Line
DTD (Document Type Declaration)
HTML5 new Doctype and Charset
Special Characters
Capitalization
Quotations
Nesting
Spacing and Breaks
HTML5 Global attributes
http://www.vskills.in/certification/Web-Development/Certified-HTML5-Developer
The document outlines the various architectures that make up a solution architecture for Sunpower, including business architecture, information architecture, infrastructure architecture, data architecture, integration architecture, and service architecture. Business architecture defines the business objectives, strategy, capabilities, processes, and structure. Information architecture shows how data will be captured from various social media and legacy systems and stored in a data lake using column families and denormalized tables. Infrastructure architecture and data architecture are also included as key components of the overall solution architecture.
- Apache Spark is an open-source cluster computing framework that provides a fast, general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs) that allow in-memory processing for speed.
- The document discusses Spark's key concepts like transformations, actions, and directed acyclic graphs (DAGs) that represent Spark job execution. It also summarizes Spark SQL, MLlib, and Spark Streaming modules.
- The presenter is a solutions architect who provides an overview of Spark and how it addresses limitations of Hadoop by enabling faster, in-memory processing using RDDs and a more intuitive API compared to MapReduce.
Developing Apache Spark Jobs in .NET using Mobius (shareddatamsft)
Slides used for the talk "Developing Apache Spark Jobs in .NET using Mobius" at dotnetfringe 2016 (http://lanyrd.com/2016/netfringe/sfcxpx).
Apache Spark is an open source data processing framework built for big data processing and analytics. Ease of programming, high performance relative to traditional big data tools and platforms, and a unified API for solving a diverse set of complex data problems drove the rapid adoption of Spark in the industry. Apache Spark APIs in Scala, Java, Python and R cater to a wide range of big data professionals and a variety of functional roles. Mobius is an open source project that aims to bring Spark's rich set of capabilities to the .NET community. The Mobius project added C# as another first-class programming language for Apache Spark and currently supports the RDD, DataFrame and Streaming APIs. With Mobius, developers can build Spark jobs in C# and reuse their existing .NET libraries with Apache Spark. Mobius is open-sourced at http://github.com/Microsoft/Mobius. The project has received great support from the .NET community and positive feedback from Spark enthusiasts.
Big Data Scala by the Bay: Interactive Spark in your Browser (gethue)
Supporting running Spark scripts directly from a browser improves the user experience: everybody has a web browser, the command line can be avoided, and built-in graphing and visualization make it easy to explore and understand data with just a few clicks. This also simplifies administration, as everything becomes centralized in a service and is accessible by non-native clients. For this purpose, an open source Spark Job Server was developed in order to provide Scala, SQL and Python in a Web shell. The main Hadoop components of the platform are also integrated in the same interface. This talk describes the architecture of the Spark Server and its main features: # Scala, Python, SQL submissions # Impersonation # Security # Job progress / canceling # YARN / HDFS / Hive integration. The server also ships with a friendly user interface built as a Hue app. We will focus on explaining how they were built, how to use the API and which lessons were learned. The final end user interaction will be live demoed.
Highlights and Challenges from Running Spark on Mesos in Production by Morri ... (Spark Summit)
This document discusses AppsFlyer's experience running Spark on Mesos in production for retention data processing and analytics. Key points include:
- AppsFlyer processes over 30 million installs and 5 billion sessions daily for retention reporting across 18 dimensions using Spark, Mesos, and S3.
- Challenges included timeouts and errors when using Spark's S3 connectors due to the eventual consistency of S3, which was addressed by using more robust connectors and configuration options.
- A coarse-grained Mesos scheduling approach was found to be more stable than fine-grained, though it has limitations like static core allocation that future Mesos improvements may address.
- Tuning jobs for coarse-
This document discusses analytical functions in databases, specifically the percentile window function. The percentile window function calculates the value at a specified percentile of the values within each group of a window or partition. For example, a query is shown that uses the PERCENTILE_CONT function to find the median salary value for each department by calculating the 0.5 percentile of salaries within each department, ordered descending by salary.
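To make the semantics concrete, here is a small, self-contained Scala sketch of the linear interpolation PERCENTILE_CONT performs, with invented sample data; in a real database this would be the SQL window function described above.

def percentileCont(values: Seq[Double], p: Double): Double = {
  val sorted = values.sorted
  val rank = p * (sorted.length - 1)            // continuous (interpolated) rank
  val (lo, hi) = (rank.floor.toInt, rank.ceil.toInt)
  sorted(lo) + (rank - lo) * (sorted(hi) - sorted(lo))
}

// Median (0.5 percentile) salary per department, on made-up rows:
val salaries = Seq(("eng", 100.0), ("eng", 120.0), ("eng", 140.0),
                   ("sales", 80.0), ("sales", 90.0))
val medians = salaries.groupBy(_._1).map { case (dept, rows) =>
  dept -> percentileCont(rows.map(_._2), 0.5)
}
println(medians)   // Map(eng -> 120.0, sales -> 85.0)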
This document discusses how to implement operations like selection, joining, grouping, and sorting in Cassandra without SQL. It explains that Cassandra uses a nested data model to efficiently store and retrieve related data. Operations like selection can be performed by creating additional column families that index data by fields like birthdate and allow fast retrieval of records by those fields. Joining can be implemented by nesting related entity data within the same column family. Grouping and sorting are also achieved through additional indexing column families. While this requires duplicating data for different queries, it takes advantage of Cassandra's strengths in scalable updates.
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia (Jen Aman)
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Snorkel: Dark Data and Machine Learning with Christopher Ré (Jen Aman)
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
This document summarizes a presentation about deep learning workflows and best practices on Apache Spark. It discusses how deep learning fits within broader data pipelines for tasks like training and transformation. It also outlines recurring patterns for integrating Spark and deep learning frameworks, including using Spark for data parallelism and embedding deep learning transforms. The presentation provides tips for developers on topics like using GPUs with PySpark and monitoring deep learning jobs. It concludes by discussing challenges in the areas of distributed deep learning and Spark integration.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
RISELab: Enabling Intelligent Real-Time Decisions (Jen Aman)
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Spatial Analysis On Histological Images Using Spark (Jen Aman)
This document describes using Spark for spatial analysis of histological images to characterize the tumor microenvironment. The goal is to provide actionable data on the location and density of immune cells and blood vessels. Over 100,000 objects are annotated in each whole slide image. Spark is used to efficiently calculate over 5 trillion pairwise distances between objects within a neighborhood window. This enables profiling of co-localization and spatial clustering of objects. Initial results show the runtime scales linearly with the number of objects. Future work includes integrating clinical and genomic data to characterize variation between tumor types and patients.
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec... (Jen Aman)
Kyle Foreman presented on using Spark for large-scale global health simulations. The talk discussed (1) the motivation for simulations to model disease burden forecasts and alternative scenarios, (2) SimBuilder for constructing modular simulation workflows as directed acyclic graphs, and (3) benchmarks showing Spark backends can efficiently distribute simulations across a cluster. Future work aims to optimize Spark DataFrame joins and take better advantage of Numpy's vectorization for panel data simulations.
A Graph-Based Method For Cross-Entity Threat Detection (Jen Aman)
This document proposes a graph-based method for cross-entity threat detection. It models entity relationships as a multigraph and detects anomalies by identifying unexpected new connections between entities over time. It introduces two algorithms: a naive detector that identifies edges only in the detection graph, and a 2nd-order detector that identifies edges between entity clusters. An experiment on a real dataset found around 700 1st-order and 200 2nd-order anomalies in under 5 minutes, demonstrating the method's ability to efficiently detect threats across unrelated accounts.
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark (Jen Aman)
This document introduces Yggdrasil, a new approach for training decision trees in Spark that partitions data by column instead of by row. Column partitioning significantly reduces communication costs for deep trees with many features. Evaluation on real-world datasets with millions of rows and thousands of features shows Yggdrasil achieves up to 24x speedup over the existing row-partitioning approach in Spark MLlib. The authors propose merging Yggdrasil into Spark MLlib to provide both row and column partitioning options for optimal performance on different problem sizes and depths.
Time-Evolving Graph Processing On Commodity Clusters (Jen Aman)
Tegra is a system for efficiently processing time-evolving graphs on commodity clusters. It uses a distributed graph snapshot index to represent and retrieve multiple snapshots of evolving graphs. It introduces a timelapse abstraction to perform temporal analytics on windows of snapshots, avoiding redundant computation. Tegra supports both bulk and incremental graph computations using this representation, allowing results to be reused when graphs are updated. An evaluation on real-world graphs shows Tegra can store more snapshots in memory and reduce computation time compared to baseline approaches.
Re-Architecting Spark For Performance Understandability (Jen Aman)
The document describes a new architecture called "monotasks" for Apache Spark that aims to make reasoning about Spark job performance easier. The monotasks architecture decomposes Spark tasks so that each task uses only one resource (e.g. CPU, disk, network). This avoids issues where Spark tasks bottleneck on different resources over time or experience resource contention. With monotasks, dedicated schedulers control resource contention and monotask timing data can be used to model ideal performance. Results show monotasks match Spark's performance and provide clearer insight into bottlenecks.
Efficient State Management With Spark 2.0 And Scale-Out Databases (Jen Aman)
This document discusses efficient state management with Spark 2.0 and scale-out databases. It introduces SnappyData, an open source project that provides a unified in-memory database for streams, transactions, and OLAP queries to enable real-time operational analytics. SnappyData extends Spark by localizing state management and processing to avoid shuffles, supports approximate query processing for interactive queries, and provides a unified cluster architecture for OLTP, OLAP and streaming workloads.
GPU Computing With Apache Spark And Python (Jen Aman)
GPU Computing With Apache Spark And Python
- Python is a popular language for data science and analytics due to its large ecosystem of libraries and ease of use, but it is slow for number crunching tasks. GPU computing is a way to accelerate Python workloads.
- This presentation demonstrates using GPUs with Apache Spark and Python through libraries like Accelerate, which provides drop-in GPU-accelerated functions, and Numba, which can compile Python functions to run on GPUs.
- As an example, the task of image registration, which involves computationally expensive 2D FFTs, is accelerated using these GPU libraries within a PySpark job, achieving a 2-4x speedup over CPU-only versions.
Building Custom Machine Learning Algorithms With Apache SystemML (Jen Aman)
This document discusses Apache SystemML, which is a machine learning framework for building custom machine learning algorithms on Apache Spark. It originated from research projects at IBM involving machine learning on Hadoop. SystemML aims to allow data scientists to build ML solutions using languages like R and Python, while executing algorithms on big data platforms like Spark. It provides a high-level language for expressing algorithms and performs automatic parallelization and optimization. The document demonstrates SystemML through a matrix factorization example for a targeted advertising problem. It shows how to wrangle data, build a custom algorithm, and get results. In conclusion, it recommends that readers try out SystemML through its website.
This document summarizes a talk on Spark on Mesos given by Dean Wampler from Lightbend and Timothy Chen from Mirantis. It discusses why Spark on Mesos is useful by allowing one cluster system to run multiple tools. It then covers recent updates like a new integration test suite, coarse-grained scheduler improvements, and Mesos framework authentication. Finally, it outlines future plans such as GPU support on Mesos and making "production" use of Spark on Mesos easier.
Elasticsearch And Apache Lucene For Apache Spark And MLlib (Jen Aman)
This document summarizes a presentation about using Elasticsearch and Lucene for text processing and machine learning pipelines in Apache Spark. Some key points:
- Elasticsearch provides text analysis capabilities through Lucene and can be used to clean, tokenize, and vectorize text for machine learning tasks.
- Elasticsearch integrates natively with Spark through Java/Scala APIs and allows indexing and querying data from Spark.
- A typical machine learning pipeline for text classification in Spark involves tokenization, feature extraction (e.g. hashing), and a classifier like logistic regression.
- The presentation proposes preparing text analysis specifications in Elasticsearch once and reusing them across multiple Spark pipelines to simplify the workflows and avoid data movement between systems; a sketch of such a pipeline follows.
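A typical version of that text-classification pipeline in Spark MLlib's Scala API looks roughly like this; the column names and the training DataFrame are placeholders, not from the talk.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// val model = pipeline.fit(trainingDF)   // trainingDF: DataFrame with "text" and "label" columns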
Spark at Bloomberg: Dynamically Composable Analytics (Jen Aman)
The Bloomberg Spark Server provides a platform for dynamic composable analytics on Spark. It includes a Function Transform Registry (FTR) of reusable analytics functions. It also includes a Managed DataFrame Registry (MDFR) that allows request processors to lookup and compose analytics on named DataFrames. The request processors handle analytic requests by looking up functions from the FTR and composing them on MDFs from the MDFR. The Spark Server addresses challenges of standalone Spark apps like data redundancy and cross-asset analytics. It also handles real-time ingestion and memory management to support online analytics on Spark.
EclairJS allows developers to use JavaScript and Node.js to interact with Apache Spark for large-scale data processing and analytics. It provides a Spark API for Node.js so that compute-intensive workloads can be handed off to Spark running in the backend. EclairJS also enables the use of JavaScript with Jupyter notebooks, so data engineers and web developers can experiment with Spark from within the browser using familiar JavaScript syntax.
This document summarizes a presentation given at Spark Summit 2016 about using Spark for real-time data processing and analytics at Uber and Marketplace Data. Some key points:
- Uber generates large amounts of data across its 70+ countries and 450+ cities that is used for real-time processing, analytics, and forecasting.
- Marketplace Data uses Spark for real-time data processing, analytics, and forecasting of Uber's data, which involves challenges like complex event processing, geo aggregation, and querying large and streaming datasets.
- Jupyter notebooks are used to empower users and data scientists to work with Spark in a flexible way, though challenges remain around reliability, freshness, and isolating queries.
This presentation is about healthcare analysis using sentiment analysis. It is especially useful for students working on sentiment analysis projects.
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr... (Marlon Dumas)
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled to discover high-fidelity digital twins of end-to-end processes from event data.
06-18-2024-Princeton Meetup-Introduction to Milvus (Timothy Spann)
06-18-2024-Princeton Meetup-Introduction to Milvus
tim.spann@zilliz.com
https://www.linkedin.com/in/timothyspann/
https://x.com/paasdev
https://github.com/tspannhw
https://github.com/milvus-io/milvus
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/142-17June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Expand LLMs' knowledge by incorporating external data sources into your LLM-powered AI applications.
Interview Methods - Marital and Family Therapy and Counselling - Psychology S... (PsychoTech Services)
A proprietary approach developed by bringing together the best of learning theories from psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, enabling you to learn better, faster!
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of May 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases (Timothy Spann)
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases, and we will see how they differ from traditional databases: in which cases you need one, and in which you probably don't. I will also go over similarity search, where vectors come from, and an example of a vector database architecture, wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI, and vector database demos you needed for now. If not, there's a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve, and what I should show next. Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City, or here in the YouTube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
Top 5 Lessons Learned in Building Streaming Applications at Microsoft Bing Scale
1. Top Five Lessons Learned in Building Streaming Applications at Microsoft Bing Scale
Renyi Xiong
Microsoft
2. Bing Scale Problem – Log Merging
[Architecture diagram: raw logs from Kafka in every data center (Edge 3.2 GB/s, UI-Layout 2.8 GB/s, Click 69 MB/s) flow through the Databus into the Event Merge Pipeline, which joins them over a 10-minute application-time window into a merged log.]
• Merge Bing query events with click events
• Lambda architecture: batch and stream processing share the same C# library
• Spark Streaming in C#
4. Lesson 1: Use UpdateStateByKey to join DStreams
• Problem
• Application time is not supported in Spark 1.6.
• Solution – UpdateStateByKey
• UpdateStateByKey takes a custom JoinFunction as an input parameter;
• The custom JoinFunction enforces a time window based on application time;
• UpdateStateByKey maintains partially joined events as the state
[Diagram: the Edge and Click DStreams feed batch jobs producing RDDs at times 1, 2 and 3; UpdateStateByKey carries the partially joined events forward in a State DStream from batch to batch.]
Pseudo Code
// Runs once per partition (pid); for each key, joins newly arrived events
// with the partially joined state carried over from earlier batches.
Iterator[(K, S)] JoinFunction(
    int pid, Iterator[(K key, Iterator[V] newEvents, S oldState)] events)
{
    foreach (var e in events) {
        val currentTime = e.newEvents.max(ev => ev.eventTime);
        val newState = <e.oldState joined with e.newEvents>;
        if (e.oldState.min(s => s.eventTime) + TimeWindow < currentTime)   // TimeWindow = 10 minutes
            <emit joined events to external storage>
        else
            yield return (e.key, newState);   // carry over as state for the next batch
    }
}
UpdateStateByKey C# API
PairDStreamFunctions.cs, https://github.com/Microsoft/Mobius
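For readers on stock Spark Streaming, a minimal Scala sketch of the same idea using the standard updateStateByKey signature follows; the Event type, the input streams and the sink call are illustrative, not the Mobius implementation.

// Buffer events per key as state; flush a key once its oldest event falls
// outside the 10-minute application-time window.
case class Event(eventTime: Long, payload: String)   // illustrative type

val windowMs = 10 * 60 * 1000L

def update(newEvents: Seq[Event], state: Option[Seq[Event]]): Option[Seq[Event]] = {
  val buffered = state.getOrElse(Seq.empty) ++ newEvents
  val times = buffered.map(_.eventTime)
  if (times.min + windowMs < times.max) {
    // window closed: emit the joined events downstream and drop the state
    // emitToExternalStorage(buffered)   // hypothetical sink
    None
  } else {
    Some(buffered)   // keep the partial join for the next batch
  }
}

// edgeStream, clickStream: DStream[(K, Event)] read from the two Kafka topics
val joined = edgeStream.union(clickStream).updateStateByKey(update _)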
5. Lesson 2: Dynamic Repartition with Kafka Direct Approach
• Problem
• Unbalanced Kafka partitions caused delay in the pipeline
• Solution – Dynamic Repartition
1. Repartition data from one Kafka partition into multiple RDD partitions without the extra shuffling cost of DStream.Repartition
2. Repartition threshold is configurable per topic
Pseudo Code
class DynamicPartitionKafkaRDD(kafkaPartitionOffsetRanges) {
  override def getPartitions = {
    // repartition threshold per topic, loaded from config
    val maxRddPartitionSize: Map[Topic, Long] = <loaded from config>
    // split each Kafka offset range into sub-ranges no larger than the threshold
    kafkaPartitionOffsetRanges.flatMap { case o =>
      val rddPartitionSize = maxRddPartitionSize(o.topic)
      (o.fromOffset until o.untilOffset by rddPartitionSize).map(s =>
        (o.topic, o.partition, s, math.min(o.untilOffset, s + rddPartitionSize)))
    }
  }
}
Source Code
DynamicPartitionKafkaRDD.scala - https://github.com/Microsoft/Mobius
[Charts: batch delay at a 2-minute interval, before and after Dynamic Repartition.]
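The sub-range arithmetic itself is plain Scala and easy to sanity-check in isolation; a self-contained sketch with invented numbers:

// A Kafka offset range [0, 10) with a per-topic threshold of 4 splits into
// three RDD partitions covering [0,4), [4,8) and [8,10).
val fromOffset = 0L
val untilOffset = 10L
val threshold = 4L
val subRanges = (fromOffset until untilOffset by threshold).map(s =>
  (s, math.min(untilOffset, s + threshold)))
println(subRanges)   // Vector((0,4), (4,8), (8,10))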
6. Lesson 3: On-time Kafka fetch job submission
• Problem
• Overall job delay accumulates due to transient slow/hot Kafka broker issues.
• Solution
• Submit the Kafka fetch job on the batch interval in a separate thread, even when the previous batch is delayed.
[Timeline, driver perspective: for each batch (A, B, C), Job#1 (fetch data) is submitted on a new thread at the batch interval, while Job#2 (UpdateStateByKey) and Job#3 (checkpoint) run on the main thread; fetches therefore start on time even when the previous batch is delayed.]
Pseudo Code
class CSharpStateDStream {
  override def compute(validTime) = {
    val lastState = getOrCompute(validTime - batchInterval)
    val rdd = parent.getOrCompute(validTime)
    if (!lastBatchCompleted) {
      // last batch not complete yet: run the fetch-data job in a
      // separate thread to materialize the RDD in the meantime
      rdd.cache()
      threadPool.execute(() => sc.runJob(rdd))
      <wait for the fetch job to complete>
    }
    <compute the UpdateStateByKey DStream>
  }
}
Source Code
CSharpDStream.scala - https://github.com/Microsoft/Mobius
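As a rough sketch of this pattern in plain Scala, using Scala Futures rather than the Mobius internals (names are illustrative):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// Kick off the fetch on a background thread so it overlaps the
// still-running previous batch; foreachPartition forces materialization.
def fetchAsync[T](rdd: RDD[T]): Future[Unit] = Future {
  rdd.cache()
  rdd.foreachPartition(_ => ())
}

// val fetch = fetchAsync(kafkaRdd)
// ... previous batch finishes ...
// Await.result(fetch, Duration.Inf)   // then compute UpdateStateByKey on kafkaRdd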
7. Lesson 4: Parallel Kafka metadata refresh
• Problem
• Fetching Kafka metadata from multiple data centers often takes more time than expected.
• Solution
• Customize DirectKafkaInputDStream: move metadata refresh for each {topic, data-center} to a separate thread
[Timeline, driver perspective: a background thread refreshes Kafka metadata and produces offset ranges 1-3 at times t1-t3, while batch jobs 1-3 are submitted independently on the main thread.]
Pseudo Code
class DynamicPartitionKafkaInputDStream {
  // starts a separate scheduler thread that fires every refreshOffsetsInterval
  refreshOffsetsScheduler.scheduleAtFixedRate(
    <get offset ranges>
    <enqueue offset ranges>
  )
  override def compute {
    <dequeue offset ranges, non-blocking>
    <generate Kafka RDD>
  }
}
Source Code
DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius
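A minimal sketch of the refresh-thread pattern in plain Scala; the OffsetRanges type and fetchLatestOffsetRanges function are hypothetical stand-ins for the Mobius internals.

import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

val refreshOffsetsIntervalMs = 60 * 1000L                         // example interval
val offsetRangeQueue = new ConcurrentLinkedQueue[OffsetRanges]()  // hypothetical type
val refreshOffsetsScheduler = Executors.newSingleThreadScheduledExecutor()

// Background thread: refresh Kafka metadata on a fixed schedule and hand
// the resulting offset ranges over through a concurrent queue.
refreshOffsetsScheduler.scheduleAtFixedRate(
  () => offsetRangeQueue.offer(fetchLatestOffsetRanges()),        // hypothetical fetch
  0L, refreshOffsetsIntervalMs, TimeUnit.MILLISECONDS)

def compute(): Unit = {
  val ranges = offsetRangeQueue.poll()   // non-blocking; null if nothing is ready
  if (ranges != null) {
    // generate the Kafka RDD from the pre-fetched ranges
  }
}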
8. Lesson 5: Parallel Kafka metadata refresh + RDD materialization
• Problem
• Kafka data fetch and data processing not in parallel
• Solution
• Take Lesson 4 further
• In the metadata refresh thread, materialize and cache the Kafka RDD
[Timeline, driver perspective: as in Lesson 4, but each Kafka offset range is also materialized and cached (RDD.Cache) on the refresh thread before the corresponding batch job is submitted on the main thread.]
Pseudo Code
class DynamicPartitionKafkaInputDStream {
  // starts a separate scheduler thread that fires every refreshOffsetsInterval
  refreshOffsetsScheduler.scheduleAtFixedRate(
    <get offset ranges>
    <generate Kafka RDD>
    // materialize and cache on the refresh thread
    sc.runJob(kafkaRdd.cache)
    <enqueue Kafka RDD>
  )
  override def compute {
    <dequeue Kafka RDD, non-blocking>
  }
}
Source Code
DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius
9. THANK YOU.
• Special thanks to TD and Ram from Databricks for all the support
• Contact us
• Renyi Xiong, renyix@microsoft.com
• Kaarthik Sivashanmugam, ksivas@microsoft.com
• mobiuscore@microsoft.com