Spark is an open source cluster computing framework that can outperform Hadoop by 30x through a combination of in-memory computation and a richer execution engine. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. We will also discuss the next major feature we are developing, Spark Streaming, which adds support for low-latency stream processing to Spark, giving users a unified interface for batch and real-time analytics.
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
1. Spark and Shark
High-Speed In-Memory Analytics
over Hadoop and Hive Data
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
UC Berkeley
spark-project.org
2. What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop’s storage APIs
» Can read/write to any Hadoop-supported system,
including HDFS, HBase, SequenceFiles, etc
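For instance, reading and writing Hadoop-backed storage needs no special connectors. A minimal sketch, using the present-day Spark API and hypothetical HDFS paths (the 2012 package names behind these slides differ slightly):

import org.apache.spark.{SparkConf, SparkContext}

object HadoopIO {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HadoopIO"))

    // Read plain text from HDFS (any Hadoop-supported FileSystem URI works here)
    val logs = sc.textFile("hdfs://namenode:8020/logs/*.log")

    // Read a SequenceFile of (String, String) records through Hadoop's input formats
    val pairs = sc.sequenceFile[String, String]("hdfs://namenode:8020/data/seqfile")

    // Write results back through Hadoop's output formats
    logs.filter(_.contains("ERROR")).saveAsTextFile("hdfs://namenode:8020/out/errors")

    println(pairs.count())
    sc.stop()
  }
}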
3. What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores,
and queries (HiveQL, UDFs, etc)
Similar speedups of up to 40x
4. Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare,
Conviva, Quantifind, Yahoo! Research & others
250+ member meetup, 500+ watchers on GitHub
6. Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (graph
algorithms, machine learning)
» More interactive ad-hoc queries
» More real-time online processing
All three of these apps require fast data sharing
across parallel jobs
7. Data Sharing in MapReduce
[Diagram: with MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS, and every interactive query re-reads the input from HDFS.]
Slow due to replication, serialization, and disk IO
8. Data Sharing in Spark
[Diagram: the input is loaded into distributed memory once; iterations and one-time-processed queries then read it from memory instead of HDFS.]
10-100× faster than network and disk
9. Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached
in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala console
10. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count       // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block and keeps the resulting partition in its cache.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
11. Fault Tolerance
RDDs track the series of transformations used to
build them (their lineage) to recompute lost data
E.g: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
Lineage: HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))
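The lineage the slide describes can be inspected directly. A minimal sketch, assuming a SparkContext named sc and a hypothetical HDFS path:

// Builds the HadoopRDD -> FilteredRDD -> MappedRDD chain shown above
val messages = sc.textFile("hdfs://namenode:8020/logs")
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

// toDebugString prints the chain of transformations Spark would replay
// to recompute a lost partition
println(messages.toDebugString)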
12. Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
var w = Vector.random(D)                                 // initial parameter vector

for (i <- 1 to ITERATIONS) {                             // repeated MapReduce steps
  val gradient = data.map(p =>                           // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
13. Logistic Regression Performance
[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30). Hadoop takes about 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for each further iteration.]
14. Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
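As a small illustration of how these operators compose, here is a hedged sketch (assuming a SparkContext named sc and made-up input files) that joins per-URL view counts with page titles:

// (url, 1) pairs from a hypothetical access log, aggregated with reduceByKey
val views = sc.textFile("hdfs://.../access.log")
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)

// (url, title) pairs from a hypothetical metadata file
val titles = sc.textFile("hdfs://.../titles.tsv")
  .map { line => val f = line.split("\t"); (f(0), f(1)) }

// join, filter, and take are operators from the table above
val popular = views.join(titles)
  .filter { case (_, (count, _)) => count > 100 }
  .take(10)

popular.foreach(println)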
15. Other Engine Features
General graphs of operators (→ efficiency)
[Diagram: a job expressed as a DAG of RDDs A through G connected by groupBy, map, union, and join operators and divided into Stages 1-3; some of the RDDs are cached.]
16. Other Engine Features
Controllable data partitioning to minimize
communication
PageRank Performance
[Chart: iteration time (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23.]
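To make the controlled-partitioning idea concrete, here is a minimal, hedged sketch (current Spark API, hypothetical link data): hash-partitioning the links once and reusing the same partitioner for the ranks keeps each join local instead of reshuffling the link table on every PageRank iteration.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)

// (page, outgoing links), partitioned once and cached in memory
val links = sc.textFile("hdfs://.../links.tsv")
  .map { line => val f = line.split("\t"); (f(0), f(1).split(",").toSeq) }
  .partitionBy(partitioner)
  .cache()

// mapValues preserves the parent's partitioner, so ranks is co-partitioned with links
var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (outLinks, rank) => outLinks.map(dest => (dest, rank / outLinks.size))
  }
  // Reducing with the same partitioner keeps the next iteration's join shuffle-free
  ranks = contribs.reduceByKey(partitioner, _ + _).mapValues(0.15 + 0.85 * _)
}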
19. Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
...
20. Conviva GeoReport
[Chart: time (hours): Hive 20, Spark 0.5.]
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated
reading, deserialization and filtering
21. Quantifind Feed Analysis
[Diagram: data feeds are parsed into documents, entities are extracted, and in-memory time series are built in Spark; a web app issues queries against them.]
Load data feeds, extract entities, and compute
in-memory tables every few minutes
Let users drill down interactively from AJAX app
22. Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm
scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
24. Motivation
Hive is great, but Hadoop’s execution engine
makes even the smallest queries take minutes
Scala is good for programmers, but many data
users only know SQL
Can we extend Hive to run on Spark?
25. Hive Architecture
[Diagram: Client (CLI, JDBC) talks to the Driver, which runs the SQL Parser, Query Optimizer, Physical Plan, and Execution against a Metastore; the physical plan runs as MapReduce jobs over HDFS.]
26. Shark Architecture
[Diagram: the same architecture, except the Driver gains a Cache Manager and the physical plan executes on Spark instead of MapReduce, over HDFS.]
[Engle et al, SIGMOD 2012]
27. Efficient In-Memory Storage
Simply caching Hive records as Java objects is
inefficient due to high per-object overhead
Instead, Shark employs column-oriented
storage using arrays of primitive types
Row storage: [1, john, 4.1], [2, mike, 3.5], [3, sally, 6.4]
Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
28. Efficient In-Memory Storage
Simply caching Hive records as Java objects is
inefficient due to high per-object overhead
Instead, Shark employs column-oriented
storage using arrays of primitive types
Row storage: [1, john, 4.1], [2, mike, 3.5], [3, sally, 6.4]
Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
Benefit: similarly compact size to serialized data, but >5x faster to access
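A tiny sketch of the idea in plain Scala (illustrative only, not Shark's actual classes): the row layout stores one object per record, while the column layout stores each field in its own array, so numeric fields stay as unboxed primitives.

// Row layout: one object (with boxed/variable-size fields) per record
case class UserRow(id: Int, name: String, score: Double)
val rows = Array(UserRow(1, "john", 4.1), UserRow(2, "mike", 3.5), UserRow(3, "sally", 6.4))

// Column layout: one array per field; Int and Double columns are primitive arrays
class UserColumns(val ids: Array[Int], val names: Array[String], val scores: Array[Double])
val cols = new UserColumns(Array(1, 2, 3), Array("john", "mike", "sally"), Array(4.1, 3.5, 6.4))

// Scanning one column touches a compact primitive array instead of every row object
val avgScore = cols.scores.sum / cols.scores.length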
29. Using Shark
CREATE TABLE mydata_cached AS SELECT …
Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark
Early alpha release at shark.cs.berkeley.edu
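The "call from Scala" path looked roughly like the sketch below. This assumes the alpha-era SharkContext and its sql2rdd-style entry point; treat the names as an assumption rather than the definitive API.

// Hypothetical sketch: run HiveQL through Shark and keep working with the result in Spark
val sharkCtx: shark.SharkContext = ...   // assumed entry point from the Shark alpha

// sql2rdd (as described for the alpha) returns the query result as an ordinary RDD,
// so any Spark operation can be applied to it afterwards
val result = sharkCtx.sql2rdd("SELECT * FROM mydata_cached WHERE views > 100")
println(result.count())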
30. Benchmark Query 1
SELECT * FROM grep WHERE field LIKE '%XYZ%';
Execution time: Shark (cached) 12s, Shark 182s, Hive 207s
31. Benchmark Query 2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
Execution time: Shark (cached) 126s, Shark 270s, Hive 447s
34. Streaming Spark
Many key big data apps must run in real time
» Live event reporting, click analysis, spam filtering, …
Event-passing systems (e.g. Storm) are low-level
» Users must worry about FT, state, consistency
» Programming model is different from batch, so must
write each app twice
Can we give streaming a Spark-like interface?
35. Our Idea
Run streaming computations as a series of very
short (<1 second) batch jobs
» “Discretized stream processing”
Keep state in memory as RDDs (automatically
recover from any failure)
Provide a functional API similar to Spark
36. Spark Streaming API
Functional operators on discretized streams
New “stateful” operators for windowing
pageViews = readStream("...", "1s")
ones = pageViews.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

sliding = ones.reduceByWindow("5s", _ + _, _ - _)   // sliding window reduce with "add" and "subtract" functions

[Diagram: the pageViews, ones, and counts D-streams shown as series of RDDs/partitions at t = 1, t = 2, ..., with the map and reduce transformations applied to each batch.]
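The operators on this slide predate the released Spark Streaming API. For reference, a minimal sketch of the same page-view counting in the API that eventually shipped, assuming a socket source on a hypothetical host and port:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("PageViewCounts")
val ssc = new StreamingContext(conf, Seconds(1))        // 1-second batches
ssc.checkpoint("hdfs://namenode:8020/checkpoints")      // needed for windowed state below

val pageViews = ssc.socketTextStream("loghost", 9999)   // assumed: one URL per line
val ones = pageViews.map(url => (url, 1))

// Sliding 5-second window, updated every second, with "add" and "subtract" functions
val sliding = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))
sliding.print()

ssc.start()
ssc.awaitTermination()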
37. Streaming + Batch + Ad-Hoc
Combining D-streams with historical data:
pageViews.join(historicCounts).map(...)
Interactive ad-hoc queries on stream state:
counts.slice("21:00", "21:05").topK(10)
38. How Fast Can It Go?
Can process 4 GB/s (42M records/s) of data on 100
nodes at sub-second latency
Recovers from failures within 1 sec
[Charts: cluster throughput (GB/s) vs number of nodes (up to 100) for Grep and TopKWords, showing the maximum throughput possible with 1 sec or 2 sec latency.]
39. Performance vs Storm
[Charts: Grep and TopK throughput (MB/s/node) for Spark vs Storm at record sizes of 100 and 10,000 bytes.]
Storm limited to 10,000 records/s/node
Also tried S4: 7000 records/s/node
41. Conclusion
Spark & Shark speed up your interactive, complex,
and (soon) streaming analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy local mode and deploy scripts for EC2
User meetup: meetup.com/spark-users
Training camp at Berkeley in August!
matei@berkeley.edu / @matei_zaharia
42. Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory: cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5.]
43. Software Stack
[Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, and more run on top of Spark, which runs in local mode or on EC2, Apache Mesos, or YARN.]
Editor's Notes
Each iteration is, for example, a MapReduce job
Add “variables” to the “functions” in functional programming
Key idea: add “variables” to the “functions” in functional programming
This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
This will show interactive search on 50 GB of Wikipedia data
This will show interactive search on 50 GB of Wikipedia data
Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join
This will show interactive search on 50 GB of Wikipedia data
This will show interactive search on 50 GB of Wikipedia data
Streaming Spark offers similar speed while providing FT and consistency guarantees that these systems lack