This document discusses using the DryadLINQ framework to perform data-intensive computing on Windows HPC Server. DryadLINQ allows developers to write LINQ queries over distributed datasets using a declarative programming model. It automatically parallelizes queries by generating execution plans that leverage both intra-node parallelism using PLINQ and inter-node parallelism using the Dryad distributed execution engine. DryadLINQ integrates with .NET and provides type safety while handling serialization, distribution, and failure recovery for queries over large datasets on clusters.
Slides for my talk at Hadoop Summit Dublin, April 2016.
The talk motivates how streaming can subsume batch use cases, using continuous counting as an example.
Introducing Arc: A Common Intermediate Language for Unified Batch and Stream... - Flink Forward
Today's end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensor transformations, and graphs. For each of these workload types there are several frontends (e.g., DataFrames/SQL, Beam, Keras) based on different programming languages, as well as different runtimes (e.g., Spark, Flink, TensorFlow) that target a particular frontend and possibly a hardware architecture (e.g., GPUs). Simply putting all the pieces of a data pipeline together leads to excessive data materialisation, type conversions and hardware utilisation, as well as mismatches of processing guarantees.
Our research group at RISE and KTH in Sweden has created Arc, an intermediate language that bridges the gap between any frontend and a dataflow runtime (e.g., Flink) through a set of fundamental building blocks for expressing data pipelines. Arc incorporates Flink- and Beam-inspired stream semantics such as windows, state and out-of-order processing, as well as concepts found in batch computation models. With Arc, we can cross-compile and optimise diverse tasks written in any programming language into a unified dataflow program. Arc programs can run efficiently on various hardware backends and also allow seamless, distributed execution on dataflow runtimes. To that end, we showcase Arcon, a concept runtime built in Rust that can execute Arc programs natively, and present a minimal set of extensions to make Flink an Arc-ready runtime.
This document provides an overview of CBStreams, a ColdFusion module that implements Java Streams functionality for processing data in a functional programming style. It discusses key concepts like lazy evaluation, intermediate operations that transform streams, and terminal operations that produce final results. Examples are given for building streams from various data sources, applying filters, maps, reductions and more. Lambda expressions and closures play an important role in functional-style stream processing.
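The lazy-evaluation model described above can be illustrated in a few lines. This is a plain-Python analogy, not the CBStreams API: intermediate operations build a lazy pipeline and do no work until a terminal operation pulls values through it.

```python
data = range(1, 11)

# Intermediate operations: filter then map, expressed as a generator.
# Nothing is computed yet -- the generator is only a recipe.
pipeline = (x * x for x in data if x % 2 == 0)

# Terminal operation: sum() drives the evaluation and produces a result.
result = sum(pipeline)
print(result)  # 4 + 16 + 36 + 64 + 100 = 220
```

The same split applies in CBStreams and Java Streams: chaining filters and maps is free, and only the terminal operation (a reduction, collection, or count) triggers a single pass over the data.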
The Mechanics of Testing Large Data Pipelines (QCon London 2016) - Mathieu Bastian
Talk about testing large data pipelines, mostly inspired by my experience at LinkedIn working on relevancy and recommender system pipelines.
Abstract: Applied machine learning data pipelines are being developed at a very fast pace and often exceed traditional web/business application codebases in scale and complexity. The algorithms and processes these data workflows implement fulfill business-critical applications which require robust and scalable architectures. But how do we make these data pipelines robust? When the number of developers and data jobs grows while at the same time the underlying data changes, how do we test that everything works as expected?
In software development we divide things into clean, independent modules and use unit and integration testing to prevent bugs and regressions. So why is it more complicated with big data workflows? Partly because these workflows usually pull data from dozens of sources outside our control and have a large number of interdependent data processing jobs. And partly because we don't yet know how, or lack the proper tools.
Flink 0.10 @ Bay Area Meetup (October 2015) - Stephan Ewen
Flink 0.10 focuses on operational readiness with improvements to high availability, monitoring, and integration with other systems. It provides first-class support for event time processing and refines the DataStream API to be both easy to use and powerful for stream processing tasks.
Analyzing Blockchain Transactions in Apache Spark with Jiri Kremser - Databricks
Blockchain has become a buzzword: people are excited about distributed ledgers and cryptocurrencies, but these technologies are shrouded in myths and misunderstanding. This talk will shed some light on how this awesome technology is actually used in practice by using Apache Spark to analyze blockchain transactions.
We’ll start with a brief introduction to blockchain transactions and how we can ETL transaction graph data obtained from the public binary format. Then we will look at how to model graph data in Spark, briefly comparing GraphFrames and GraphX. The majority of the presentation will be a live demo, running on Spark in the cloud, showing how we can run various queries on the transaction graph data, solve graph algorithms such as PageRank for identifying significant BTC addresses, observe network evolution, and more.
All of the work described in this talk is published as open source code, and all of the data, as well as all the containers, are publicly available for community experimentation. You will leave this talk with a better understanding of blockchain technology and graph processing in Spark, and you will have the concrete tools to reproduce my research or start answering your own questions.
Distributed Stream Processing - Spark Summit East 2017 - Petr Zapletal
The document discusses distributed stream processing frameworks. It provides an overview of frameworks like Storm, Spark Streaming, Samza, Flink, and Kafka Streams. It compares aspects of different frameworks like programming models, delivery guarantees, fault tolerance, and state management. General guidelines are given for choosing a framework based on needs like latency requirements and state needs. Storm and Trident are recommended for low latency tasks while Spark Streaming and Flink are more full-featured but have higher latency. The document provides code examples for word count in different frameworks.
Extending Spark for Qbeast's SQL Data Source with Paola Pardo and Cesare Cug... - Qbeast
Slides of the Barcelona Spark meetup of the 24th of October 2019. The recording is available at https://www.youtube.com/watch?v=eCoCcBH4hIU.
Abstract
One of the key strengths of Spark is its flexibility as it integrates with dozens of different storage systems and file formats. However, it is not the same reading from a CSV file, or a SQL database, or an exotic stratified sampled multidimensional database. And finding the right balance between modularity and flexibility is not easy!
In this presentation, we will talk about the evolution of Spark's DataSource API and how it integrates with the SQL optimizer, highlighting how we can make much faster queries with logical and physical plans that better integrate with the storage. From theory to practice, we will then discuss how we extended Spark's internals and built a new source integration that allows the push-down of both sampling and multidimensional filtering.
About the speakers:
Paola Pardo is a computer engineer from Barcelona. She graduated in Computer Engineering last summer from the Technical University of Catalunya with a thesis focused on data storage push-down optimization based on Apache Spark. She is currently working at the Barcelona Supercomputing Center and at its spin-off Qbeast, developing the Qbeast-Spark connector.
Cesare Cugnasco holds a PhD in Computer Architecture and is a researcher at the Barcelona Supercomputing Center. His research focuses on NoSQL databases, distributed computing and high-performance storage. He invented and patented a new database architecture for big data, and he is building a spin-off for its commercialization.
This document provides concise summaries of key points about Flink:
1) After submitting a Flink job, the client creates and submits the job graph to the JobManager, which then creates an execution graph and deploys tasks across TaskManagers for parallel execution.
2) The batch optimizer chooses optimal execution plans by evaluating physical execution strategies like join algorithms and data shipping approaches to minimize data shuffling and network usage.
3) Flink iterations are optimized by having the runtime directly handle caching, state maintenance, and pushing work out of loops to avoid scheduling overhead between iterations. Delta iterations further improve efficiency by only updating changed elements in each iteration.
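The delta-iteration idea in point 3 can be sketched in plain Python (a conceptual illustration, not Flink's API): each superstep processes only the "workset" of elements that changed in the previous round, instead of recomputing the whole solution set. The example uses connected components with label propagation, a common use case for delta iterations.

```python
def connected_components(edges, vertices):
    # Build an adjacency map for an undirected graph.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}  # solution set: current component id
    workset = set(vertices)           # elements that changed last round

    while workset:
        changed = set()
        for v in workset:
            for n in neighbors[v]:
                if label[v] < label[n]:  # propagate the smaller component id
                    label[n] = label[v]
                    changed.add(n)
        workset = changed  # next superstep touches only updated elements
    return label

print(connected_components([(1, 2), (2, 3), (5, 6)], [1, 2, 3, 5, 6]))
```

As the graph converges, the workset shrinks toward empty, which is exactly why delta iterations beat full recomputation on sparse updates.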
What is SamzaSQL, and what might I use it for? Does this mean that Samza is turning into a database? What is a query optimizer, and what can it do for my streaming queries?
How does Apache Calcite parse, validate and optimize streaming SQL queries? How is relational algebra extended to handle streaming?
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014 - Luigi Dell'Aquila
This document discusses using a multi-model database approach to manage time series and event sequence data. It describes some common approaches like using a relational database with timestamp fields or storing events in a document database. It then outlines how OrientDB combines graph and document models to provide flexibility while maintaining fast write and read speeds. Events can be connected in the graph and stored as documents to allow for relationships and complex properties. The document summarizes how OrientDB allows aggregating data during writes using hooks and querying pre-aggregated data to enable fast analysis of time-based data.
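The write-time aggregation pattern described above can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not OrientDB's hook API: every event insert also updates a pre-aggregated hourly counter, so analytical reads never scan the raw events.

```python
from collections import defaultdict

events = []                       # raw event "documents"
hourly_counts = defaultdict(int)  # pre-aggregated data, maintained on write

def insert_event(event):
    events.append(event)                       # store the raw document
    hour = event["ts"] - event["ts"] % 3600    # truncate timestamp to the hour
    hourly_counts[(event["name"], hour)] += 1  # "hook": aggregate during write

insert_event({"name": "login", "ts": 7200})
insert_event({"name": "login", "ts": 7260})
insert_event({"name": "click", "ts": 7300})

# A time-based query reads the tiny aggregate, not the event log.
print(hourly_counts[("login", 7200)])  # 2
```

The trade-off is the classic one: slightly more work per write in exchange for constant-time reads over arbitrarily long histories.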
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in... - Data Con LA
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. To aid this effort, we built Titian, a library that enables tracking data provenance through transformations in Apache Spark.
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
Grokking Techtalk #38: Escape Analysis in Go compiler - Grokking VN
This document discusses escape analysis in the Go compiler. It provides an overview of the Go language and compiler, including the main phases of parsing, type checking and AST transformations, SSA form, and generating machine code. It notes that the type checking phase contains several sub-phases, including escape analysis, which determines if local variables can be allocated to the stack instead of the heap. The document then delves into how escape analysis is implemented in the Go compiler.
What's new with Apache Spark's Structured Streaming? - Miklos Christine
Structured Streaming in Apache Spark allows users to write streaming applications as batch-style queries on static or streaming data sources. It treats streams as continuous unbounded tables and allows batch queries written on DataFrames/Datasets to be automatically converted into incremental execution plans to process streaming data in micro-batches. This provides a simple yet powerful API for building robust stream processing applications with end-to-end fault tolerance guarantees and integration with various data sources and sinks.
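The incremental micro-batch model described above can be sketched in plain Python (a conceptual analogy, not the Spark API): a running aggregate is updated with each new batch of rows instead of being recomputed over the whole unbounded "table".

```python
from collections import Counter

state = Counter()  # running aggregation state, carried across micro-batches

def process_micro_batch(batch):
    # Incremental update: only the new rows are touched, yet the returned
    # result is as if the query ran over everything seen so far.
    state.update(batch)
    return dict(state)

print(process_micro_batch(["a", "b", "a"]))  # {'a': 2, 'b': 1}
print(process_micro_batch(["b", "c"]))       # {'a': 2, 'b': 2, 'c': 1}
```

Checkpointing this state is what gives the real system its end-to-end fault tolerance: after a failure, the engine restores the state and replays only the unprocessed batches.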
William Vambenepe – Google Cloud Dataflow and Flink, Stream Processing by De... - Flink Forward
1. Google Cloud Dataflow is a fully managed service that allows users to define data processing pipelines that can run batch or streaming computations.
2. The Dataflow programming model defines pipelines as directed graphs of transformations on collections of data elements. This provides flexibility in how computations are defined across batch and streaming workloads.
3. The Dataflow service handles graph optimization, scaling of workers, and monitoring of jobs to efficiently execute user-defined pipelines on Google Cloud Platform.
This document provides an overview of Spark, including:
- Spark's processing model involves chopping live data streams into batches and treating each batch as an RDD to apply transformations and actions.
- Resilient Distributed Datasets (RDDs) are Spark's primary abstraction, representing an immutable distributed collection of objects that can be operated on in parallel.
- An example word count program is presented to illustrate how to create and manipulate RDDs to count the frequency of words in a text file.
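The word count example can be mirrored in plain Python with the same shape of transformation chain (flatMap, then map and reduceByKey); this is an analogy to the RDD program, not Spark code.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "or not to be"]

# flatMap: split each line into words and flatten into one stream.
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: Counter pairs each word with 1 and sums per key.
counts = Counter(words)
print(counts["to"])  # 3
print(counts["be"])  # 3
```

In actual Spark the same chain runs partition-by-partition across the cluster, with a shuffle between the map and reduce steps.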
Extending Flink State Serialization for Better Performance and Smaller Checkp... - Flink Forward
Operations with Flink state are a common source of performance issues for a typical stateful stream processing application. One tiny mistake can easily make your job spend most of its precious CPU time in serialization and inflate the checkpoint size sky-high. In this talk we’ll focus on the Flink serialization framework and common problems happening around it:
* Is the Kryo fallback really that expensive from the CPU and state size perspective?
* How to plug your own or existing serializers into Flink (e.g., protobuf).
* Using Scala sealed traits without Kryo fallback.
* Using custom integer variable-length encoding and delta encoding for primitive arrays to further reduce the state size.
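The two encodings named in the last bullet can be sketched concretely (a generic illustration, not Flink's serializer API): delta encoding turns a sorted primitive array into small gaps, and variable-length (varint) encoding then stores each small integer in as few bytes as possible.

```python
def encode_varint(n):
    # Base-128 varint: 7 payload bits per byte, high bit = "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_deltas(sorted_values):
    # Store the gap to the previous value instead of the value itself.
    prev, out = 0, bytearray()
    for v in sorted_values:
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)

timestamps = [1700000000, 1700000005, 1700000012]
encoded = encode_deltas(timestamps)
print(len(encoded))  # 7 bytes, vs. 24 bytes for three fixed-width 8-byte longs
```

The first value still costs five varint bytes, but each subsequent gap fits in one byte, which is why the combination shrinks state holding sorted ids or timestamps so effectively.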
Rapid Web API development with Kotlin and Ktor - Trayan Iliev
Introduction to Kotlin and Ktor with flow, async and channel examples. Ktor is an async web framework with minimal ceremony that leverages the advantages of Kotlin, such as coroutines and extensible functional DSLs.
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018 - Seunghyun Lee
Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
Design and Implementation of the Security Graph Language - Asankhaya Sharma
Today software is built in fundamentally different ways from how it was a decade ago. It is increasingly common for applications to be assembled out of open-source components, resulting in the use of large amounts of third-party code. This third-party code is a means for vulnerabilities to make their way downstream into applications. Recent vulnerabilities such as Heartbleed, FREAK SSL/TLS, GHOST, and the Equifax data breach (due to a flaw in Apache Struts) were ultimately caused by third-party components. We argue that an automated way to audit the open-source ecosystem, catalog existing vulnerabilities, and discover new flaws is essential to using open-source safely. To this end, we describe the Security Graph Language (SGL), a domain-specific language for analysing graph-structured datasets of open-source code and cataloguing vulnerabilities. SGL allows users to express complex queries on relations between libraries and vulnerabilities in the style of a program analysis language. SGL queries double as an executable representation for vulnerabilities, allowing vulnerabilities to be automatically checked against a database and deduplicated using a canonical representation. We outline a novel optimisation for SGL queries based on regular path query containment, improving query performance by up to 3 orders of magnitude. We also demonstrate the effectiveness of SGL in practice to find zero-day vulnerabilities by identifying sever
This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts like RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning through its ability to keep data in-memory. Example code is shown for Spark's programming model in Scala and Python. The document concludes that Spark offers a rich API to make data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot - Citus Data
Citus is a sharding extension for postgres that can efficiently distribute a wide range of SQL queries. It uses postgres' planner hook to transparently intercept and plan queries on "distributed" tables. Citus then executes the queries in parallel across many servers, in a way that delegates most of the heavy lifting back to postgres.
Within Citus, we distinguish between several types of SQL queries, which each have their own planning logic:
Local-only queries
Single-node “router” queries
Multi-node “real-time” queries
Multi-stage queries
Each type of query corresponds to a different use case, and Citus implements several planners and executors using different techniques to accommodate the performance requirements and trade-offs for each use case.
This talk will discuss the internals of the different types of planners and executors for distributing SQL on top of postgres, and how they can be applied to different use cases.
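The router vs. real-time distinction above can be sketched in plain Python (a conceptual illustration, not Citus internals): a query filtered to one value of the distribution column is routed to a single shard, while an unfiltered query fans out across every shard.

```python
SHARD_COUNT = 4
shards = {i: [] for i in range(SHARD_COUNT)}  # shard id -> rows

def shard_for(key):
    # Hash the distribution column to pick a shard.
    return hash(key) % SHARD_COUNT

def insert(row):
    shards[shard_for(row["user_id"])].append(row)

def query(user_id=None):
    if user_id is not None:
        # "Router" query: the filter pins it to exactly one shard.
        return [r for r in shards[shard_for(user_id)] if r["user_id"] == user_id]
    # "Real-time" query: fan out to all shards and merge the results.
    results = []
    for rows in shards.values():
        results.extend(rows)
    return results

insert({"user_id": 7, "event": "click"})
insert({"user_id": 9, "event": "view"})
print(len(query(user_id=7)))  # 1
print(len(query()))           # 2
```

The router path is why single-tenant queries scale so well under sharding: each one touches one node, so throughput grows with the cluster.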
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala... - Martin Zapletal
The document discusses distributed machine learning and data processing. It covers several topics including reasons for using distributed machine learning, different distributed computing architectures and primitives, distributed data stores and analytics tools like Spark, streaming architectures like Lambda and Kappa, and challenges around distributed state management and fault tolerance. It provides examples of failures in distributed databases and suggestions to choose the appropriate tools based on the use case and understand their internals.
Brief introduction on Hadoop, Dremel, Pig, FlumeJava and Cassandra - Somnath Mazumdar
This document provides an overview of several big data technologies including MapReduce, Pig, Flume, Cascading, and Dremel. It describes what each technology is used for, how it works, and example applications. MapReduce is a programming model for processing large datasets in a distributed environment, while Pig, Flume, and Cascading build upon MapReduce to provide higher-level abstractions. Dremel is an interactive query system for nested and complex datasets that uses a column-oriented data storage format.
This document discusses tools for generating random test data and load testing applications: Tsung, ScalaCheck, and Gatling. It provides an overview of how each tool works and how they can be combined. Tsung is an open source load testing tool that can simulate users and load test applications. ScalaCheck is a property-based testing library that can generate random test data. Gatling is an open source load testing framework that drives load tests with scenarios and simulated users. The document shows how random test data generated with ScalaCheck can be fed into Gatling load tests using feeders.
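The generator-plus-feeder combination can be sketched in plain Python (a hedged analogy; ScalaCheck and Gatling each have their own APIs): the seeded random generator plays the ScalaCheck role, and the loop feeding records to a request function plays the Gatling feeder role.

```python
import random
import string

def gen_user(rng):
    # Random but structurally valid test record.
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {"name": name, "email": f"{name}@example.com", "age": rng.randint(18, 90)}

def feeder(seed, n):
    # Seeded generation, so a failing load run is reproducible.
    rng = random.Random(seed)
    return [gen_user(rng) for _ in range(n)]

def simulate_request(user):
    # Stand-in for the system under load; hypothetical check.
    return 200 if "@" in user["email"] and user["age"] >= 18 else 400

responses = [simulate_request(u) for u in feeder(seed=42, n=100)]
print(all(code == 200 for code in responses))  # True
```

Seeding the generator is the design choice that matters: random inputs find edge cases, and the seed lets you replay the exact sequence that exposed one.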
Extending Spark for Qbeast's SQL Data Source with Paola Pardo and Cesare Cug...Qbeast
Slides of the Barcelona Spark meetup of the 24th of October 2019. The recording is available at https://www.youtube.com/watch?v=eCoCcBH4hIU.
Abstract
One of the key strengths of Spark is its flexibility as it integrates with dozens of different storage systems and file formats. However, it is not the same reading from a CSV file, or a SQL database, or an exotic stratified sampled multidimensional database. And finding the right balance between modularity and flexibility is not easy!
In this presentation, we will talk about the evolution of Spark's DataSource API, and how it integrates with the SQL optimizer, highlighting how we can make much faster queries with logical and the physical plans that better integrates with the storage. From theory to practise, we will then discuss how we extended the Spark's internals, and we built a new source integration that allows the push-down of both sampling and multidimensional filtering.
About the speakers:
Paola Pardo is a Computer Engineer from Barcelona. She graduated in Computer engineer this last summer at the Technical University of Catalunya with a thesis focused on Data storage push down optimization based on Apache Spark. She is, and she is currently working at Barcelona Supercomputing Center and in its spin-off Qbeast developing a Qbeast-Spark connector.
Cesare Cugnasco is a PhD in Computer Architecture and a researcher at the Barcelona Supercomputing Center. His research focuses on NoSQL databases, distributed computing and High-performance storage. He invented and patented a new database architecture for Big Data, and he is building a spin-off for its commercialization.
Flink provides concise summaries of key points:
1) After submitting a Flink job, the client creates and submits the job graph to the JobManager, which then creates an execution graph and deploys tasks across TaskManagers for parallel execution.
2) The batch optimizer chooses optimal execution plans by evaluating physical execution strategies like join algorithms and data shipping approaches to minimize data shuffling and network usage.
3) Flink iterations are optimized by having the runtime directly handle caching, state maintenance, and pushing work out of loops to avoid scheduling overhead between iterations. Delta iterations further improve efficiency by only updating changed elements in each iteration.
What is SamzaSQL, and what might I use it for? Does this mean that Samza is turning into a database? What is a query optimizer, and what can it do for my streaming queries?
How does Apache Calcite parse, validate and optimize streaming SQL queries? How is relational algebra extended to handle streaming?
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014Luigi Dell'Aquila
This document discusses using a multi-model database approach to manage time series and event sequence data. It describes some common approaches like using a relational database with timestamp fields or storing events in a document database. It then outlines how OrientDB combines graph and document models to provide flexibility while maintaining fast write and read speeds. Events can be connected in the graph and stored as documents to allow for relationships and complex properties. The document summarizes how OrientDB allows aggregating data during writes using hooks and querying pre-aggregated data to enable fast analysis of time-based data.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. To aid this effort, we built Titian, a library that enables data provenance tracking data through transformations in Apache Spark.
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
Grokking Techtalk #38: Escape Analysis in Go compilerGrokking VN
This document discusses escape analysis in the Go compiler. It provides an overview of the Go language and compiler, including the main phases of parsing, type checking and AST transformations, SSA form, and generating machine code. It notes that the type checking phase contains several sub-phases, including escape analysis, which determines if local variables can be allocated to the stack instead of the heap. The document then delves into how escape analysis is implemented in the Go compiler.
What's new with Apache Spark's Structured Streaming?Miklos Christine
Structured Streaming in Apache Spark allows users to write streaming applications as batch-style queries on static or streaming data sources. It treats streams as continuous unbounded tables and allows batch queries written on DataFrames/Datasets to be automatically converted into incremental execution plans to process streaming data in micro-batches. This provides a simple yet powerful API for building robust stream processing applications with end-to-end fault tolerance guarantees and integration with various data sources and sinks.
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
1. Google Cloud Dataflow is a fully managed service that allows users to define data processing pipelines that can run batch or streaming computations.
2. The Dataflow programming model defines pipelines as directed graphs of transformations on collections of data elements. This provides flexibility in how computations are defined across batch and streaming workloads.
3. The Dataflow service handles graph optimization, scaling of workers, and monitoring of jobs to efficiently execute user-defined pipelines on Google Cloud Platform.
This document provides an overview of Spark, including:
- Spark's processing model involves chopping live data streams into batches and treating each batch as an RDD to apply transformations and actions.
- Resilient Distributed Datasets (RDDs) are Spark's primary abstraction, representing an immutable distributed collection of objects that can be operated on in parallel.
- An example word count program is presented to illustrate how to create and manipulate RDDs to count the frequency of words in a text file.
Extending Flink State Serialization for Better Performance and Smaller Checkp...Flink Forward
Operations on Flink state are a common source of performance issues in a typical stateful stream processing application. One tiny mistake can easily make your job spend most of its precious CPU time on serialization and inflate the checkpoint size to the sky. In this talk we’ll focus on the Flink serialization framework and common problems happening around it:
* Is the Kryo fallback really that expensive from the CPU and state-size perspective?
* How to plug your own or existing serializers (like protobuf) into Flink.
* Using Scala sealed traits without the Kryo fallback.
* Using custom variable-length integer encoding and delta encoding for primitive arrays to further reduce state size.
Rapid Web API development with Kotlin and KtorTrayan Iliev
Introduction to Kotlin and Ktor, with flow, async, and channel examples. Ktor is an async web framework with minimal ceremony that leverages Kotlin's advantages such as coroutines and extensible functional DSLs.
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018Seunghyun Lee
Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
Design and Implementation of the Security Graph LanguageAsankhaya Sharma
Today software is built in fundamentally different ways from how it was a decade ago. It is increasingly common for applications to be assembled out of open-source components, resulting in the use of large amounts of third-party code. This third-party code is a means for vulnerabilities to make their way downstream into applications. Recent vulnerabilities such as Heartbleed, FREAK SSL/TLS, GHOST, and the Equifax data breach (due to a flaw in Apache Struts) were ultimately caused by third-party components. We argue that an automated way to audit the open-source ecosystem, catalog existing vulnerabilities, and discover new flaws is essential to using open-source safely. To this end, we describe the Security Graph Language (SGL), a domain-specific language for analysing graph-structured datasets of open-source code and cataloguing vulnerabilities. SGL allows users to express complex queries on relations between libraries and vulnerabilities in the style of a program analysis language. SGL queries double as an executable representation for vulnerabilities, allowing vulnerabilities to be automatically checked against a database and deduplicated using a canonical representation. We outline a novel optimisation for SGL queries based on regular path query containment, improving query performance by up to 3 orders of magnitude. We also demonstrate the effectiveness of SGL in practice by finding several zero-day vulnerabilities.
This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts like RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning through its ability to keep data in-memory. Example code is shown for Spark's programming model in Scala and Python. The document concludes that Spark offers a rich API to make data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotCitus Data
Citus is a sharding extension for postgres that can efficiently distribute a wide range of SQL queries. It uses postgres' planner hook to transparently intercept and plan queries on "distributed" tables. Citus then executes the queries in parallel across many servers, in a way that delegates most of the heavy lifting back to postgres.
Within Citus, we distinguish between several types of SQL queries, which each have their own planning logic:
Local-only queries
Single-node “router” queries
Multi-node “real-time” queries
Multi-stage queries
Each type of query corresponds to a different use case, and Citus implements several planners and executors using different techniques to accommodate the performance requirements and trade-offs for each use case.
This talk will discuss the internals of the different types of planners and executors for distributing SQL on top of postgres, and how they can be applied to different use cases.
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
The document discusses distributed machine learning and data processing. It covers several topics including reasons for using distributed machine learning, different distributed computing architectures and primitives, distributed data stores and analytics tools like Spark, streaming architectures like Lambda and Kappa, and challenges around distributed state management and fault tolerance. It provides examples of failures in distributed databases and suggestions to choose the appropriate tools based on the use case and understand their internals.
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
This document provides an overview of several big data technologies including MapReduce, Pig, Flume, Cascading, and Dremel. It describes what each technology is used for, how it works, and example applications. MapReduce is a programming model for processing large datasets in a distributed environment, while Pig, Flume, and Cascading build upon MapReduce to provide higher-level abstractions. Dremel is an interactive query system for nested and complex datasets that uses a column-oriented data storage format.
This document discusses tools that can be used to generate random test data and load test applications: Tsung, ScalaCheck, and Gatling. It provides an overview of how each tool works and how they can be combined. Tsung is an open-source tool that can simulate users and load test applications. ScalaCheck is a property-based testing library that can generate random test data. Gatling is an open-source load testing framework built around scenarios and simulated users. The document shows how data generated by ScalaCheck can be fed into Gatling load tests using feeders.
What s an Event ? How Ontologies and Linguistic Semantics ...butest
The document discusses challenges for machine learning models in extracting information about events from text. It describes different approaches to representing events, from single relations to complex structures of interconnected subevents, and issues around learning event representations, connectivity between subevents, and encoding temporal and causal relationships. Proper representation of events requires associating linguistic and ontological knowledge about events and their components.
This document summarizes a request for proposal from the Missouri Office of Administration for janitorial services in various state-owned buildings in St. Louis. It provides information on submitting proposals, including revising the return date to April 14, 2010. It outlines amendments made to sections of the RFP related to the contract period, specifications, and pricing. A pre-proposal conference was scheduled for March 30, 2010 to discuss the RFP and allow for questions.
The document discusses coping with the rising cost of living in old age through passive income from the binary network-marketing business of PT Melia Nature Indonesia. The business offers two health products with a starting capital of Rp580,000, and promises passive income of Rp850,000 per week from recruiting two members; income can grow by recruiting more members.
Apache Samza is a stream processing framework that provides high-level APIs and powerful stream processing capabilities. It is used by many large companies for real-time stream processing. The document discusses Samza's stream processing architecture at LinkedIn, how it scales to process billions of messages per day across thousands of machines, and new features around faster onboarding, powerful APIs including Apache Beam support, easier development through high-level APIs and tables, and better operability in YARN and standalone clusters.
LINQ to HPC: Developing Big Data Applications on Windows HPC ServerSaptak Sen
This document discusses large-scale data processing enabled by new technologies. It notes that large data volumes from 100s of TBs to 10s of PBs can now be processed at low cost using distributed parallel frameworks like MapReduce. New data sources include sensors, devices, and unstructured data like text and images. These new technologies enable analyzing this data to answer questions and gain new insights about product popularity, best ads to serve, and detecting fraud.
The document describes a gate pass system project that provides security by passing elements through multiple verification steps: checking an element, showing its status, and passing it through the security levels. The project defines many attributes and properties to demonstrate its security features.
Visual Studio 2010 includes many new features to improve the developer experience such as breakpoint grouping, parallel debugging tools, and a more extensible architecture. It can be used both as a robust code editor and as a platform for extensions. .NET 4.0 focuses on four main areas: better component integration, improved performance through parallelism and concurrency, enhanced language features, and reducing bugs. It includes new libraries like PLINQ and TPL for parallel programming and MEF for extensibility.
This document discusses LINQ (Language Integrated Query), which is Microsoft's technology for querying data from various sources using a common language. LINQ allows querying data from in-memory collections, databases, XML documents, and other sources. It provides a uniform programming model that is independent of data sources. LINQ queries are executed using deferred execution, where the query is not run until its results are iterated over. The document covers LINQ query expressions, extension methods, implicit typing with LINQ queries, and returning query results.
Continuous Application with Structured Streaming 2.0Anyscale
Introduction to Continuous Application with Apache Spark 2.0 Structured Streaming. This presentation is a culmination and curation from talks and meetups presented by Databricks engineers.
The notebooks on Structured Streaming demonstrates aspects of the Structured Streaming APIs
Hadoop and HBase experiences in perf log projectMao Geng
This document discusses experiences using Hadoop and HBase in the Perf-Log project. It provides an overview of the Perf-Log data format and architecture, describes how Hadoop and HBase were configured, and gives examples of using MapReduce jobs and HBase APIs like Put and Scan to analyze log data. Key aspects covered include matching Hadoop and HBase versions, running MapReduce jobs, using column families in HBase, and filtering Scan results.
This document provides an overview of new features and enhancements in Visual Studio 2005, also known as Whidbey. It summarizes improvements to languages like Visual Basic .NET, Visual C#, C++, and Visual J#, the .NET Framework, data access, web services, and mobile development. Key areas of focus include increased productivity, simplified development, and better integration across languages, frameworks, and platforms.
This document discusses different data access technologies in .NET, including LINQ, the ADO.NET Entity Framework, and ADO.NET Data Services. It introduces LINQ as a way to query objects using SQL-like syntax. It describes how LINQ can be used with datasets, entity models, and other data stores. The Entity Framework is introduced as providing LINQ support along with a richer object model and query capabilities. ADO.NET Data Services exposes data through a RESTful API for web and mobile clients.
LINQ (Language Integrated Query) is Microsoft's technology that allows querying of data from various sources using a common language syntax. It provides a uniform programming model to query and manipulate data regardless of data source. LINQ queries can be executed against in-memory collections, databases, XML documents, and other data sources. The document discusses various LINQ concepts such as LINQ query expressions, deferred execution, implicit typing with var, extension methods of the Enumerable class, and LINQ to Objects for querying arrays and generic/non-generic collections.
Fabric is a scalable real-time stream processing framework developed by Ola. It is designed for high throughput event ingestion from various sources and writing events to different targets. Fabric provides batch processing of events, scalability, reliability and makes data available for other applications in near real-time. It uses components like sources, processors and executors along with a compute framework to orchestrate event flows. Fabric is proven to reliably handle over 2.5 million events per second for applications like fraud detection at Ola.
BenchmarkDotNet - Powerful .NET library for benchmarkingLarry Nung
BenchmarkDotNet is a powerful .NET library for benchmarking code. It allows developers to easily write benchmarking code, run benchmarks across different runtimes and environments, and view results in various formats like markdown, CSV, HTML and plain text. Key features include support for multiple .NET implementations, automatic warmup and overhead evaluation, parameterization of benchmarks, and diagnostic tools to analyze benchmark performance.
OSCON 2014 - API Ecosystem with Scala, Scalatra, and Swagger at NetflixManish Pandit
This document discusses Netflix's API ecosystem built using Scala, Scalatra, and Swagger. It summarizes Netflix's use of these technologies to build APIs that power their consumer electronics partner portal and enable certification of Netflix ready devices. It describes how the APIs provide a single source of truth for all device data at Netflix and correlate streaming quality metrics. It then discusses aspects of the architecture including the manager layer containing business logic, HTTP layer for handling requests/responses, and use of Scala, Scalatra, Swagger, and deployment process including immutable infrastructure.
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemAccumulo Summit
Timely was born to visualize and analyze metric data at a scale untenable for existing solutions. We're returning to talk about what we've achieved over the past year, provide a detailed look into production architecture and discuss additional features added within the past year including alerting and support for external analytics.
– Speakers –
Drew Farris
Chief Technologist, Booz Allen Hamilton
Drew Farris is a software developer and technology consultant at Booz Allen Hamilton where he helps his client solve problems related to large scale analytics, distributed computing and machine learning. He is a member of the Apache Software Foundation and a contributing author to Manning Publications’ “Taming Text” and the Booz Allen Hamilton “Field Guide to Data Science”.
Bill Oley
Senior Lead Engineer, Booz Allen Hamilton
Bill Oley is a senior lead software engineer at Booz Allen Hamilton where he helps his clients analyze and solve problems related to large scale data ingest, storage, retrieval, and analysis. He is particularly interested in improving visibility into large scale systems by making actionable metrics scalable and usable. He has 16 years of experience designing and developing fault-tolerant distributed systems that operate on continuous streams of data. He holds a bachelor's degree in computer science from the United States Naval Academy and a master's degree in computer science from The Johns Hopkins University.
— More Information —
For more information see http://www.accumulosummit.com/
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
Scala Toronto July 2019 event at 500px.
Pure Functional API Integration
Apache Spark Internals tuning
Performance tuning
Query execution plan optimisation
Cats Effects for switching execution model runtime.
Discovery / experience with Monix, Scala Future.
Dataservices: Processing (Big) Data the Microservice WayQAware GmbH
Apache Big Data 2017, Miami (Florida/USA): Talk by Josef Adersberger (@adersberger, CTO at QAware)
Abstract:
We see a big data processing pattern emerging using the Microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we'll introduce into Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some architectures from real-life.
The .NET Framework allows developers to easily develop applications across various platforms and devices. Some key aspects of the .NET Framework 4.0 include improved support for parallel and asynchronous programming using technologies like the Task Parallel Library and improvements to the garbage collector to better optimize application performance. The Dynamic Language Runtime also allows dynamic languages to better interact with the .NET Framework and CLR.
Serverless London 2019 FaaS composition using Kafka and CloudEventsNeil Avery
FaaS composition using Kafka and Cloud-Events
LOCATION: Burton & Redgrave, DATE: November 7, 2019, TIME: 2:30 pm - 3:15 pm
https://serverlesscomputing.london/sessions/faas-composition-using-kafka-and-cloud-events/
Serverless functions, or FaaS, are all the rage. By leveraging well-established event-driven microservice design principles and applying them to serverless functions, we can build a homogeneous ecosystem to run FaaS applications.
Kafka’s natural ability to store and replay events means serverless functions can not only be replayed, but also be used to choreograph call chains or be driven using orchestration. Kafka also means we can democratize and organize FaaS environments in a way that scales across the enterprise.
Underpinning this mantra is the use of Cloud Events by the CNCF serverless working group (of which Confluent is an active member).
Objective of the talk
You will leave the talk with an understanding of what the future of cloud holds, a methodology for embracing serverless functions and how they become part of your journey to a cloud-native, event-driven architecture.
This presentation held in at Inovex GmbH in Munich in November 2015 was about a general introduction of the streaming space, an overview of Flink and use cases of production users as presented at Flink Forward.
Similar to SVR17: Data-Intensive Computing on Windows HPC Server with the ... (20)
This document analyzes YouTube's business model. It explains that YouTube and other online video sites represent a new business model for audiovisual content, driven by the change in consumption habits caused by new technologies. It describes how YouTube leverages user participation to improve continuously and to attract an audience different from that of traditional media.
The defense was successful in portraying Michael Jackson favorably to the jury in several ways:
1) They dressed Jackson in ornate costumes that conveyed images of purity, innocence, and humility.
2) Jackson was shown entering the courtroom as if on a red carpet, emphasizing his celebrity status.
3) Jackson appeared vulnerable, childlike, and in declining health during the trial, eliciting sympathy from jurors.
4) Defense attorney Tom Mesereau effectively presented a coherent narrative of Jackson as a victim and portrayed Neverland as a place of refuge, undermining the prosecution's arguments.
Michael Jackson was born in 1958 in Gary, Indiana and rose to fame in the 1960s as the lead singer of The Jackson 5, topping music charts in the 1970s. As a solo artist in the 1980s, his album Thriller broke music records. In the 1990s and 2000s, Jackson faced several legal issues related to child abuse allegations while continuing to release music. He married Lisa Marie Presley and Debbie Rowe and had two children before his death in 2009.
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
This document appears to be a list of popular books from various authors. It includes over 150 book titles across many genres such as fiction, non-fiction, memoirs, and novels. The books cover a wide range of topics from politics to cooking to autobiographies.
The prosecution lost the Michael Jackson trial due to several key mistakes and weaknesses in their case:
1) The lead prosecutor, Thomas Sneddon, was too personally invested in the case against Jackson, having pursued him for over a decade without success.
2) Sneddon's opening statement was disorganized and weak, failing to effectively outline the prosecution's case.
3) The accuser's mother was not credible and damaged the prosecution's case through her erratic testimony, history of lies and con artist behavior.
4) Many prosecution witnesses were not credible due to prior lawsuits against Jackson, debts owed to him, or having been fired by him. Several witnesses even took the Fifth Amendment.
Here are three examples of public relations from around the world:
1. The UK government's "Be Clear on Cancer" campaign which aims to raise awareness of cancer symptoms and encourage early diagnosis.
2. Samsung's global brand marketing and sponsorship activities which aim to increase brand awareness and favorability of Samsung products worldwide.
3. The Brazilian government's efforts to improve its international image and relations with other countries through strategic communication and diplomacy.
The three most important functions of public relations are:
1. Media relations because the media is how most organizations reach their key audiences. Strong media relationships are crucial.
2. Writing, because written communication is at the core of public relations and how most information is
Michael Jackson Please Wait... provides biographical information about Michael Jackson including his birthdate, birthplace, parents, height, interests, idols, favorite foods, films, and more. It discusses his background, career highlights including influential albums like Thriller, and films he appeared in such as The Wiz and Moonwalker. The document contains photos and details about Jackson's life and illustrious music career.
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
The document discusses the process of manufacturing celebrity and its negative byproducts. It argues that celebrities are rarely the best in their individual pursuits like singing, dancing, etc. but become famous due to being products of a system controlled by wealthy elites. This system stifles opportunities for worthy artists and creates feudalism. The document also asserts that manufactured celebrities should not be viewed as role models due to behaviors like drug abuse and narcissism that result from the celebrity-making process.
Michael Jackson was a child star who rose to fame with the Jackson 5 in the late 1960s and early 1970s. As a solo artist in the 1970s and 1980s, he had immense commercial success with albums like Off the Wall, Thriller, and Bad, which featured hit singles and groundbreaking music videos. However, his career and public image were plagued by controversies related to allegations of child sexual abuse in the 1990s and 2000s. He continued recording and performing but faced ongoing media scrutiny into his private life until his death in 2009.
Social Networks: Twitter Facebook SL - Slide 1butest
The document discusses using social networking tools like Twitter and Facebook in K-12 education. Twitter allows students and teachers to share short updates and can be used to give parents a window into classroom activities. Facebook allows targeted advertising that could be used to promote educational activities. Both tools could help facilitate communication between schools and communities if used properly while managing privacy and security concerns.
Facebook has over 300 million active users who log on daily, and allows brands to create public profile pages to interact with users. Pages are for brands and organizations only, while groups can be made by any user about any topic. Pages do not show admin names and have no limits on fans, while groups display admin names and are limited to 5,000 members. Content on pages should aim to provoke action from subscribers and establish a regular posting schedule using a conversational tone.
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
Hare Chevrolet is a car dealership located in Noblesville, Indiana that has successfully used social media platforms like Twitter, Facebook, and YouTube to create a positive brand image. They invest significant time interacting directly with customers online to foster a sense of community rather than overtly advertising. As a result, Hare Chevrolet has built a large, engaged audience on social media and serves as a model for how brands can use online presences strategically.
Welcome to the Dougherty County Public Library's Facebook and ...butest
This document provides instructions for signing up for Facebook and Twitter accounts. It outlines the sign up process for both platforms, including filling out forms with name, email, password and other details. It describes how the platforms will then search for friends and suggest people to connect with. It also explains how to search for and follow the Dougherty County Public Library page on both Facebook and Twitter once signed up. The document concludes by thanking participants and providing a contact for any additional questions.
Paragon Software announces the release of Paragon NTFS for Mac OS X 8.0, which provides full read and write access to NTFS partitions on Macs. It is the fastest NTFS driver on the market, achieving speeds comparable to native Mac file systems. Paragon NTFS for Mac 8.0 fully supports the latest Mac OS X Snow Leopard operating system in 64-bit mode and allows easy transfer of files between Windows and Mac partitions without additional hardware or software.
This document provides compatibility information for Olympus digital products used with Macintosh OS X. It lists various digital cameras, photo printers, voice recorders, and accessories along with their connection type and any notes on compatibility. Some products require booting into OS 9.1 for software compatibility or do not support devices that need a serial port. Drivers and software are available for download from Olympus and other websites for many products to enable use with OS X.
To use printers managed by the university's Information Technology Services (ITS), students and faculty must install the ITS Remote Printing software on their Mac OS X computer. This allows them to add network printers, log in with their ITS account credentials, and print documents while being charged per page to funds in their pre-paid ITS account. The document provides step-by-step instructions for installing the software, adding a network printer, and printing to that printer from any internet connection on or off campus. It also explains the pay-in-advance printing payment system and how to check printing charges.
The document provides an overview of the Mac OS X user interface for beginners, including descriptions of the desktop, login screen, desktop elements like the dock and hard disk, and how to perform common tasks like opening files and folders. It also addresses frequently asked questions for Windows users switching to Mac OS X, such as where documents are stored, how to save or find documents, and what the equivalent of the C: drive is in Mac OS X. The document concludes with sections on file management tasks like creating and deleting folders, organizing files within applications, using Spotlight search, and an overview of the Dashboard feature.
This document provides a checklist for securing Mac OS X version 10.5, focusing on hardening the operating system, securing user accounts and administrator accounts, enabling file encryption and permissions, implementing intrusion detection, and maintaining password security. It describes the Unix infrastructure and security framework that Mac OS X is built on, leveraging open source software and following the Common Data Security Architecture model. The checklist can be used to audit a system or harden it against security threats.
This document summarizes a course on web design that was piloted in the summer of 2003. The course was a 3 credit course that met 4 times a week for lectures and labs. It covered topics such as XHTML, CSS, JavaScript, Photoshop, and building a basic website. 18 students from various majors enrolled. Student and instructor evaluations found the course to be very successful overall, though some improvements were suggested like ensuring proper software and pairing programming/non-programming students. The document also discusses implications of incorporating web design material into existing computer science curriculums.
SVR17: Data-Intensive Computing on Windows HPC Server with the ...
1. Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework. John Vert, Architect, Microsoft Corporation. SVR17
2. Moving Parts
Windows HPC Server 2008 – cluster management, job scheduling
Dryad – distributed execution engine; failure recovery, distribution, scalability across very large partitioned datasets
LINQ – .NET extensions for declarative query; easy expression of data parallelism; unified data model
PLINQ – multi-core parallelism across LINQ queries
DryadLINQ – brings the ease of LINQ programming to Dryad
3. Software Stack (top to bottom)
.NET applications: image processing, machine learning, graph analysis, data mining
DryadLINQ
Dryad
HPC Job Scheduler
Windows HPC Server 2008 (running on each cluster node)
4. Dryad
Provides a general, flexible distributed execution layer
Dataflow graph as the computation model; can be modified by runtime optimizations
Higher language layer supplies the graph, vertex code, serialization code, and hints for data locality
Automatically handles distributed execution: distributes code, routes data
Schedules processes on machines near the data
Masks failures in the cluster and network
7. LINQ: Language Integrated Query
Declarative extensions to C# and VB.NET for iterating over collections, in memory or via data providers
SQL-like; broadly adoptable by developers
Easy to use; reduces written code
Predictable results; scalable experience
Deep tooling support
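As a minimal sketch of the LINQ-to-Objects style described on this slide (the collection and query names are illustrative, not from the talk), note in particular that the query is declarative and its execution is deferred until the results are enumerated:

    using System;
    using System.Linq;

    class LinqSketch
    {
        static void Main()
        {
            int[] numbers = { 5, 12, 3, 42, 7 };

            // Declarative query over an in-memory collection.
            // Nothing executes yet: LINQ queries are lazily evaluated.
            var bigEvens = from n in numbers
                           where n > 4 && n % 2 == 0
                           orderby n
                           select n;

            foreach (var n in bigEvens)   // the query runs here
                Console.WriteLine(n);     // prints 12, then 42
        }
    }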
8. PLINQ: Parallel Language Integrated Query
Value proposition: enable LINQ developers to take advantage of parallel hardware with only a basic understanding of data parallelism
Declarative data parallelism (focus on the "what", not the "how")
Alternative to LINQ-to-Objects: same set of query operators plus some extras; default is IEnumerable<T> based
Previewed in the Parallel Extensions to .NET Framework 3.5 CTP; shipping in .NET Framework 4.0 Beta 2
9. DryadLINQ: LINQ to clusters
Declarative programming style of LINQ for clusters
Automatic parallelization: the parallel query plan exploits multi-node parallelism; PLINQ underneath exploits multi-core parallelism
Integration with Visual Studio and .NET: type safety, automatic serialization
Query plan optimizations: static optimization rules to optimize locality, plus dynamic run-time optimizations
10. DryadLINQ: From LINQ to Dryad
Automatic query plan generation; distributed query execution by Dryad (LINQ query → query plan → Dryad)
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
Plan diagram nodes: logs → where → select
11. A Simple LINQ Query
IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart && baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
12. A Simple PLINQ Query
IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart && baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
13. A Simple DryadLINQ Query
PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
var results = from baby in babies
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart && baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
14. PartitionedTable<T>: Core data structure for DryadLINQ
Scale-out, partitioned container for .NET objects
Derives from IQueryable<T>, IEnumerable<T>; ToPartitionedTable() extension methods
DryadLINQ operators consume and produce PartitionedTable<T>
DryadLINQ generates code to serialize/deserialize your .NET objects
Underlying storage can be a partitioned file, a partitioned SQL table, or a cluster filesystem
17. A typical data-intensive query
var logs = PartitionedTable.Get<string>("weblogs.pt");
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user.EndsWith(@"vert")
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("jvert", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;
Go through logs and keep only lines that are not comments; parse each line into a new LogEntry object. Go through logentries and keep only entries that are accesses by jvert. Group jvert's accesses by the page they correspond to, and count the occurrences for each page. Sort the pages jvert has accessed by access frequency.
18. Dryad parallel DAG execution
The query from the previous slide compiles to a DAG of stages executed in parallel by Dryad: logs → logentries → user → accesses → htmAccesses → output.
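For intuition, the log-analysis pipeline above behaves like an ordinary in-memory pipeline of filter, parse, group, count, and sort steps. A minimal Python sketch (the two-field log-line format and the toy data are assumptions for illustration, not part of the slides' dataset):

```python
from collections import Counter

# Toy stand-in for the slides' LogEntry type: each line is "user page".
def parse(line):
    user, page = line.split()
    return {"user": user, "page": page}

logs = ["# comment", "jvert index.htm", "jvert index.htm",
        "jvert about.htm", "alice index.htm"]

entries = [parse(l) for l in logs if not l.startswith("#")]   # drop comments, parse
user = [e for e in entries if e["user"].endswith("vert")]     # keep jvert's accesses
accesses = Counter(e["page"] for e in user)                   # group by page, count
htm = sorted(((p, c) for p, c in accesses.items() if p.endswith(".htm")),
             key=lambda pc: -pc[1])                           # order by count, descending
```

Each assignment corresponds to one stage of the DAG; DryadLINQ runs the same logical stages distributed across partitions.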
19. Query plan generation
Separation of the query from its execution context: add all loaded assemblies as resources; eliminate references to local variables by partially evaluating all expressions in the query; distribute objects used by the query; detect impure queries when possible
Automatic code generation: object serialization code for Dryad channels; managed code for Dryad vertices
Static query plan optimizations: pipelining (composing multiple operators into one vertex); minimizing unnecessary data repartitions; other standard DB optimizations
20. DryadLINQ query plan
Query 0
Output: file://hpcmetahn01Cutput7e651a4-38b7-490c-8399-f63eaba7f29a.pt
DryadLinq0.dll was built successfully.
Input: [PartitionedTable: file://weblogs.pt]
Super__1:
  Where(line => !(line.StartsWith(_)))
  Select(line => new logdemo.LogEntry(line))
  Where(access => access.user.EndsWith(_))
  DryadGroupBy(access => access.page, (k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
  DryadHashPartition(e => e.Key, e => e.Key)
Super__12:
  DryadMerge()
  DryadGroupBy(e => e.Key, e => e.Value, (k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
  Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
21. XML representation (generated by DryadLINQ and passed to Dryad)
<Query>
  <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
  <ClusterName>hpcmetahn01</ClusterName>
  ...
  <Resources>
    <Resource>wrappernativeinfo.dll</Resource>
    <Resource>DryadLinq0.dll</Resource>
    <Resource>System.Threading.dll</Resource>
    <Resource>logdemo.exe</Resource>
    <Resource>LinqToDryad.dll</Resource>
  </Resources>
  <QueryPlan>
    <Vertex>
      <UniqueId>0</UniqueId>
      <Type>InputTable</Type>
      <Name>weblogs.pt</Name>
      ...
    </Vertex>
    <Vertex>
      <UniqueId>1</UniqueId>
      <Type>Super</Type>
      <Name>Super__1</Name>
      ...
      <Children>
        <Child>
          <UniqueId>0</UniqueId>
        </Child>
      </Children>
    </Vertex>
    ...
  </QueryPlan>
</Query>
The Resources element lists the files to be shipped to the cluster; the QueryPlan element holds the vertex definitions.
22. DryadLINQ generated code (compiled at runtime; the assembly is passed to Dryad to implement the vertices)
public sealed class DryadLinq__Vertex
{
  public static int Super__1(string args)
  {
    < . . . >
    DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
    var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
    var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
    var source__4 = DryadLinqVertex.DryadWhere(dreader__3, line => (!(line.StartsWith(@"#"))), true);
    var source__5 = DryadLinqVertex.DryadSelect(source__4, line => new logdemo.LogEntry(line), true);
    var source__6 = DryadLinqVertex.DryadWhere(source__5, access => access.user.EndsWith(@"vert"), true);
    var source__7 = DryadLinqVertex.DryadGroupBy(source__6, access => access.page,
        (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
        null, true, true, false);
    DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
    DryadLinqLog.Add("Vertex Super__1 completed at {0}", DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
    return 0;
  }
  public static int Super__12(string args) { < . . . > }
}
23. DryadLINQ query operators
Almost all the useful LINQ operators: Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
Operators introduced by DryadLINQ: HashPartition, RangePartition, Merge, Fork, and Apply (which operates on sequences rather than items)
24. MapReduce in DryadLINQ
MapReduce(source,      // sequence of Ts
          mapper,      // T -> Ms
          keySelector, // M -> K
          reducer)     // (K, Ms) -> Rs
{
  var map = source.SelectMany(mapper);
  var group = map.GroupBy(keySelector);
  var result = group.SelectMany(reducer);
  return result;       // sequence of Rs
}
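The slide's point is that MapReduce is just the composition of a flat-map, a group-by-key, and another flat-map. A hedged Python sketch of the same composition (function and variable names are illustrative, not DryadLINQ API):

```python
from itertools import chain
from collections import defaultdict

def map_reduce(source, mapper, key_selector, reducer):
    """Flat-map, group by key, flat-map again: the structure of the
    MapReduce helper on the slide."""
    mapped = chain.from_iterable(mapper(t) for t in source)  # SelectMany(mapper)
    groups = defaultdict(list)
    for m in mapped:                                         # GroupBy(keySelector)
        groups[key_selector(m)].append(m)
    # SelectMany(reducer) over the (key, group) pairs
    return list(chain.from_iterable(reducer(k, ms) for k, ms in groups.items()))

# Word count as the classic instance of the pattern:
lines = ["a b a", "b c"]
counts = map_reduce(lines,
                    mapper=lambda line: line.split(),
                    key_selector=lambda w: w,
                    reducer=lambda k, ws: [(k, len(ws))])
```

Because the pattern is just three operators, DryadLINQ can optimize it like any other query rather than treating it as a special framework.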
25. K-means in DryadLINQ
public class Vector
{
  public double[] entries;
  [Associative]
  public static Vector operator +(Vector v1, Vector v2) { … }
  public static Vector operator -(Vector v1, Vector v2) { … }
  public double Norm2() { … }
}
public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
{
  return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
}
public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
{
  return vectors.GroupBy(point => NearestCenter(point, centers))
                .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
}
var vectors = PartitionedTable.Get<Vector>("vectors.pt");
IQueryable<Vector> centers = vectors.Take(100);
for (int i = 0; i < 10; i++)
{
  centers = Step(vectors, centers);
}
centers.ToPartitionedTable<Vector>("centers.pt");
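The Step function above is one Lloyd-style k-means iteration: assign each point to its nearest center, then average each group. A Python sketch of the same step for intuition (plain lists of floats stand in for the slide's Vector type; the tiny 1-D dataset is an assumption for illustration):

```python
def nearest_center(v, centers):
    # Pick the center minimizing squared Euclidean distance, as in NearestCenter.
    return min(centers, key=lambda c: sum((ci - vi) ** 2 for ci, vi in zip(c, v)))

def step(vectors, centers):
    # Group points by their nearest center, then average each group.
    groups = {}
    for v in vectors:
        groups.setdefault(tuple(nearest_center(v, centers)), []).append(v)
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]

vectors = [[0.0], [1.0], [9.0], [10.0]]
centers = [[0.0], [10.0]]
for _ in range(10):          # same fixed iteration count as the slide
    centers = step(vectors, centers)
```

In DryadLINQ the grouping and averaging run distributed over the partitioned table; the [Associative] annotation on operator + is what lets the sums be computed partially on each partition.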
26. Putting it all together: It's LINQ all the way down
Major League Baseball dataset: pitch-by-pitch data for every MLB game since 2007
47,909 pitch XML files (one for each pitcher appearance); 6,127 player XML files (one for each player)
Hash-partition the input data files to distribute the work; LINQ to XML to shred the data; DryadLINQ to analyze the dataset
27. Load the dataset and partition (Pitch and Player classes defined elsewhere)
void StagePitchData(string[] fileList, string partitionedFile)
{
  // partition the list of filenames across 20 nodes of the cluster
  var pitches = fileList.ToPartitionedTable("filelist")
      .HashPartition((x) => (x), 20)
      .SelectMany((f) => XElement.Load(f).Elements("atbat"))
      .SelectMany((a) => a.Elements("pitch")
          .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                   (string)a.Attribute("batter"), p)));
  pitches.ToPartitionedTable(partitionedFile);
}
void StagePlayerData(string[] fileList, string partitionedFile)
{
  var players = fileList.Select((p) => new Player(XElement.Load(p)));
  players.ToPartitionedTable(partitionedFile);
}
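The nested SelectMany calls above "shred" each XML file into one flat record per pitch, carrying down the pitcher and batter attributes from the enclosing atbat element. A Python sketch of that shredding on a toy document (the element and attribute names follow the slide; the document structure and values are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

doc = """<game>
  <atbat pitcher="p1" batter="b1"><pitch type="FF"/><pitch type="SL"/></atbat>
  <atbat pitcher="p2" batter="b2"><pitch type="CH"/></atbat>
</game>"""

root = ET.fromstring(doc)
# Flatten <atbat>/<pitch> nesting into (pitcher, batter, pitch-type) records,
# mirroring the nested SelectMany in StagePitchData.
pitches = [(a.get("pitcher"), a.get("batter"), p.get("type"))
           for a in root.iter("atbat") for p in a.iter("pitch")]
```

In the slide, each record becomes a Pitch object and the results are written back out as a PartitionedTable for the analysis queries.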
30. DryadLINQ on HPC Server
The DryadLINQ program runs on a client workstation: develop, debug, and run locally
When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
HPC Server allocates resources for the job and schedules a single task: the Dryad Job Manager (JM)
The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
When the job completes, the client program picks up the output result and continues
31. Examples of DryadLINQ Applications
Data mining: analysis of service logs for network security; analysis of Windows Watson/SQM data; cluster monitoring and performance analysis
Graph analysis: accelerated PageRank computation; road-network shortest-path preprocessing
Image processing: image indexing; decision-tree training; epitome computation
Simulation: light-flow simulations for next-generation display research; Monte Carlo simulations for mobile data
eScience: machine-learning platform for health solutions; astrophysics simulation
32. Ongoing Work
Advanced query optimizations: combination of static analysis and annotations; sampling execution of the query plan; dynamic query optimization
Incremental computation; real-time event processing
Global scheduling: dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
Scale-out partitioned storage: pluggable storage providers
DryadLINQ on Azure
Better debugging, performance analysis, visualization, etc.
33. Additional Resources
Dryad and DryadLINQ: http://connect.microsoft.com/DryadLINQ (DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.)
PLINQ: available in Parallel Extensions to .NET Framework 3.5 CTP and in .NET Framework 4.0 Beta 2
http://msdn.microsoft.com/en-us/concurrency/default.aspx
http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
Windows HPC Server 2008: http://www.microsoft.com/hpc (download it, try it, we want your feedback!)
35. YOUR FEEDBACK IS IMPORTANT TO US! Please fill out session evaluation forms online at MicrosoftPDC.com