This document discusses batch and stream graph processing with Apache Flink. It provides an overview of distributed graph processing and Flink's graph processing APIs: Gelly for batch graph processing and Gelly-Stream for continuous graph processing on data streams. It describes how Gelly and Gelly-Stream allow processing large and dynamic graphs in a distributed fashion using Flink's dataflow engine.
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed Dataset). One of the key reasons why Apache Spark is so different is the introduction of RDDs. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high-level introduction to RDDs, and in the second half we will have a deep dive into RDDs.
Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter.
Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data.
Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
(A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)
Reactive Streams: Handling Data-Flow the Reactive WayRoland Kuhn
Building on the success of Reactive Extensions—first in Rx.NET and now in RxJava—we are taking Observers and Observables to the next level: by adding the capability of handling back-pressure between asynchronous execution stages we enable the distribution of stream processing across a cluster of potentially thousands of nodes. The project defines the common interfaces for interoperable stream implementations on the JVM and is the result of a collaboration between Twitter, Netflix, Pivotal, RedHat and Typesafe. In this presentation I introduce the guiding principles behind its design and show examples using the actor-based implementation in Akka.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Apache Flink is an open source platform: a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams. Flink is a top-level Apache project. It is a scalable data analytics framework that is fully compatible with Hadoop, and it can execute both stream processing and batch processing easily.
Flink Streaming is the real-time data processing framework of Apache Flink. It provides high-level functional APIs in Scala and Java, backed by a high-performance, true-streaming runtime.
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which is currently undergoing incubation at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming, and automatic optimisation make it a unique system in the world of Big Data processing.
Presentation of the Gradoop Framework at the Flink & Neo4j Meetup in Berlin (http://www.meetup.com/graphdb-berlin/events/228576494/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability and a demo involving Neo4j, Flink and Gradoop (see www.gradoop.com)
Graphs as Streams: Rethinking Graph Processing in the Streaming EraVasia Kalavri
Streaming is the latest hot topic in the big data world. We want to process data immediately and continuously. Modern stream processors have matured significantly and offer exceptional features, including sub-second latencies, high throughput, fault-tolerance, and seamless integration with various data sources and sinks.
Many sources of streaming data consist of related or connected events: user interactions in a social network, web page clicks, movie ratings, product purchases. These connected events can be naturally represented as edges in an evolving graph.
In this talk I will explain how we can leverage a powerful stream processor, such as Apache Flink, and academic research of the past two decades, to build graph streaming applications. I will describe how we can model graphs as streams and how we can compute graph properties without storing and managing the graph state. I will introduce useful graph summary data structures and show how they allow us to build graph algorithms in the streaming model, such as connected components, bipartiteness detection, and distance estimation.
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
Apache Flink is a general-purpose platform for batch and streaming distributed data processing. This talk describes how Flink’s powerful APIs, iterative operators and other unique features make it a competitive alternative for large-scale graph processing as well. We take a close look at how one can elegantly express graph analysis tasks, using common Flink operators and how different graph processing models, like vertex-centric, can be easily mapped to Flink dataflows. Next, we get a sneak preview into Flink's upcoming Graph API, Gelly, which further simplifies graph application development in Flink. Finally, we show how to perform end-to-end data analysis, mixing common Flink operators and Gelly, without having to build complex pipelines and combine different systems. We go through a step-by-step example, demonstrating how to perform loading, transformation, filtering, graph creation and analysis, with a single Flink program.
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is the Apache Flink stack, and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1VhSzmy.
Robert Metzger provides an overview of the Apache Flink internals and its streaming-first philosophy, as well as the programming APIs. Filmed at qconlondon.com.
Robert Metzger is a PMC member at the Apache Flink project and a cofounder and software engineer at data Artisans. He is the author of many Flink components including the Kafka and YARN connectors.
Presentation of the Gradoop Framework at the Graph Database Meetup in Munich (https://www.meetup.com/inovex-munich/events/231187528/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability (see www.gradoop.com)
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Till Rohrmann
How to scale recommendations to extremely large scale using Apache Flink. We use matrix factorization to calculate a latent factor model which can be used for collaborative filtering. The implemented alternating least squares algorithm is able to deal with data sizes on the scale of Netflix.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
Even if your batch-oriented Hadoop approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk Synchronous Parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook) and GraphX (part of Apache Spark).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Processing large-scale graphs with Google(TM) PregelArangoDB Database
Many popular graph databases are optimized to run on a single machine, using efficient traversals to query the stored graphs. This boosts the performance of algorithms that originate at a single vertex and iterate through the graph, e.g. finding shortest paths or neighbors. However, graphs are getting bigger, and traversals perform poorly if they require a large depth. If you need to distribute a large-scale graph across several machines, traversals won't be the best choice performance-wise. Therefore Google has released its Pregel framework, offering an environment to query distributed graphs; Pregel is also known as the map-reduce for graphs. In this talk I want to present the architecture and requirements of the Pregel framework and introduce you to the different mind-set required to write a Pregel algorithm. Furthermore I will give a short introduction to three implementations of Pregel: Giraph, TinkerPop3 and ArangoDB.
Frank Celler – Processing large-scale graphs with Google(TM) Pregel - NoSQL m...NoSQLmatters
Frank Celler – Processing large-scale graphs with Google(TM) Pregel
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses by real-time faculty in Bangalore.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...Data Con LA
This presentation will explore how Bloomberg uses Spark, with its formidable computational model for distributed, high-performance analytics, to take this process to the next level, and look into one of the innovative practices the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.
Big Data Scala by the Bay: Interactive Spark in your Browsergethue
Supporting running Spark scripts directly from a browser would improve the user experience. Indeed, everybody has a web browser, the command line can be avoided, and built-in graphing and visualization make it easy to explore and understand data with just a few clicks. This also simplifies administration, as everything becomes centralized in a service and is accessible by non-native clients. For this purpose, an open source Spark Job Server was developed in order to provide Scala, SQL and Python in a web shell. The main Hadoop components of the platform are also integrated in the same interface. This talk describes the architecture of the Spark Server and its main features: Scala, Python, and SQL submissions; impersonation; security; job progress/canceling; and YARN/HDFS/Hive integration. The server also ships with a friendly user interface built as a Hue app. We will focus on explaining how they were built, how to use the API and which lessons were learned. The final end-user interaction will be live demoed.
Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at...Big Data Spain
This talk will give a good overview over the complex architecture of the Pregel framework and will give some insights where there are potential bottlenecks when writing a Pregel algorithm.
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant family of algorithms – deep learning – remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and also because of the inherent difficulty in tuning and configuring.
In this session, you’ll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside of Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You’ll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing.
From data stream management to distributed dataflows and beyondVasia Kalavri
Recent efforts by academia and open-source communities have established stream processing as a principal data analysis technology across industry. All major cloud vendors offer streaming dataflow pipelines and online analytics as managed services. Notable use-cases include real-time fault detection in space networks, city traffic management, dynamic pricing for car-sharing, and anomaly detection in financial transactions. At the same time, streaming dataflow systems are increasingly being used for event-driven applications beyond analytics, such as orchestrating microservices and model serving. In the past decades, streaming technology has evolved significantly, however, emerging applications are once more challenging the design decisions of modern streaming systems. In this talk, I will discuss the evolution of stream processing and bring current trends and open problems to the attention of our community.
Self-managed and automatically reconfigurable stream processingVasia Kalavri
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are being extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use-cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions -when and how much to scale- is currently placed on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI’18), an automatic scaling controller which identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and future challenges in this area.
Predictive Datacenter Analytics with StrymonVasia Kalavri
A modern enterprise datacenter is a complex, multi-layered system whose components often interact in unpredictable ways. Yet, to keep operational costs low and maximize efficiency, we would like to foresee the impact of changing workloads, updating configurations, modifying policies, or deploying new services.
In this talk, I will share our research group’s ongoing work on Strymon: a system for predicting datacenter behavior in hypothetical scenarios using queryable online simulation. Strymon leverages existing logging and monitoring pipelines of modern production datacenters to ingest cross-layer events in a streaming fashion and predict possible effects of such events in what-if scenarios. Predictions are made online by simulating the hypothetical datacenter state alongside the real one. Driven by a real-use case from our industrial partners, I will highlight the challenges we are facing in building Strymon to support a diverse set of data representations, input sources, query languages, and execution models.
Finally, I will share our initial design decisions and give an overview of Timely Dataflow; a high-performance distributed streaming engine and our platform of choice for Strymon’s core implementation.
Online performance analysis of distributed dataflow systems (O'Reilly Velocit...Vasia Kalavri
Understanding the performance of distributed dataflow systems like Apache Spark, Apache Flink, and TensorFlow is hard. Parallel computation is interleaved with data and control communication, and execution dependencies typically span multiple system components. In such environments, bottleneck detection is cumbersome and currently relies heavily on humans. After decades of systems research, state-of-the-art performance analysis techniques are commonly based on offline trace processing and thus are only suitable for batch computations and postmortem reports.
Vasia Kalavri offers an overview of Strymon, a system for predictive data center analytics, and its online critical path analysis module. Strymon analyzes live traces from distributed dataflow systems to predict bottlenecks and provide insights on streaming application performance—leveraging logging and monitoring pipelines of modern production data centers to ingest cross-layer events in a streaming fashion and predict possible effects of such events in what-if scenarios.
dotScale 2016 presentation
Writing distributed graph applications is inherently hard. In this talk, Vasia gives an overview of high-level programming models and platforms for distributed graph processing. She exposes and discusses common misconceptions, shares lessons learnt, and suggests best practices.
This is the "Deep Dive" talk given at the first Apache Flink Meetup Stockholm. The talk describes three components of the Apache Flink Internals: (a) job life-cycle, (b) the batch optimizer and (c) native iterations.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Short notes on adjusting primitives for graph algorithms like PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
1. Batch & Stream Graph Processing
with Apache Flink
Vasia Kalavri
vasia@apache.org
@vkalavri
Apache Flink Meetup London
October 5th, 2016
2. Graphs capture relationships between data items:
connections, interactions, purchases, dependencies, friendships, etc.
Applications: recommenders, social networks, bioinformatics, web search
8. NAIVE WHO(M)-TO-FOLLOW
▸ Naive Who(m) to Follow (sketched in code below):
▸ compute a friends-of-friends list per user
▸ exclude existing friends
▸ rank by common connections
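A minimal sketch of this naive scheme using Flink’s Scala DataSet API; this code is not from the slides, and the follows pairs and names are illustrative:
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

object WhoToFollowSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // (follower, followee) pairs
    val follows = env.fromElements((1L, 2L), (2L, 3L), (1L, 4L), (4L, 3L))

    // friends-of-friends: a -> b and b -> c yield the candidate (a, c)
    val candidates = follows
      .join(follows).where(1).equalTo(0) { (ab, bc) => (ab._1, bc._2) }
      .filter(p => p._1 != p._2)

    // exclude users that are already followed
    val newOnly = candidates
      .coGroup(follows).where(0, 1).equalTo(0, 1) {
        (cands: Iterator[(Long, Long)], existing: Iterator[(Long, Long)],
         out: Collector[(Long, Long)]) =>
          if (existing.isEmpty) cands.foreach(out.collect)
      }

    // rank candidates by the number of common connections
    val ranked = newOnly.map(p => (p._1, p._2, 1)).groupBy(0, 1).sum(2)
    ranked.print()
  }
}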
15. WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
▸ When you do have really big graphs
▸ When the intermediate data is big
▸ When your data is already distributed
▸ When you want to build end-to-end graph pipelines
16. HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
20. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
VertexID | Out-degree | Transition Probability
1        | 2          | 1/2
2        | 2          | 1/2
3        | 0          | -
4        | 3          | 1/3
5        | 1          | 1
(figure: the example graph over vertices 1-5)
21. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
(same table and figure)
PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
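To make the update concrete (illustrative numbers, not from the slides): if all five ranks start at 0.2, the transition sum for vertex 3 is 0.5*0.2 + 0.33*0.2 + 1*0.2 ≈ 0.37, and the damped update used on the next slides, setValue(0.15/numVertices() + 0.85*sum), gives 0.15/5 + 0.85*0.37 ≈ 0.34 as the new rank of vertex 3.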
25. PREGEL EXAMPLE: PAGERANK
void compute(messages):
  // sum up received messages
  sum = 0.0
  for (m <- messages) do
    sum = sum + m
  end for
  // update vertex rank
  setValue(0.15/numVertices() + 0.85*sum)
  // distribute rank to neighbors
  for (edge <- getOutEdges()) do
    sendMessageTo(edge.target(), getValue()/numEdges)
  end for
27. SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
  // distribute rank to neighbors
  for (edge <- getOutEdges()) do
    sendMessageTo(edge.target(), getValue()/numEdges)
  end for
void collect(messages):
  // sum up messages
  sum = 0.0
  for (m <- messages) do
    sum = sum + m
  end for
  // update vertex rank
  setValue(0.15/numVertices() + 0.85*sum)
30. PREGEL VS. SIGNAL-COLLECT VS. GSA
Model          | Update Function Properties | Update Function Logic      | Communication Scope | Communication Logic
Pregel         | arbitrary                  | arbitrary                  | any vertex          | arbitrary
Signal-Collect | arbitrary                  | based on received messages | any vertex          | based on vertex state
GSA            | associative & commutative  | based on neighbors’ values | neighborhood        | based on vertex state
31. CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions
34. Meet Gelly
• Java & Scala Graph APIs on top of Flink’s DataSet API
(diagram: Flink Core underneath the Java and Scala APIs (batch and streaming), with FlinkML, Gelly, the Table API, and others on top; Gelly provides transformations and utilities, iterative graph processing, and a graph library)
35. Gelly is NOT
• a graph database
• a specialized graph processor
36. Hello, Gelly!
Java:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
    new ConnectedComponents(maxIterations));
Scala:
val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
38. Example: mapVertices
val graph = Graph.fromDataSet(...)
// increment each vertex value by one
val updatedGraph = graph.mapVertices(v => v.getValue + 1)
(figure: the example graph before and after incrementing the vertex values)
39. Example: subGraph
val graph: Graph[Long, Long, Long] = ...
// keep only vertices with positive values
// and only edges with negative values
val subGraph = graph.subgraph(
vertex => vertex.getValue > 0,
edge => edge.getValue < 0
)
40. Neighborhood Methods
Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel:
graph.reduceOnNeighbors(
  new MinValue, EdgeDirection.OUT)
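The MinValue function itself is not shown on the slide. A hedged sketch, assuming Gelly’s ReduceNeighborsFunction interface and Long-valued vertices:
import org.apache.flink.graph.ReduceNeighborsFunction

// keeps the minimum value seen among a vertex’s selected neighbors;
// Scala’s Predef implicitly boxes the primitive result back to java.lang.Long
final class MinValue extends ReduceNeighborsFunction[java.lang.Long] {
  override def reduceNeighbors(first: java.lang.Long,
                               second: java.lang.Long): java.lang.Long =
    math.min(first, second)
}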
41. What makes Gelly unique?
• Batch graph processing on top of a streaming dataflow engine
• Built for end-to-end analytics
• Support for multiple iteration abstractions
• Graph algorithm building blocks
• A large open-source library of graph algorithms
42. Why streaming dataflow?
• Batch engines materialize data… even if they don’t have to
• the graph is always loaded and materialized in memory, even if not needed, e.g. mapping, filtering, transformation
• Communication and computation overlap
• We can do continuous graph processing (more after the break!)
43. End-to-end analytics
• Graphs don’t appear out of thin air…
• We need to support pre- and post-processing
• Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program
44. Iterative Graph Processing
• Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations
• vertex-centric
• scatter-gather
• gather-sum-apply
• partition-centric*
46. Optimization
• the runtime is aware of the iterative execution
• no scheduling overhead between iterations
• caching and state maintenance are handled automatically
(diagram: push work “out of the loop”, maintain state as index, cache loop-invariant data)
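To make the loop concrete, here is a minimal delta-iteration sketch with Flink’s Scala DataSet API; the code is illustrative and not from the slides. It propagates minimum component ids, and a vertex re-enters the workset only when its value improves, so the work shrinks as the iteration converges:
import org.apache.flink.api.scala._

object DeltaIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val vertices = env.fromElements((1L, 1L), (2L, 2L), (3L, 3L)) // (id, component)
    val edges = env.fromElements((1L, 2L), (2L, 1L), (2L, 3L), (3L, 2L)) // both directions

    val components = vertices.iterateDelta(vertices, 10, Array(0)) {
      (solution, workset) =>
        // offer each neighbor the smallest component id seen so far
        val candidates = workset
          .join(edges).where(0).equalTo(0) { (v, e) => (e._2, v._2) }
          .groupBy(0).min(1)
        // keep only real improvements: they are both the solution delta and the next workset
        val deltas = candidates
          .join(solution).where(0).equalTo(0)
          .flatMap { case ((id, cand), (_, current)) =>
            if (cand < current) Some((id, cand)) else None
          }
        (deltas, deltas)
    }
    components.print()
  }
}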
47. Vertex-Centric SSSP
final class SSSPComputeFunction extends ComputeFunction {
  override def compute(vertex: Vertex, messages: MessageIterator) = {
    var minDistance = if (vertex.getId == srcId) 0 else Double.MaxValue
    while (messages.hasNext) {
      val msg = messages.next
      if (msg < minDistance)
        minDistance = msg
    }
    if (vertex.getValue > minDistance) {
      setNewVertexValue(minDistance)
      // send the new minimum (not the stale vertex value) plus the edge weight
      for (edge <- getEdges)
        sendMessageTo(edge.getTarget, minDistance + edge.getValue)
    }
  }
}
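A hedged usage sketch for running the compute function above (assuming the vertex-centric API of Gelly in Flink 1.x; null skips the optional message combiner, and a Scala call may need explicit type parameters):
val maxIterations = 20
val result = graph.runVertexCentricIteration(new SSSPComputeFunction, null, maxIterations)
val shortestPaths = result.getVertices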
48. Algorithms building blocks
• Allow operator re-use across graph algorithms when processing the same input with a similar configuration
49. Library of Algorithms
• PageRank
• Single Source Shortest Paths
• Label Propagation
• Weakly Connected Components
• Community Detection
• Triangle Count & Enumeration
• Local and Global Clustering Coefficient
• HITS
• Jaccard & Adamic-Adar Similarity
• Graph Summarization
• val ranks = inputGraph.run(new PageRank(0.85, 20))
51. Can’t we block them?
(diagram: a proxy intercepting traffic between users and trackers, an ad server, and a legitimate site)
52. Available Solutions
Crowd-sourced “black lists” of tracker URLs: AdBlock, DoNotTrack, EasyPrivacy
• not frequently updated
• not sure who blacklists URLs, or based on what criteria
• miss “hidden” trackers or dual-role nodes
• blocking requires manual matching against the list
• can you buy your way into the whitelist?
53. DataSet
• 6 months (Nov 2014 - April 2015) of augmented Apache logs from a web proxy
• 80m requests, 2m distinct URLs, 3k users
56. Data Pipeline
raw logs → 1: logs pre-processing → cleaned logs → 2: bipartite graph creation → 3: largest connected component extraction → 4: hosts-projection graph creation → 5: community detection → 6: results
(pre-/post-processing steps use the DataSet API; the graph steps use Gelly)
sample output (T = tracker, NT = non-tracker): google-analytics.com: T, bscored-research.com: T, facebook.com: NT, github.com: NT, cdn.cxense.com: NT, ...
57. Feeling Gelly?
• Gelly Guide
https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html
• To Petascale and Beyond @Flink Forward ‘16
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
• Web Tracker Detection @Flink Forward ’15
https://www.youtube.com/watch?v=ZBCXXiDr3TU
paper: Kalavri, Vasiliki, et al. "Like a pack of wolves: Community structure of web trackers." International Conference on Passive and Active Network Measurement, 2016.
59. Real Graphs are dynamic
Graphs are created from events happening in real-time
61. How we’ve done graph processing so far
1. Load: read the graph from disk and partition it in memory
2. Compute: read and mutate the graph state
3. Store: write the final graph state back to disk
64. What’s wrong with this model?
• It is slow
• wait until the computation is over before you see any result
• pre-processing and partitioning
• It is expensive
• lots of memory and CPU required in order to scale
• It requires re-computation for graph changes
• no efficient way to deal with updates
65. Can we do graph processing on streams?
• Maintain the dynamic graph structure
• Provide up-to-date results with low latency
• Compute on fresh state only
66. Single-pass graph streaming
• Each event is an edge addition
• Maintain only a graph summary
• Recent events are grouped in graph windows
68. Graph Summaries
• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties
(diagram: two algorithms read a graph summary instead of the full graph and produce approximate results R1 and R2)
84. Stream Connected Components with Flink
DataStream<DisjointSet> cc =
  edgeStream
    .keyBy(0) // partition the edge stream
    .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) // define the merging frequency
    .fold(new DisjointSet(), new UpdateCC()) // merge locally (DisjointSet sketched below)
    .flatMap(new Merger()) // merge globally
    .setParallelism(1);
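The DisjointSet summary folded above is a union-find structure. A minimal, illustrative Scala sketch; the UpdateCC and Merger helpers from the slide would build on operations like these:
// merges edge endpoints into components; merge() combines two summaries
class DisjointSet extends Serializable {
  private val parent = scala.collection.mutable.Map[Long, Long]()

  def find(x: Long): Long = {
    val p = parent.getOrElseUpdate(x, x)
    if (p == x) x
    else {
      val root = find(p)
      parent(x) = root // path compression
      root
    }
  }

  def union(a: Long, b: Long): Unit = {
    val (ra, rb) = (find(a), find(b))
    if (ra != rb) parent(rb) = ra
  }

  def merge(other: DisjointSet): Unit =
    other.parent.foreach { case (x, p) => union(x, p) }
}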
90. Introducing Gelly-Stream
Gelly-Stream enriches the DataStream API with two new ADTs:
• GraphStream:
• A representation of a data stream of edges.
• Edges can have state (e.g. weights).
• Supports property streams, transformations and aggregations (construction sketch below).
• GraphWindow:
• A “time-slice” of a graph stream.
• It enables neighborhood aggregations.
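A hedged construction sketch, assuming Gelly-Stream’s SimpleEdgeStream as the concrete GraphStream implementation (class and method names as in the gelly-streaming project; the socket source and parsing are illustrative):
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.graph.Edge
import org.apache.flink.graph.streaming.SimpleEdgeStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.types.NullValue

val env = StreamExecutionEnvironment.getExecutionEnvironment

// parse "src dst" lines into edges as they arrive
val edges = env
  .socketTextStream("localhost", 9999)
  .map(new MapFunction[String, Edge[java.lang.Long, NullValue]] {
    override def map(line: String): Edge[java.lang.Long, NullValue] = {
      val parts = line.split("\\s+")
      new Edge(java.lang.Long.valueOf(parts(0)),
               java.lang.Long.valueOf(parts(1)), NullValue.getInstance)
    }
  })

val graphStream = new SimpleEdgeStream[java.lang.Long, NullValue](edges, env)

// a continuously updated stream of (vertexId, degree) pairs
val degrees = graphStream.getDegrees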
92. Graph Stream Aggregations
(diagram: edges from the graph stream are folded per window into local summaries, the local summaries are combined into a global summary, and a final reduce/transform produces the aggregate property stream; global aggregates can be persistent or transient)
graphStream.aggregate(
  new MyGraphAggregation(window, fold, combine, transform))
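Reading the diagram top to bottom (MyGraphAggregation and its constructor parameters are illustrative): the window fold incrementally folds incoming edges into a local summary per partition, combine merges the local summaries into a global summary, and the final transform/reduce step turns the global summary into the aggregate property stream. As the slide notes, global aggregates can be kept persistent across windows or recomputed per window (transient).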