Apache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google`s Pregel. Many recent advances have left Giraph more robust, efficient, fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot, while paving the way for ports to other platforms like Apache Mesos. Come see whats on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets.
Slides for the tutorial about Apache Giraph for the Data Mining class.
Sapienza, University of Rome.
Master of Science in Engineering in Computer Science
Prof. A. Anagnostopoulos, I. Chatzigiannakis, A. Gionis
Data Mining class
Fall 2016
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
The genesis of Hadoop was in analyzing massive amounts of data with a mapreduce framework. SQL-onHadoop has followed shortly after that, paving a way to the whole schema-on-read notion. Discovering graph relationship in your data is the next logical step. Apache Giraph (modeled on Google’s Pregel) lets you apply the power of BSP approach to the unstructured data. In this talk we will focus on practical advice of how to get up and running with Apache Giraph, start analyzing simple data sets with built-in algorithms and finally how to implement your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and will also provide a whirlwind tour of Giraph architecture.
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache GiraphAvery Ching
(Abstract from Strata talk)
http://strataconf.com/strata2014/public/schedule/detail/32137
Graph analytics have applications beyond large web scale organizations. Many computing problems can be efficiently expressed and processed as a graph and can lead to useful insights that drive product and business decisions
While you can express graph algorithms as SQL queries in Hive or Hadoop MapReduce programs, an API designed specifically for graph processing makes writing many iterative graph computations (such as page rank, connected components, label propagation, graph-based clustering, etc.) easy to express in simpler and easier to understand code. Apache Giraph provides such a native graph processing API, runs on existing Hadoop infrastructure and can directly access HDFS and/or Hive tables.
This talk describes our efforts at Facebook to scale Apache Giraph to very large graphs of up to one trillion edges and how we run Apache Giraph in production. We will also talk about several algorithms that we have implemented and their use cases.
Slides for the tutorial about Apache Giraph for the Data Mining class.
Sapienza, University of Rome.
Master of Science in Engineering in Computer Science
Prof. A. Anagnostopoulos, I. Chatzigiannakis, A. Gionis
Data Mining class
Fall 2016
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
The genesis of Hadoop was in analyzing massive amounts of data with a mapreduce framework. SQL-onHadoop has followed shortly after that, paving a way to the whole schema-on-read notion. Discovering graph relationship in your data is the next logical step. Apache Giraph (modeled on Google’s Pregel) lets you apply the power of BSP approach to the unstructured data. In this talk we will focus on practical advice of how to get up and running with Apache Giraph, start analyzing simple data sets with built-in algorithms and finally how to implement your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and will also provide a whirlwind tour of Giraph architecture.
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache GiraphAvery Ching
(Abstract from Strata talk)
http://strataconf.com/strata2014/public/schedule/detail/32137
Graph analytics have applications beyond large web scale organizations. Many computing problems can be efficiently expressed and processed as a graph and can lead to useful insights that drive product and business decisions
While you can express graph algorithms as SQL queries in Hive or Hadoop MapReduce programs, an API designed specifically for graph processing makes writing many iterative graph computations (such as page rank, connected components, label propagation, graph-based clustering, etc.) easy to express in simpler and easier to understand code. Apache Giraph provides such a native graph processing API, runs on existing Hadoop infrastructure and can directly access HDFS and/or Hive tables.
This talk describes our efforts at Facebook to scale Apache Giraph to very large graphs of up to one trillion edges and how we run Apache Giraph in production. We will also talk about several algorithms that we have implemented and their use cases.
Apache Spark: killer or savior of Apache Hadoop?rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various way of crunching the data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools compliment each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides the enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available now is truly the most exciting time to be a developer in a Hadoop ecosystem. It is also the time when you don't have to be employed by Yahoo!, Facebook or EBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away from anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
Apache con big data 2015 - Data Science from the trenchesVinay Shukla
ApacheBigData - Budapest, 2015
Data Science from the trenches
What are the issues?
How to select best algorithm?
How to tune?
What are the problems with visualization?
How does Zeppelin help
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
Apache Kylin is a distributed OLAP engine on Hadoop, which provides sub-second level query latency over datasets scaling to petabytes. Kylin’s superior query performance relies on pre-calculated multi-dimension Cube, which is often time-consuming to build. By default, Kylin uses MapReduce Cube Engine built atop of Hadoop MapReduce framework to aggregate huge amounts of source data. The MR Engine has been well-tuned over years and proven to be stable in hundreds of production deployments. Recently, the Kylin team is trying to further speed up the process of cube building by replacing MR with Spark. Kyligence has initiated the new Spark Cube Engine with some benchmarks between Spark and MR over different datasets, and has received some promising results. Hear about their results and experiences on moving Cube building, which is a huge computing task, to Spark.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
Alexey Zinoviev presented this paper on the Highload++ conference http://www.highload.ru/2014/abstracts/1516.html
This paper covers next topics: Pregel, Graph Theory, Giraph, Okapi, GraphX, GraphChi, Spark, Shrotest Path Problem, Road Network, Road Graph
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
The refactoring of Hadoop MapReduce framework, by separating resource management (YARN) from job execution (MapReduce) has allowed multiple programming paradigms to take advantage of the massive scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI to use YARN as a resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster, and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of the supported communication libraries, and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics, comparing GraphLab with other machine frameworks.
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCJosh Baer
Slides from a presentation given by Alison Gilles and Josh Baer during StrataNYC 2017.
Covers the decision, challenge and strategy (technical, organizational, people) for converting Spotify's 2500 node Hadoop cluster's worth of data and processing to Google Cloud.
Finally, touches on Spotify's resulting infrastructure on GCP.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Apache Spark: killer or savior of Apache Hadoop?rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various way of crunching the data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools compliment each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides the enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available now is truly the most exciting time to be a developer in a Hadoop ecosystem. It is also the time when you don't have to be employed by Yahoo!, Facebook or EBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away from anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
Apache con big data 2015 - Data Science from the trenchesVinay Shukla
ApacheBigData - Budapest, 2015
Data Science from the trenches
What are the issues?
How to select best algorithm?
How to tune?
What are the problems with visualization?
How does Zeppelin help
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
Apache Kylin is a distributed OLAP engine on Hadoop, which provides sub-second level query latency over datasets scaling to petabytes. Kylin’s superior query performance relies on pre-calculated multi-dimension Cube, which is often time-consuming to build. By default, Kylin uses MapReduce Cube Engine built atop of Hadoop MapReduce framework to aggregate huge amounts of source data. The MR Engine has been well-tuned over years and proven to be stable in hundreds of production deployments. Recently, the Kylin team is trying to further speed up the process of cube building by replacing MR with Spark. Kyligence has initiated the new Spark Cube Engine with some benchmarks between Spark and MR over different datasets, and has received some promising results. Hear about their results and experiences on moving Cube building, which is a huge computing task, to Spark.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
Alexey Zinoviev presented this paper on the Highload++ conference http://www.highload.ru/2014/abstracts/1516.html
This paper covers next topics: Pregel, Graph Theory, Giraph, Okapi, GraphX, GraphChi, Spark, Shrotest Path Problem, Road Network, Road Graph
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
The refactoring of Hadoop MapReduce framework, by separating resource management (YARN) from job execution (MapReduce) has allowed multiple programming paradigms to take advantage of the massive scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI to use YARN as a resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster, and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of the supported communication libraries, and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics, comparing GraphLab with other machine frameworks.
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCJosh Baer
Slides from a presentation given by Alison Gilles and Josh Baer during StrataNYC 2017.
Covers the decision, challenge and strategy (technical, organizational, people) for converting Spotify's 2500 node Hadoop cluster's worth of data and processing to Google Cloud.
Finally, touches on Spotify's resulting infrastructure on GCP.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Graph Sample and Hold: A Framework for Big Graph AnalyticsNesreen K. Ahmed
Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs(e.g. web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy.While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g. triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we pro-pose a generic stream sampling framework for big-graph analytics,called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory. We use a Horvitz-Thompson construction in conjunction with a scheme that samples arriving edges without adjacencies to previously sampled edges with probability p and holds edges with adjacencies with probability q. Our sample and hold framework facilitates the accurate estimation of subgraph patterns by enabling the dependence of the sampling process to vary based on previous history. Within our framework, we show how to produce statistically unbiased estimators for various graph properties from the sample. Given that the graph analytic swill run on a sample instead of the whole population, the runtime complexity is kept under control. Moreover, given that the estimators are unbiased, the approximation error is also kept under control.
PayPal prvoides an online transfer money network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender/receiver/transaction. In this talk, we will describe a novel network inference approach to calculate transaction risk score that also includes the risk profile of neighboring senders and receivers using Apache Giraph. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges.
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
These are the slide for the talk given at Codemotion Milan on november 2015. The source code shown is available at https://github.com/andreaiacono/TalkGraphX .
Dec'2013 webinar from the EUCLID project on managing large volumes of Linked Data
webinar recording at https://vimeo.com/84126769 and https://vimeo.com/84126770
more info on EUCLID: http://euclid-project.eu/
A VERY high level over view of Graph Analytics concepts and techniques, including structural analytics, Connectivity Analytics, Community Analytics, Path Analytics, as well as Pattern Matching
Quick introduction to community detection.
Structural properties of real world networks, definition of "communities", fundamental techniques and evaluation measures.
Applying large scale text analytics with graph databasesData Ninja API
Data Ninja Services collaborated with Oracle to reach a major milestone in the integration of text analytics with Oracle Spatial and Graph. The Data Ninja Services client in Java can be used to analyze free texts, extract entities, generate RDF semantic graphs, and choose from a number of graph analytics to infer entity relationships. We demonstrated two case studies involving mining health news and detecting anomalies in product reviews.
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
GraphX is a graph processing framework built into Apache Spark. This talk introduces GraphX, describes key features of its API, and gives an update on its status.
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. In Lambda architecture, the system involves three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries, and each comes with its own set of requirements.
In batch layer, it aims at perfect accuracy by being able to process the all available big dataset which is an immutable, append-only set of raw data using distributed processing system. Output will be typically stored in a read-only database with result completely replacing existing precomputed views. Apache Hadoop, Pig, and HIVE are
the de facto batch-processing system.
In speed layer, the data is processed in streaming fashion, and the real-time views are provided by the most recent data. As a result, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate as the views provided by batch layer's views created with full dataset, so they will be eventually replaced by the batch layer's views. Traditionally, Apache Storm is
used in this layer.
In serving layer, the result from batch layer and speed layer will be stored here, and it responds to queries in a low-latency and ad-hoc way.
One of the lambda architecture examples in machine learning context is building the fraud detection system. In speed layer, the incoming streaming data can be used for online learning to update the model learnt in batch layer to incorporate the recent events. After a while, the model can be rebuilt using the full dataset.
Why Spark for lambda architecture? Traditionally, different
technologies are used in batch layer and speed layer. If your batch system is implemented with Apache Pig, and your speed layer is implemented with Apache Storm, you have to write and maintain the same logics in SQL and in Java/Scala. This will very quickly becomes a maintenance nightmare. With Spark, we have an unified development framework for batch and speed layer at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will
discuss about the development, testing, maintenance, and deployment of lambda architecture system with Apache Spark.
we will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark Programming. Then we will move to know the Spark History. Moreover, we will learn why Spark is needed. Afterward, will cover all fundamental of Spark components. Furthermore, we will learn about Spark’s core abstraction and Spark RDD. For more detailed insights, we will also cover spark features, Spark limitations, and Spark Use cases.
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
This talk gives an introduction into Hadoop 2 and YARN. Then the changes for MapReduce 2 are explained. Finally Tez and Spark are explained and compared in detail.
The talk has been held on the Parallel 2014 conference in Karlsruhe, Germany on 06.05.2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
Guagua an iterative computing framework on hadooppengshanzhang
Guagua, a sub-project of Shifu, is a distributed, pluggable and scalable iterative computing framework based on Hadoop MapReduce and YARN.
Typical use cases for Guagua are distributed machine learning model training based on Hadoop. By using Guagua, we implement distributed neural network algorithm which can reduce model training time from days to hours on 500GB data sets.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was done at #doppa17 DevOps++ Global Summit 2017. All the copyrights are reserved with the author
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...SCAPE Project
Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
June 2015 Berlin Buzzwords Presentation
http://berlinbuzzwords.de/file/bbuzz-2015-szehon-ho-hive-spark
https://berlinbuzzwords.de/session/hive-spark
Speaker Interview:
https://berlinbuzzwords.de/news/speaker-interview-szehon-ho
Similar to Fast, Scalable Graph Processing: Apache Giraph on YARN (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables such as trips that are updatable. These tables are fundamental part of Uber's data-driven solutions, and act as the source-of-truth for all the analytical use-cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information of the data layout, and annotates each incoming change with the location in HDFS where this data should be written. This component is called as Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. This component will need to have strong consistency and provide large throughputs for index writes and reads.
At Uber, we have chosen HBase to be the backing store for the Global Indexing component and is a critical component in allowing us to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound more on why we built the global index using Apache Hbase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load Hfiles directly to the backend circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
3. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Eli is...
• Apache Giraph Committer and PMC Member
• Apache Tajo Committer
• Wrote initial port of Giraph to YARN
• Collaborating with fellow Giraph committers on
Giraph in Action book for Manning publishing
4. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Eli is...
• Only able to do all this with the support of:
5. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Eli is a software engineer at
6. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Etsy enables non-technical folks to sell
handmade and vintage stuff:
We have a great blog called Code As Craft:
7. Fast, Scalable Graph Processing:
Apache Giraph on YARN
...but, enough about me, lets talk Giraph!
8. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Key Topics
What is Apache Giraph?
Why do I need it?
Giraph + MapReduce
Giraph + YARN
Giraph Roadmap
9. Fast, Scalable Graph Processing:
Apache Giraph on YARN
What is Apache Giraph?
Giraph is a framework for performing offline
batch processing of semi-structured graph
data on a massive scale.
Giraph is loosely based upon Google's Pregel
graph processing framework.
10. Fast, Scalable Graph Processing:
Apache Giraph on YARN
What is Apache Giraph?
Giraph performs iterative calculations on top of an
existing Hadoop cluster.
11. Fast, Scalable Graph Processing:
Apache Giraph on YARN
What is Apache Giraph?
Giraph uses Apache Zookeeper to enforce atomic
barrier waits and perform leader election.
Done! Done! ...Still
working...
12. Fast, Scalable Graph Processing:
Apache Giraph on YARN
What is Apache Giraph?
Giraph benefits from a vibrant Apache community, and is
under active development:
13. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Why do I need it?
Giraph makes graph algorithms easy to reason about
and implement by following the Bulk Synchronous
Parallel (BSP) programming model.
In BSP, all algorithms are implemented from the point
of view of a single vertex in the input graph
performing a single iteration of the computation.
14. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Why do I need it?
• Giraph makes iterative data processing more
practical for Hadoop users.
• Giraph can avoid costly disk and network
operations that are mandatory in MR.
• No concept of message passing in MR.
15. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Why do I need it?
Each cycle of an iterative calculation on
Hadoop means running a full MapReduce
job.
16. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Let's use simple PageRank as a quick
example:
http://en.wikipedia.org/wiki/PageRank
1.0
1.0
1.0
17. Fast, Scalable Graph Processing:
Apache Giraph on YARN
1. All vertices start with same PageRank
1.0
1.0
1.0
18. Fast, Scalable Graph Processing:
Apache Giraph on YARN
2. Each vertex distributes an equal portion of
its PageRank to all neighbors:
0.5
0.5
1
1
19. Fast, Scalable Graph Processing:
Apache Giraph on YARN
3. Each vertex sums incoming values times a
weight factor and adds in small adjustment:
1/(# vertices in graph)
(.5*.85) + (.15/3)
(1.5*.85) + (.15/3)
(1*.85) + (.15/3)
20. Fast, Scalable Graph Processing:
Apache Giraph on YARN
4. This value becomes the vertices' PageRank
for the next iteration
.43
.21
.64
21. Fast, Scalable Graph Processing:
Apache Giraph on YARN
5. Repeat until convergence:
(change in PR per-iteration < epsilon)
22. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Vertices with more in-degrees converge to higher
PageRank
24. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
1. Load complete input graph from disk as
[K= Vertex ID, V = out-edges and PR]
Map Sort/Shuffle Reduce
25. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
2. Emit all input records (full graph state),
Emit [K = edgeTarget, V = share of PR]
Map Sort/Shuffle Reduce
26. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
3. Sort and Shuffle this entire mess!
Map Sort/Shuffle Reduce
27. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
4. Sum incoming PR shares for each vertex,
update PR values in graph state records
Map Sort/Shuffle Reduce
28. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
5. Emit full graph state to disk...
Map Sort/Shuffle Reduce
29. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
6. ...and start over!
Map Sort/Shuffle Reduce
30. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on MapReduce
• Awkward to reason about
• I/O bound despite simple core business logic
Map Sort/Shuffle Reduce
31. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on Giraph
1. Hadoop Mappers are "hijacked" to host
Giraph master and worker tasks.
Map Sort/Shuffle Reduce
32. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on Giraph
2. Input graph is loaded once, maintaining
code-data locality when possible.
Map Sort/Shuffle Reduce
33. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on Giraph
3. All iterations are performed on data in memory,
optionally spilled to disk. Disk access is linear/
scan-based.
Map Sort/Shuffle Reduce
34. Fast, Scalable Graph Processing:
Apache Giraph on YARN
PageRank on Giraph
4. Output is written from the Mappers hosting
the calculation, and the job run ends.
Map Sort/Shuffle Reduce
35. Fast, Scalable Graph Processing:
Apache Giraph on YARN
This is all well and good, but must we
manipulate Hadoop this way?
?
36. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Giraph + MapReduce
• Heap and other resources are set once, globally, for all
Mappers in the computation.
• No control of which cluster nodes host which tasks.
• No control over how Mappers are scheduled.
• Mapper and Reducer slots abstraction is meaningless
for Giraph at best, an artificial limit at worst.
37. Fast, Scalable Graph Processing:
Apache Giraph on YARN
YARN
• YARN (Yet Another Resource Negotiator) is Hadoop's
next-gen job management platform.
• Powers MapReduce v2, but is a general purpose
framework that is not tied to the MapReduce paradigm.
• Offers fine-grained control over each task's resource
allocations and host placement for clients that need it.
39. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Giraph + YARN
Its a natural fit!
40. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Giraph + YARN
• Giraph has maintained compatibility with Hadoop since
0.1 release by executing via MapReduce interface.
• Giraph has featured a "pure YARN" build profile since
1.0 release. It supports Hadoop-2.0.3 and trunk.
*Patches to add 2.0.4 and 2.0.5 support are in review :)
• Giraph's YARN component is easy to extend or use as
a template to port other projects!
41. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Giraph + YARN: Roadmap
• YARN Application Master allows for more natural and
stable bootstrapping of Giraph jobs.
• Zookeeper management can find natural home in
Application Master.
• Giraph on YARN can stop borrowing from Hadoop and
have its own web interface.
42. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Giraph + YARN: Roadmap
• Variable per-task resource allocation opens up the
possibility of Supertasks to manage graph supernodes.
• Ability to spawn or retire tasks per-iteration enables in-
flight reassignment of data partitions.
• AppMaster managed utility tasks such as dedicated
sub-aggregators for tree-like aggregation, or data pre-
samplers.
43. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Giraph New Developments
• Decoupling of logic and graph data means tasks host
computations that are pluggable per-iteration.
• Support for Giraph job scripting, starting with Jython.
More to follow...
• New website, fresh docs, upcoming Manning book, and
large, active community means Giraph has never been
easier to use or contribute to!
44. Fast, Scalable Graph Processing:
Apache Giraph on YARN
Great! Where can I learn more?
http://giraph.apache.org
Mailing List:
user@giraph.apache.org