This document summarizes two graph algorithms for analyzing large graphs: connected components and clustering coefficient.
For connected components, it describes a two-step approach: (1) partition the graph and summarize connectivity on each partition, which reduces the data size, and (2) recombine the summaries to find the overall connected components. The same approach extends to other problems, such as finding minimum spanning trees, by summarizing each partition.
For clustering coefficient, it describes the challenge of enumerating all possible triangles for each node, which generates quadratic intermediate data. A different approach is needed to scale to very high degree nodes with millions of connections.
XXL Graph Algorithms__HadoopSummit2010
1. XXL Graph Algorithms
Sergei Vassilvitskii
Yahoo! Research
With help from Jake Hofman, Siddharth Suri, Cong Yu and many others
3. Introduction
XXL Graphs are everywhere:
– Web graph
– Friend graphs
– Advertising graphs...
But we have Hadoop!
– Few algorithms have been ported (no Hadoop Algorithms book)
– Few general algorithmic approaches
– Active area of research
4. Outline
Today:
– Act 1: Crawl before you walk
• Counting connected components
– Act 2: The curse of the last reducer
• Finding tight knit friend groups
6. Act 1: Connected Components
Given a graph, how many components does it have?
[Figure: example graph on vertices a through h]
Edge records: (b,c) 1, (f,h) 1, (b,d) 1, (a,c) 1, (a,b) 1, (c,d) 1, (c,e) 1, (f,g) 1, (d,e) 1, (d,e) 1, (b,e) 1, (g,h) 1
Data too big to fit on one reducer!
7. CC Overview
Outline for Connected Components
– Partition the input into several chunks (map 1)
– Summarize the connectivity on each chunk (reduce 1)
– Combine all of the (small) summaries (map 2)
– Find the number of connected components
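The four steps above can be sketched on a single machine. This is a minimal illustration, not the speaker's implementation: the function names (`find`, `spanning_forest`, `count_components`), the use of union-find for the per-chunk summary, and the seeded random partition are all assumptions made for the example. The edge list is the one shown on the earlier slide. Vertices with no edges never appear in the input, so they are not counted.

```python
import random
from collections import defaultdict


def find(parent, x):
    """Return the representative of x, registering x on first sight."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def spanning_forest(edges):
    """Summarize a chunk: keep only edges that join two different components."""
    parent = {}
    kept = []
    for u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v))
    return kept


def count_components(edges, num_chunks=2, seed=0):
    """Two-round sketch: partition, summarize each chunk, combine, count."""
    rng = random.Random(seed)
    chunks = defaultdict(list)
    for edge in edges:  # map 1: random partition
        chunks[rng.randrange(num_chunks)].append(edge)
    summaries = [spanning_forest(c) for c in chunks.values()]  # reduce 1
    combined = [e for s in summaries for e in s]  # map 2
    forest = spanning_forest(combined)  # final step on the small input
    vertices = {v for e in edges for v in e}
    # A forest with E edges on V vertices has V - E components.
    return len(vertices) - len(forest)


EDGES = [("b", "c"), ("f", "h"), ("b", "d"), ("a", "c"), ("a", "b"), ("c", "d"),
         ("c", "e"), ("f", "g"), ("d", "e"), ("d", "e"), ("b", "e"), ("g", "h")]

print(count_components(EDGES))  # 2
```

Each summary is a forest, so a chunk never forwards more than (number of vertices - 1) edges, whatever the size of its input.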
9. Connected Components
1. Partition (randomly):
[Figure: the example graph split between two reducers, Reduce 1 and Reduce 2]
10. Connected Components
1. Partition:
2. Summarize (retain < n edges):
[Figure: Reduce 1 and Reduce 2, each summarizing its chunk of the example graph]
12. Connected Components
1. Partition:
2. Summarize:
3. Recombine:
[Figure: the summaries produced by Reduce 1 and Reduce 2]
14. Connected Components
1. Partition:
2. Summarize:
3. Recombine:
[Figure: the recombined example graph in Round 2]
Edge records: (b,c) 1, (f,h) 1, (b,d) 1, (a,c) 1, (a,b) 1, (c,d) 1, (c,e) 1, (f,g) 1, (d,e) 1, (d,e) 1, (b,e) 1, (g,h) 1
15. Connected Components
1. Partition:
2. Summarize:
3. Recombine:
[Figure: the recombined example graph in Round 2]
Edge records: (a,c) 1, (a,b) 1, (c,d) 1, (f,g) 1, (d,e) 1, (g,h) 1
Small enough to fit!
17. Connected Components
The summarization does not affect connectivity
– Drops redundant edges
– Dramatically reduces data size
– Takes two MapReduce rounds
Similar approach works in other situations:
– Consider vertices connected only if there are k edges between them
– Consider vertices connected if their similarity score is above a threshold
• E.g. approximate Jaccard similarity when computing for recommendation systems
– Find minimum spanning trees
• Summarize by computing an MST on the subset graph
– Clustering
• Cluster each partition, then aggregate the clusters
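The minimum spanning tree variant can be sketched the same way as the connected components rounds. This is a single-machine illustration under stated assumptions: vertices are integers 0..n-1, edges are `(weight, u, v)` tuples, and `kruskal` and `two_round_mst` are hypothetical names, not code from the talk. Each chunk is summarized by its own minimum spanning forest, and the final step runs the same routine on the union of the summaries. An edge dropped from a chunk is never cheaper than every other edge on a cycle inside that chunk, so dropping it does not change the weight of the overall result.

```python
import random


def kruskal(num_vertices, edges):
    """Minimum spanning forest of (weight, u, v) edges over vertices 0..n-1."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):  # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def two_round_mst(num_vertices, edges, num_chunks=3, seed=0):
    """Partition the edges, summarize each chunk by its MSF, then combine."""
    rng = random.Random(seed)
    chunks = [[] for _ in range(num_chunks)]
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    summaries = [kruskal(num_vertices, chunk) for chunk in chunks]
    combined = [e for s in summaries for e in s]
    return kruskal(num_vertices, combined)


edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(sum(w for w, _, _ in two_round_mst(4, edges)))  # 7
```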
18. Outline
Today:
– Act 1: Crawl before you walk
• Counting connected components
– Act 2: The curse of the last reducer
• Finding tight knit friend groups
21. Act 2: Clustering Coefficient
Finding tight knit groups of friends
[Figure: two example friend groups, with clustering coefficients 2/15 ≈ 0.13 vs. 8/15 ≈ 0.53]
CC(v) = Fraction of v’s friends who know each other
– Count: number of triangles incident on v
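As a concrete reading of the definition, the sketch below computes CC(v) from an adjacency map. The helper names (`build_adjacency`, `clustering_coefficient`) and the choice to return 0 for nodes with fewer than two friends are assumptions made for illustration. The example mirrors the 2/15 case: a node with six friends, of whom only two pairs know each other.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: no pair of friends means CC is taken as 0
    # Each connected pair of neighbors closes one triangle incident on v.
    triangles = sum(1 for u, w in combinations(sorted(neighbors), 2) if w in adj[u])
    return triangles / (k * (k - 1) / 2)


# Node 0 has six friends (1..6); only the pairs (1,2) and (3,4) know each other.
edges = [(0, i) for i in range(1, 7)] + [(1, 2), (3, 4)]
adj = build_adjacency(edges)
print(round(clustering_coefficient(adj, 0), 2))  # 0.13
```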
24. Finding CC For Each Node
Attempt 1:
– Look at each node
– Enumerate all possible triangles (Pivot)
– Check which of those edges exist:
[Figure: possible edges among the pivot's neighbors ∩ edges of the graph = edges present]
15 edges possible, 2 edges present
27. Finding CC For Each Node
Attempt 1:
– Look at each node
– Enumerate all possible triangles
– Check which of those edges exist
Amount of intermediate data
– Quadratic in the degree of the nodes
– 6 friends: 15 possible triangles
– n friends: n(n-1)/2 possible triangles
There’s always “that guy”:
– tens of thousands of friends
– tens of thousands of movie ratings (really!)
– millions of followers
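To make the quadratic growth concrete, the sketch below emits one candidate record per pair of a node's neighbors, as the pivot step in Attempt 1 would, and evaluates n(n-1)/2 for a few degrees. The names `pivot_records` and `pivot_pair_count` are hypothetical, and the degrees 10,000 and 1,000,000 are illustrative stand-ins for "tens of thousands of friends" and "millions of followers".

```python
from itertools import combinations


def pivot_records(v, neighbors):
    """Emit one candidate record per pair of v's neighbors (the pivot step)."""
    return [((u, w), v) for u, w in combinations(sorted(neighbors), 2)]


def pivot_pair_count(degree):
    """Number of candidate triangles a single pivot of this degree generates."""
    return degree * (degree - 1) // 2


print(len(pivot_records("v", ["a", "b", "c", "d", "e", "f"])))  # 15
print(pivot_pair_count(10_000))      # 49995000
print(pivot_pair_count(1_000_000))   # 499999500000
```

A single node with a million neighbors therefore produces about half a trillion intermediate records, all of which land on one reducer.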
29. Finding CC For Each Node
Attempt 1 (does not scale):
– Look at each node
– Enumerate all possible triangles
– Check which of those edges exist
Attempt 2:
– There is a limited number of High degree nodes
– Count LLL, LLH, LHH, and HHH triangles differently
– If a triangle has at least one Low node
• Pivot on Low node to count the triangles
– If a triangle has all High nodes
• Pivot but only on other neighboring High nodes (not all nodes)
31. Algorithm in Pictures
When looking at Low degree nodes
– Check for all triangles
When looking at High degree nodes
– Check for triangles with other High degree nodes
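The two rules above can be sketched as follows. This is a single-machine illustration, not the speaker's MapReduce code: the names `triangles_high_low` and `triangles_per_node` are hypothetical, a node is treated as High when its degree exceeds `threshold`, and a set of vertex triples removes duplicates when the same triangle is reached from more than one pivot. The talk does not say how duplicates are handled, so that part is an assumption. The example graph is the eight-vertex graph from Act 1.

```python
from itertools import combinations


def triangles_high_low(adj, threshold):
    """Find all triangles, pivoting differently on High and Low degree nodes."""
    high = {v for v, nbrs in adj.items() if len(nbrs) > threshold}
    found = set()
    for v, nbrs in adj.items():
        # Low pivot: check every pair of neighbors.
        # High pivot: check only pairs of High neighbors.
        candidates = nbrs & high if v in high else nbrs
        for u, w in combinations(sorted(candidates), 2):
            if w in adj[u]:
                found.add(frozenset((v, u, w)))
    return found


def triangles_per_node(adj, threshold):
    """Number of triangles incident on each node."""
    counts = {v: 0 for v in adj}
    for triangle in triangles_high_low(adj, threshold):
        for v in triangle:
            counts[v] += 1
    return counts


edge_list = [("b", "c"), ("f", "h"), ("b", "d"), ("a", "c"), ("a", "b"), ("c", "d"),
             ("c", "e"), ("f", "g"), ("d", "e"), ("b", "e"), ("g", "h")]
adj = {}
for u, v in edge_list:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# With threshold 3, only b and c (degree 4) are High.
print(len(triangles_high_low(adj, threshold=3)))  # 6
```

A triangle with at least one Low node is found when that node is the pivot, and an all-High triangle is found from any of its corners, so no triangle is missed.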
32. Clustering Coefficient Discussion
Attempt 2:
– Main idea: treat High and Low degree nodes differently
• Limit the amount of data generated (No more than O(n) per node)
– All triangles accounted for
– Can set High-Low threshold to balance the two cases
• Rule of thumb: threshold around square root of number of vertices
– A bit more complex, but still easy to code
• Doesn’t suffer from the one high degree node problem
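The effect on intermediate data can be checked by counting candidate pairs per pivot under both attempts. The sketch below is illustrative only: the function names are hypothetical, the test graph (one hub joined to 1,000 leaves that also form a chain) is made up to mimic "that guy", and the threshold follows the rule of thumb with n taken as the number of vertices. With threshold t, a Low pivot emits at most t(t-1)/2 pairs, which is under n/2 when t is about the square root of n.

```python
import math


def pivot_pairs_naive(adj):
    """Attempt 1: every node checks all pairs of its neighbors."""
    return {v: len(n) * (len(n) - 1) // 2 for v, n in adj.items()}


def pivot_pairs_high_low(adj, threshold):
    """Attempt 2: High nodes check only pairs of their High neighbors."""
    high = {v for v, n in adj.items() if len(n) > threshold}
    pairs = {}
    for v, n in adj.items():
        k = len(n & high) if v in high else len(n)
        pairs[v] = k * (k - 1) // 2
    return pairs


# Hypothetical skewed graph: hub 0 linked to leaves 1..1000, leaves chained together.
adj = {v: set() for v in range(1001)}
for leaf in range(1, 1001):
    adj[0].add(leaf)
    adj[leaf].add(0)
for leaf in range(1, 1000):
    adj[leaf].add(leaf + 1)
    adj[leaf + 1].add(leaf)

threshold = math.isqrt(len(adj))  # 31 for 1001 vertices
print(max(pivot_pairs_naive(adj).values()))                # 499500
print(max(pivot_pairs_high_low(adj, threshold).values()))  # 3
```

In this example the hub is the only High node, so it has no High neighbors to pair up and the largest pivot shrinks from 499,500 candidate pairs to 3.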
34. XXL Graphs: Conclusions
Algorithm Design
– Prove performance guarantees independent of input data
• Input skew (e.g. high degree nodes) should not severely affect algorithm performance
• Number of rounds fixed (and hopefully small)
Rethink graph algorithms:
– Connected Components: Two round approach
– Clustering Coefficient: High-Low node decomposition
– (Breaking News) Matchings: Two round sampling technique