This document discusses the PageRank algorithm for ranking nodes in a graph based on their importance. It begins by introducing graph data examples like social networks and the web graph. It then describes how PageRank works by modeling a random walk over the graph and defining the stationary distribution of this random walk as the rank of each node. Key aspects covered include: using the eigenvector formulation to solve the system of equations efficiently via power iteration; adding random teleports to address problems of dead ends and spider traps; and formulating the full PageRank algorithm using a sparse matrix to handle large graphs. The document provides detailed explanations of the mathematical foundations and implementation of PageRank.
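The power-iteration-with-teleports formulation summarized above can be sketched in a few lines of NumPy. The 3-page link matrix and the function name are illustrative, not taken from the document; `beta` is the usual damping factor:

```python
import numpy as np

def pagerank(M, beta=0.85, tol=1e-10):
    """Power iteration with random teleports.

    M is a column-stochastic link matrix: M[i, j] is the probability
    of following a link from page j to page i.
    """
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                       # start from the uniform distribution
    while True:
        r_next = beta * (M @ r) + (1 - beta) / n  # follow a link, or teleport
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Tiny hypothetical 3-page web: each column sums to 1.
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
ranks = pagerank(M)
```

The teleport term `(1 - beta) / n` is what rescues the walk from dead ends and spider traps: with probability `1 - beta` the surfer jumps to a uniformly random page instead of following a link.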
This document discusses PageRank, an algorithm used by Google Search to rank websites in their search results. It describes how PageRank works by modeling the web as a directed graph and calculating an importance score for each page based on the page's inlinks. It discusses how PageRank can be formulated as the principal eigenvector of the stochastic link matrix or as the stationary distribution of a random walk on the web graph. It also covers techniques like random teleportation to address issues like spider traps and dead ends.
This document discusses link analysis and PageRank, an algorithm for identifying important nodes in large network graphs. It begins with an overview of graph data structures and the goal of identifying influential nodes. It then introduces PageRank, explaining its basic assumptions and showing examples of how it calculates node importance scores. The document discusses problems with the initial PageRank approach and how it was improved with the Complete PageRank algorithm. Finally, it briefly introduces Topic-sensitive PageRank, which aims to identify important nodes related to specific topics.
Collaborative filtering algorithms recommend items to users based on the items liked by similar users. There are two main approaches: model-based builds a predictive model from user data, while memory-based identifies similar users and recommends popular items among them. The document describes memory-based collaborative filtering using cosine similarity to calculate user similarities based on common liked items, normalized by number of items per user. An example in R shows generating recommendations for a new user based on a training user-item matrix and similarity calculations.
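As a rough illustration of the memory-based approach (the document's own example is in R; this sketch uses Python/NumPy with a made-up user-item matrix):

```python
import numpy as np

# Toy user-item matrix (1 = liked); rows are training users.
train = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 1]], dtype=float)
new_user = np.array([1, 1, 1, 0], dtype=float)

# Cosine similarity between the new user and each training user,
# i.e. common liked items normalized by the users' vector lengths.
sims = train @ new_user / (np.linalg.norm(train, axis=1) * np.linalg.norm(new_user))

# Score each item by similarity-weighted popularity, then mask items already liked.
scores = sims @ train
scores[new_user > 0] = -np.inf
recommendation = int(np.argmax(scores))
```

Here the new user is most similar to the first two training users, and the top recommendation is the highest-scoring item they have not yet liked.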
Link Analysis
A technique that uses the graph structure to determine the relative importance of the nodes (web pages). One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as Google. While Google was not the first search engine, it was the first able to defeat the spammers who had made search almost useless.
Moreover, the innovation provided by Google was a nontrivial technological advance, called “PageRank.” When PageRank was established as an essential technique for a search engine, spammers invented ways to manipulate the PageRank of a Web page, often called link spam. That development led to the response of TrustRank and other techniques for preventing spammers from attacking PageRank.
This document provides an overview of a course on network optimization. It introduces the instructor and textbook. It summarizes the Koenigsberg bridge problem, which helped establish the field of graph theory. It discusses the mathematical definitions and terminology used in networks, such as nodes, arcs, paths, and cycles. It outlines three fundamental network flow problems: the shortest path problem, maximum flow problem, and minimum cost flow problem. It describes where network optimization is applied, such as transportation and manufacturing systems. It introduces the topic of computational complexity and how algorithms are analyzed.
This document discusses the importance and advantages of MATLAB. It notes that MATLAB has matrices as its basic data element, supports vectorized operations, and has built-in graphical and statistical functions. Toolboxes can further expand MATLAB's functionality. While it uses more memory and CPU time than other languages, MATLAB allows both command line and programming capabilities. The document provides examples of how to create matrices, perform operations on matrices using functions like sum(), transpose(), and indexing. It also discusses matrix multiplication and how operations depend on matrix dimensions.
The document provides an introduction to MATLAB and Simulink. It describes MATLAB as a numerical computing environment and matrix laboratory that is used for data analysis, algorithm development, modeling, and more across many disciplines. Simulink is introduced as a block diagram environment for multi-domain simulation and model-based design. Key features and uses of MATLAB and Simulink are outlined, including acquiring and analyzing data, developing functions and algorithms, modeling and simulation.
This document summarizes a lecture on graph algorithms and PageRank using MapReduce. It discusses representing graphs in MapReduce, performing breadth-first search, finding shortest paths, and calculating PageRank through an iterative process of redistributing PageRank values along edges in the graph. The PageRank algorithm is broken into phases that map nodes to PageRank fragments, reduce to calculate new PageRank values, and iterate until convergence is reached. While MapReduce has limitations for iterative algorithms, this approach allows processing graph partitions in parallel through multiple MapReduce jobs.
This document summarizes a lecture on graph algorithms and PageRank using MapReduce. It discusses graph representations like adjacency matrices and sparse matrices. It explains how breadth-first search and shortest path algorithms can be implemented in MapReduce through iterative passes. It then describes how PageRank can also be distributed by mapping graph nodes to PageRank value distributions, reducing the values, and iterating until convergence is reached.
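The iterate-until-convergence scheme described in these summaries can be sketched as plain Python functions, one per phase. The 3-node graph, function names, and iteration count are illustrative, not taken from the lecture:

```python
from collections import defaultdict

# Hypothetical adjacency list: node -> outlinks.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {n: 1.0 / len(graph) for n in graph}

def map_phase(graph, ranks):
    """Emit (neighbor, rank fragment) pairs: each node splits its rank over its outlinks."""
    for node, outlinks in graph.items():
        for dest in outlinks:
            yield dest, ranks[node] / len(outlinks)

def reduce_phase(pairs, nodes, beta=0.85):
    """Sum the fragments arriving at each node and apply the teleport term."""
    totals = defaultdict(float)
    for dest, fragment in pairs:
        totals[dest] += fragment
    return {node: (1 - beta) / len(nodes) + beta * totals[node] for node in nodes}

# Each pass corresponds to one MapReduce job; real implementations
# iterate until the ranks stop changing rather than a fixed count.
for _ in range(50):
    ranks = reduce_phase(map_phase(graph, ranks), list(graph))
```

In an actual MapReduce job the graph structure must also be re-emitted alongside the rank fragments so the next iteration can see the adjacency lists; that bookkeeping is omitted here.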
MATLAB is a numerical computing environment and programming language. It allows matrix manipulations, plotting of functions and data, implementation of algorithms, and interfacing with programs in other languages. MATLAB can be used for applications like signal processing, image processing, control systems, and computational finance. It offers advantages like ease of use, platform independence, and predefined functions. However, it can sometimes be slow, and it is commercial software. The MATLAB interface includes a command window, current directory, workspace, and command history. Arrays are fundamental data types in MATLAB and can be vectors, matrices, or multidimensional. Variables are used to store information in the workspace and can represent different data types. Common operations include arithmetic and built-in functions.
This document summarizes a talk on algorithms that use locality to solve network problems efficiently. It discusses how limitations on network visibility require local algorithms that make sequential decisions using limited information. It presents local algorithms for preferential attachment networks and general graphs that solve problems like finding high-degree nodes and computing minimum dominating sets. It also describes how locality can enable sublinear-time algorithms for estimating PageRank values and solving influence maximization problems in viral marketing models. The talk outlines techniques like multiscale analysis and sparse matrix methods that allow computing PageRank summaries and influential nodes faster than previous methods.
This document provides an overview of MATLAB including its history, applications, development environment, built-in functions, and toolboxes. MATLAB stands for Matrix Laboratory and was originally developed in the 1970s at the University of New Mexico to provide an interactive environment for matrix computations. It has since grown to be a comprehensive programming language and environment used widely in technical computing across many domains including engineering, science, and finance. The key components of MATLAB are its development environment, mathematical function library, programming language, graphics capabilities, and application programming interface. It also includes a variety of toolboxes that provide domain-specific functionality in areas like signal processing, neural networks, and optimization.
Traveling Salesman Problem in Distributed Environment (csandit)
In this paper, we focus on developing parallel algorithms for solving the traveling salesman problem (TSP) based on Nicos Christofides' algorithm, published in 1976. The parallel algorithm is built in a distributed Master-Slave environment with multiple processors. The algorithm is installed on the computer cluster system of the National University of Education in Hanoi, Vietnam (ccs1.hnue.edu.vn) and uses the PJ (Parallel Java) library. The results are evaluated and compared with other works.
TRAVELING SALESMAN PROBLEM IN DISTRIBUTED ENVIRONMENT (cscpconf)
The document describes developing a parallel algorithm for solving the traveling salesman problem (TSP) based on Christofides' algorithm. It discusses implementing Christofides' algorithm in a distributed environment using multiple processors. The parallel algorithm divides the graph vertices and distance matrix across slave processors, which calculate the minimum spanning tree in parallel. The master processor then finds odd-degree vertices, performs matching, and finds the Hamiltonian cycle to solve TSP. The algorithm is tested on a computer cluster using graphs of 20,000 and 30,000 nodes, showing improved runtime over the sequential algorithm.
This document provides an introduction to MATLAB. It discusses what MATLAB is, how to perform basic matrix operations and use script files and M-files. It also covers some common MATLAB commands and functions. MATLAB can be used for applications like plotting, image processing, robotics and GUI design. Key topics covered include matrices, vectors, scalars, matrix operations, logical and relational operators, selection and repetition structures, and reading/writing data files. Plotting functions allow creating graphs and 3D surface plots. Image processing, robotics and GUI design are listed as potential application areas.
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin, who will be presenting an Introduction to Hadoop for Python Programmers as well as a discussion of MapReduce with streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor and worked at Google until taking his current position as VP of Data Engineering at NYC Data Science Academy.
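The streaming facility discussed in the talk boils down to a mapper and a reducer that exchange tab-separated key/value lines. A minimal word-count sketch in ordinary Python, simulating the sort-and-group step that Hadoop performs between the two phases (function names and input are illustrative):

```python
from itertools import groupby

def mapper(lines):
    """Emit "word\t1" pairs, one per word, in the Hadoop streaming convention."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Streaming delivers mapper output sorted by key, so counts can be
    accumulated one word at a time with groupby."""
    split_pairs = (p.split("\t") for p in sorted(pairs))
    for word, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

counts = dict(reducer(mapper(["the cat sat", "the dog sat"])))
```

In a real Hadoop streaming job, `mapper` and `reducer` would be separate scripts reading stdin and writing stdout, and the framework would handle the sorting and the distribution across machines.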
This document provides an introduction to computational finance using MATLAB. It discusses MATLAB basics like matrices, vectors, solving linear equations, and generating random numbers. Key points covered include:
- MATLAB is well-suited for numerical linear algebra operations on matrices and vectors.
- Functions like rand and randn are used to generate uniformly distributed and Gaussian/normal distributed random numbers, which are important in finance.
- Histograms can be used to visualize the distributions of random numbers and converge to the probability density function as the number of samples increases.
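The convergence of a normalized histogram to the probability density function can be illustrated in a few lines. The document's examples use MATLAB's `rand`/`randn`; this sketch uses the NumPy equivalents, with an arbitrary sample size and bin count:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)     # analogue of MATLAB's randn

# Normalized histogram: with density=True the bar areas sum to 1,
# so the bar heights approximate the pdf.
heights, edges = np.histogram(samples, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Compare against the standard normal density at each bin center.
pdf = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
max_err = np.abs(heights - pdf).max()      # shrinks as the sample count grows
```

Rerunning with a larger sample (and proportionally more bins) drives `max_err` toward zero, which is the convergence the bullet above refers to.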
This document provides an introduction to MATLAB. It discusses that MATLAB is a high-performance language for technical computing that integrates computation, visualization, and programming. It can be used for tasks like math and computation, algorithm development, modeling, simulation, prototyping, data analysis, and scientific graphics. MATLAB uses arrays as its basic data type and allows matrix and vector problems to be solved more quickly than with other languages. The document then provides examples of entering matrices, using basic MATLAB commands and functions, plotting graphs, and writing MATLAB code in M-files.
The document discusses graph algorithms and PageRank and how they can be implemented using MapReduce. It covers graph representations like adjacency matrices and sparse matrices that are suitable for distributed computing. It also describes how breadth-first search, shortest path finding, and PageRank calculations can be broken down into MapReduce jobs by iteratively processing portions of the graph in parallel. While not optimal for highly iterative algorithms, MapReduce can help distribute the computation across multiple machines to process large graphs.
PageRank is Google's algorithm for ranking web pages. It defines a page's importance in terms of the number of important pages that link to it. PageRank is calculated through an iterative process in which each page distributes its rank value evenly among the pages it links to. The algorithm also addresses issues like dead ends and spider traps, which could otherwise drain or accumulate all of the importance.
1) The document proposes using a peer-to-peer network model called Content Addressable Network (CAN) to solve the graph coloring problem in a distributed manner to reduce computational time.
2) It describes how CAN works as a virtual coordinate space and how nodes are inserted.
3) It then explains the graph coloring problem and presents a recursive algorithm that divides the problem among peer nodes, combines partial solutions, and resolves conflicts to find the final solution.
Gaps between the theory and practice of large-scale matrix-based network comp... (David Gleich)
This document discusses gaps between theory and practice in large scale matrix computations for networks. It provides an overview of representing networks as matrices and canonical problems like PageRank that can be modeled as matrix computations. It then discusses different methods for solving these problems, like Monte Carlo methods, relaxation methods, and Krylov subspace methods. It analyzes the computational complexity of these approaches and identifies open problems, such as developing unified convergence results for different algorithms and handling "top k" convergence. The talk concludes by identifying more structured problems on networks that could leverage matrix computations.
The document discusses several topics:
1. It explains the stream data model architecture with a diagram showing streams entering a processing system and being stored in an archival store or working store.
2. It defines a Bloom filter and describes how to calculate the probability of a false positive.
3. It outlines the Girvan-Newman algorithm for detecting communities in a graph by calculating betweenness values and removing edges.
4. It mentions PageRank and the Flajolet-Martin algorithm for approximating the number of unique objects in a data stream.
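The false-positive calculation mentioned in point 2 is commonly approximated by the formula (1 - e^(-kn/m))^k. A small sketch with illustrative parameters (the specific sizes are not from the document):

```python
from math import exp

def bloom_fp_rate(n, m, k):
    """Approximate false-positive probability for a Bloom filter with
    n inserted items, m bits, and k hash functions: (1 - e^(-kn/m))^k.
    Each factor is the chance a given bit is set after n insertions."""
    return (1 - exp(-k * n / m)) ** k

# E.g. 1 million items in an 8-million-bit filter with 5 hash functions
# (k = 5 is close to the optimum (m/n) * ln 2 for this ratio).
rate = bloom_fp_rate(n=1_000_000, m=8_000_000, k=5)
```

With these parameters the approximation gives a false-positive rate of roughly 2%, which is the kind of number the probability calculation in the document produces.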
The document summarizes a lecture on further details of the graphics pipeline. It discusses how triangles described by vertex positions in normalized device coordinates are rasterized into pixels on the screen. The lecture covers oriented edge equations, being inside a triangle, rasterization approaches like scanline rasterization, and basic fragment shading through color interpolation. Homework status and the lecturer's office hours are also provided.
4. It mentions PageRank and the Flajolet-Martin algorithm for approximating the number of unique objects in a data stream.
The document summarizes a lecture on further details of the graphics pipeline. It discusses how triangles described by vertex positions in normalized device coordinates are rasterized into pixels on the screen. The lecture covers oriented edge equations, being inside a triangle, rasterization approaches like scanline rasterization, and basic fragment shading through color interpolation. Homework status and the lecturer's office hours are also provided.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Web as a Directed Graph
Web as a directed graph:
Nodes: webpages
Edges: hyperlinks
Broad Question
How to organize the Web?
First try: human-curated web directories
Yahoo, LookSmart, etc.
Second try: web search
Information retrieval investigates finding relevant docs in a small and trusted set
Newspaper articles, patents, etc.
But: the Web is huge, full of untrusted documents, random things, web spam, etc.
What is the “best” answer to the query “newspaper”?
No single right answer
Ranking Nodes on the Graph
All web pages are not equally “important”
http://xxx.github.io/ vs. http://www.unsw.edu.au/
There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!
Link Analysis Algorithms
We will cover the following link analysis approaches for computing the importance of nodes in a graph:
PageRank
Topic-Specific (Personalized) PageRank
HITS
Links as Votes
Idea: links as votes
A page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
http://www.unsw.edu.au/ has 23,400 in-links
http://xxx.github.io/ has 1 in-link
Are all in-links equal?
Links from important pages count more
Recursive question!
Simple Recursive Formulation
Each link’s vote is proportional to the importance of its source page
If page j with importance rj has n out-links, each link gets rj / n votes
Page j’s own importance is the sum of the votes on its in-links
Example (from the figure): page j is pointed to by page i (which has 3 out-links) and page k (which has 4 out-links), so rj = ri/3 + rk/4; j’s own 3 out-links each carry rj/3 votes onward
PageRank: The “Flow” Model
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” rj for page j:
rj = Σi→j ri / di, where di is the out-degree of node i
Example graph on pages y, a, m (y links to itself and a; a links to y and m; m links to a), giving the “flow” equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Solving the Flow Equations
3 equations, 3 unknowns, no constants
No unique solution
All solutions are equivalent modulo a scale factor
An additional constraint forces uniqueness:
ry + ra + rm = 1
Solution: ry = 2/5, ra = 2/5, rm = 1/5
Gaussian elimination works for small examples, but we need a better method for large web-scale graphs
We need a new formulation!
Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
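The constrained system can be checked numerically. A minimal sketch (using NumPy; the y/a/m ordering and variable names are ours) replaces one of the redundant flow equations with the normalization constraint:

```python
import numpy as np

# Flow equations for the 3-page example (y, a, m), written as (M - I) r = 0.
# The system is rank-deficient, so we replace one equation with the
# normalization constraint r_y + r_a + r_m = 1 to force a unique solution.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = M - np.eye(3)
A[2, :] = 1.0                    # replace the last row with the constraint
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)                         # [0.4 0.4 0.2] = (2/5, 2/5, 1/5)
```

Replacing any one of the three rows works, since the flow equations are linearly dependent.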
PageRank: Matrix Formulation
Stochastic adjacency matrix M
Let page i have di out-links
If i → j, then Mji = 1/di, else Mji = 0
M is a column-stochastic matrix: columns sum to 1
Rank vector r: a vector with an entry per page
ri is the importance score of page i
Σi ri = 1
The flow equations can be written as
r = M · r
Example
Remember the flow equation in matrix form: M · r = r
Suppose page i links to 3 pages, including j. Then column i of M has the entry Mji = 1/3, so the product M · r contributes ri/3 to rj.
Eigenvector Formulation
The flow equations can be written r = M · r
So the rank vector r is an eigenvector of the stochastic web matrix M
In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
The largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)
We know r has unit length (its entries sum to 1) and each column of M sums to one, so |M · r| ≤ 1
We can now efficiently solve for r!
The method is called power iteration
NOTE: x is an eigenvector of A with the corresponding eigenvalue λ if A x = λ x
Example: Flow Equations & M
r = M · r

      y   a   m
y  [  ½   ½   0 ]   [ry]   [ry]
a  [  ½   0   1 ] · [ra] = [ra]
m  [  0   ½   0 ]   [rm]   [rm]

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Power Iteration Method
Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
Power iteration: a simple iterative scheme
Initialize: r(0) = [1/N, …, 1/N]T
Iterate: r(t+1) = M · r(t), i.e. rj(t+1) = Σi→j ri(t) / di
where di is the out-degree of node i
Stop when |r(t+1) − r(t)|1 < ε
|x|1 = Σ1≤i≤N |xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
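The scheme above can be sketched directly. This is a toy implementation assuming a dense NumPy matrix (fine for small examples; web-scale graphs need the sparse encoding discussed later); the function and parameter names are ours:

```python
import numpy as np

def power_iterate(M, eps=1e-10, max_iter=1000):
    """Power iteration: r(t+1) = M r(t), starting from the uniform vector."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # L1 stopping criterion
            return r_next
        r = r_next
    return r

# Column-stochastic matrix for the y, a, m example graph
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))   # converges to (2/5, 2/5, 1/5)
```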
PageRank: How to solve?
For the example graph on y, a, m:

      y   a   m
y  [  ½   ½   0 ]
a  [  ½   0   1 ]
m  [  0   ½   0 ]

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Iterate r(t+1) = M · r(t) from r(0) = (⅓, ⅓, ⅓); the iterates converge to (2/5, 2/5, 1/5)
Existence and Uniqueness
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0
PageRank: Two Questions
The update is r(t+1) = M · r(t), or equivalently rj(t+1) = Σi→j ri(t) / di
Does this converge?
Does it converge to what we want?
Does this converge?
Example: two nodes a and b with a → b and b → a
Iterating rj(t+1) = Σi→j ri(t) / di from r(0) = (1, 0):
ra: 1 0 1 0 …
rb: 0 1 0 1 …
The scores oscillate forever and never converge
Does it converge to what we want?
Example: two nodes a and b with a → b, where b is a dead end
Iterating from r(0) = (1, 0):
ra: 1 0 0 0
rb: 0 1 0 0
All the importance “leaks out”
PageRank: Problems
2 problems:
(1) Some pages are dead ends (have no out-links)
The random walk has “nowhere” to go
Such pages cause importance to “leak out”
(2) Spider traps (all out-links are within the group)
The random walker gets “stuck” in a trap
Eventually spider traps absorb all importance
Problem: Dead Ends
Example graph on y, a, m, where m is a dead end:

      y   a   m
y  [  ½   ½   0 ]
a  [  ½   0   0 ]
m  [  0   ½   0 ]

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
Here the PageRank “leaks” out since the matrix is not column stochastic (the m column sums to 0)
Solution: Teleport!
Teleports: follow random teleport links with probability 1.0 from dead ends
Adjust the matrix accordingly: the all-zero column of a dead end becomes 1/N in every entry

Before:               After:
      y   a   m           y   a   m
y  [  ½   ½   0 ]     y [  ½   ½   ⅓ ]
a  [  ½   0   0 ]     a [  ½   0   ⅓ ]
m  [  0   ½   0 ]     m [  0   ½   ⅓ ]
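The column adjustment above takes only a couple of lines; a sketch (NumPy, our variable names) where dead-end columns are detected as exactly the columns summing to 0:

```python
import numpy as np

# Dead-end fix: the dead end m contributes an all-zero column,
# which we overwrite with uniform teleport probabilities 1/N.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
N = M.shape[0]
dead_ends = (M.sum(axis=0) == 0)   # boolean mask of dead-end columns
M[:, dead_ends] = 1.0 / N          # replace them with uniform teleports
```

After the fix every column sums to 1, so the matrix is column stochastic again.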
Problem: Spider Traps
Example graph on y, a, m, where m links only to itself (m is a spider trap):

      y   a   m
y  [  ½   ½   0 ]
a  [  ½   0   0 ]
m  [  0   ½   1 ]

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
Iterating from the uniform vector, all the PageRank score gets “trapped” in node m
Solution: Always Teleport!
The Google solution for spider traps: at each time step, the random surfer has two options
With probability β, follow a link at random
With probability 1 − β, jump to some random page
Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a few time steps
Why Teleports Solve the Problem?
Why are dead ends and spider traps a problem, and why do teleports solve the problem?
Spider traps are not a problem for convergence, but with traps the PageRank scores are not what we want
Solution: never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead ends are a problem
The matrix is not column stochastic, so our initial assumptions are not met
Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go
Random Teleports (β = 0.8)
A = β M + (1 − β) [1/N]N×N:

         [ ½  ½  0 ]         [ ⅓  ⅓  ⅓ ]   y [ 7/15  7/15  1/15  ]
A = 0.8  [ ½  0  0 ]  + 0.2  [ ⅓  ⅓  ⅓ ] = a [ 7/15  1/15  1/15  ]
         [ 0  ½  1 ]         [ ⅓  ⅓  ⅓ ]   m [ 1/15  7/15  13/15 ]

Power iteration with A:
ry: 1/3  0.33  0.24  0.26  …  7/33
ra: 1/3  0.20  0.20  0.18  …  5/33
rm: 1/3  0.46  0.52  0.56  …  21/33
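The construction of A and the iterates above can be reproduced numerically; a sketch assuming the same 3-page graph (y, a, m, with the spider trap at m):

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],   # y -> y, a;  a -> y, m;  m -> m (trap)
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]
A = beta * M + (1 - beta) / N    # dense "Google matrix": beta*M + (1-beta)[1/N]

r = np.full(N, 1.0 / N)          # start from the uniform vector
for _ in range(100):
    r = A @ r
print(r)                         # converges to (7/33, 5/33, 21/33)
```

Because every entry of A is positive, the iteration no longer gets stuck in the trap at m.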
Computing PageRank
Key step is matrix–vector multiplication
rnew = A · rold
Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages and 4 bytes per entry
2 billion entries for the two vectors, approx. 8 GB
But matrix A has N² = 10¹⁸ entries, and 10¹⁸ is a large number!
A = β M + (1 − β) [1/N]N×N, e.g. with β = 0.8:

         [ ½  ½  0 ]         [ ⅓  ⅓  ⅓ ]   [ 7/15  7/15  1/15  ]
A = 0.8  [ ½  0  0 ]  + 0.2  [ ⅓  ⅓  ⅓ ] = [ 7/15  1/15  1/15  ]
         [ 0  ½  1 ]         [ ⅓  ⅓  ⅓ ]   [ 1/15  7/15  13/15 ]

Even though M is sparse, A is dense
Matrix Formulation
Suppose there are N pages
Consider page i, with di out-links
We have Mji = 1/di when i → j, and Mji = 0 otherwise
The random teleport is equivalent to:
Adding a teleport link from i to every other page and setting the transition probability to (1 − β)/N
Reducing the probability of following each out-link from 1/di to β/di
Equivalently: tax each page a fraction (1 − β) of its score and redistribute it evenly
PageRank: The Complete Algorithm
Repeat until convergence:
r′j(t+1) = Σi→j β ri(t) / di   (r′j(t+1) = 0 if j has no in-links)
rj(t+1) = r′j(t+1) + (1 − S)/N, where S = Σj r′j(t+1)
If the graph has no dead ends then the amount of leaked PageRank is 1 − β. But since we have dead ends the amount of leaked PageRank may be larger, so we have to explicitly account for it by computing S.
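The complete algorithm can be sketched end-to-end. This is an illustrative implementation (function and variable names are ours), storing the graph as a dict of out-link lists so that dead ends simply have no entry:

```python
import numpy as np

def pagerank_complete(out_links, N, beta=0.8, eps=1e-10):
    """Complete PageRank: distribute beta * r_i / d_i along links, then
    re-insert the leaked mass (1 - S)/N uniformly, where S = sum_j r'_j.
    Dead ends (pages with no out-links) are handled implicitly."""
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, dests in out_links.items():
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        S = r_new.sum()
        r_new += (1.0 - S) / N           # leaked mass redistributed evenly
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# y=0, a=1, m=2; m is a dead end (it has no out-links entry)
out_links = {0: [0, 1], 1: [0, 2]}
r = pagerank_complete(out_links, N=3)
print(r, r.sum())                        # scores sum to 1 despite the dead end
```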
Sparse Matrix Encoding
Encode the sparse matrix using only its nonzero entries
Space proportional roughly to the number of links
Say 10N links, or 4 × 10 × 1 billion = 40 GB
Still won’t fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
Basic Algorithm: Update Step
Assume enough RAM to fit rnew into memory
Store rold and matrix M on disk
One step of power iteration:
Initialize all entries of rnew = (1 − β)/N
For each page i (of out-degree di):
Read into memory: i, di, dest1, …, destdi, rold(i)
For j = 1 … di:
rnew(destj) += β rold(i) / di

source   degree   destination
0        3        1, 5, 6
1        4        17, 64, 113, 117
2        2        13, 23
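One update step over the on-disk row format can be simulated in a few lines. The rows below are hypothetical (N = 7; nodes 3–6 have no out-links, so some PageRank leaks and rnew sums to less than 1, matching the dead-end discussion):

```python
import numpy as np

beta, N = 0.8, 7
rows = [                  # hypothetical (source, out-degree, destinations) rows
    (0, 3, [1, 5, 6]),
    (1, 2, [2, 3]),
    (2, 2, [0, 4]),
]

r_old = np.full(N, 1.0 / N)
r_new = np.full(N, (1 - beta) / N)   # start every entry at the teleport share
for i, d_i, dests in rows:           # stream one "disk" row at a time
    for j in dests:
        r_new[j] += beta * r_old[i] / d_i
```

Only rnew lives in memory; rold and the rows are read sequentially, exactly as in the disk-based algorithm.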
Analysis
Assume enough RAM to fit rnew into memory
Store rold and matrix M on disk
In each iteration, we have to:
Read rold and M
Write rnew back to disk
Cost per iteration of the power method: 2|r| + |M|
Question: what if we could not even fit rnew in memory?
Split rnew into blocks (details omitted)
Some Problems with PageRank
Measures the generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific (Personalized) PageRank (next)
Uses a single measure of importance
There are other models of importance
Solution: Hubs-and-Authorities
Topic-Specific PageRank
Instead of generic popularity, can we measure popularity within a topic?
Goal: evaluate web pages not just according to their popularity, but by how close they are to a particular topic, e.g., “sports” or “history”
Allows search queries to be answered based on the interests of the user
Topic-Specific PageRank
The random walker has a small probability of teleporting at any step
The teleport can go to:
Standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems)
Topic-Specific PageRank: a topic-specific set of “relevant” pages (the teleport set)
Idea: bias the random walk
When the walker teleports, she picks a page from a set S
S contains only pages that are relevant to the topic
E.g., Open Directory (DMOZ) pages for a given topic/query
For each teleport set S, we get a different rank vector rS
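A sketch of the biased walk, assuming the earlier 3-page matrix M and a teleport set containing only page y (index 0); the function name is ours:

```python
import numpy as np

def topic_sensitive_pagerank(M, teleport_set, beta=0.8, iters=100):
    """Teleports land only on pages in the topic set S (uniformly),
    instead of on any page; otherwise this is standard power iteration."""
    N = M.shape[0]
    v = np.zeros(N)
    v[list(teleport_set)] = 1.0 / len(teleport_set)   # teleport distribution
    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * v
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(topic_sensitive_pagerank(M, teleport_set={0}))  # biased toward page 0
```

Changing the teleport set S yields a different vector rS, which is the point of the technique.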
Hubs and Authorities
HITS (Hypertext-Induced Topic Selection)
A measure of the importance of pages or documents, similar to PageRank
Proposed at around the same time as PageRank (1998)
Goal: say we want to find good newspapers
Don’t just find newspapers; find “experts”, i.e. people who link in a coordinated way to good newspapers
Idea: links as votes
A page is more important if it has more links
In-coming links? Out-going links?
Finding Newspapers
Hubs and authorities: each page has 2 scores:
Quality as an expert (hub): total sum of the votes of the authorities it points to
Quality as content (authority): total sum of the votes coming from experts
Principle of repeated improvement
(In the figure, experts point at newspapers, which collect authority scores: NYT: 10, WSJ: 9, CNN: 8, Ebay: 3, Yahoo: 3)
Hubs and Authorities
Interesting pages fall into two classes:
Authorities are pages containing
useful information
Newspaper home pages
Course home pages
Home pages of auto manufacturers
Hubs are pages that link to authorities
List of newspapers
Course bulletin
List of US auto manufacturers
Counting in-links: Authority
Each page starts with hub score 1. Authorities then collect their votes: a page’s authority score is the sum of the hub scores of the nodes pointing to it (e.g., NYT in the figure).
(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)
Expert Quality: Hub
Hubs then collect authority scores: a node’s hub score is the sum of the authority scores of the nodes that it points to.
(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)
Existence and Uniqueness
h = A · a
a = Aᵀ · h
h = A Aᵀ · h
a = Aᵀ A · a
Under reasonable assumptions about A, HITS converges to vectors h* and a*:
h* is the principal eigenvector of the matrix A Aᵀ
a* is the principal eigenvector of the matrix Aᵀ A
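The repeated improvement loop can be sketched as follows; the 4-page graph is hypothetical (pages 0 and 3 act as hubs/experts, pages 1 and 2 as candidate authorities):

```python
import numpy as np

def hits(A, iters=100):
    """HITS iteration: h = A a, a = A^T h, where A[i, j] = 1 iff page i
    links to page j. Scores are rescaled each round so they stay bounded;
    the directions converge to the principal eigenvectors of A A^T (hubs)
    and A^T A (authorities)."""
    h = np.ones(A.shape[0])
    a = np.zeros(A.shape[0])
    for _ in range(iters):
        a = A.T @ h        # authorities collect hub votes
        a = a / a.max()    # rescale
        h = A @ a          # hubs collect authority votes
        h = h / h.max()
    return h, a

# Hypothetical toy graph: "experts" 0 and 3 link to newspaper pages 1 and 2;
# page 1 receives two expert votes, page 2 only one.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]])
h, a = hits(A)
```

Page 1 ends up with the highest authority score and page 0 with the highest hub score, matching the repeated-improvement intuition.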
PageRank and HITS
PageRank and HITS are two solutions to the same problem:
What is the value of an in-link from u to v?
In the PageRank model, the value of the link depends on the links into u
In the HITS model, it depends on the value of the other links out of u
PageRank computes authorities only; HITS computes both authorities and hubs
The existence of dead ends or spider traps does not affect the solution of HITS
PageRank in MapReduce
One iteration of the PageRank algorithm takes an estimated PageRank vector r and computes the next estimate r′ by
r′ = β M · r + [(1 − β)/N]N   (a vector with all N entries equal to (1 − β)/N)
Mapper: input is a line containing node u, ru, and a list of the out-going neighbors of u
For each neighbor v, emit (v, ru/deg(u))
Emit (u, the list of out-going neighbors of u)
Reducer: input is (node v, a list of values <ru/deg(u), …>)
Aggregate the results according to the equation to compute r′v
Emit (node v, r′v, the list of out-going neighbors of v)
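The mapper/reducer pair can be simulated in plain Python (the shuffle that groups values by key is done by hand here; β, N, and the 3-node y/a/m example graph are the ones used earlier):

```python
from collections import defaultdict

BETA, N = 0.8, 3   # three-node example graph: y=0, a=1, m=2

def mapper(u, r_u, neighbors):
    """Input line: node u, its current rank r_u, its out-going neighbors."""
    for v in neighbors:
        yield v, r_u / len(neighbors)   # rank contribution along u -> v
    yield u, neighbors                  # pass the graph structure through

def reducer(v, values):
    contribs = [x for x in values if not isinstance(x, list)]
    neighbors = next(x for x in values if isinstance(x, list))
    r_v = BETA * sum(contribs) + (1 - BETA) / N   # apply the equation
    return v, r_v, neighbors

# Simulate one iteration, shuffling mapper output by key ourselves:
lines = [(0, 1/3, [0, 1]), (1, 1/3, [0, 2]), (2, 1/3, [1])]
grouped = defaultdict(list)
for u, r_u, nbrs in lines:
    for key, val in mapper(u, r_u, nbrs):
        grouped[key].append(val)
ranks = {v: r for v, r, _ in (reducer(k, vals) for k, vals in grouped.items())}
```

Emitting the neighbor list alongside the contributions is what lets the next iteration's mapper run on the reducer's output alone.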