Obtaining actionable insights from large datasets requires methods that are at once fast, scalable, and statistically sound. This is the field of study of algorithmic data science, a discipline at the border of computer science and statistics. In this talk I outline the fundamental questions that motivate research in this area, present a general framework for formulating many problems in the field, introduce the challenges in balancing theoretical and statistical correctness with practical efficiency, and show how sampling-based algorithms are extremely effective at striking the right balance in many situations, giving examples from social network analysis and pattern mining. I will conclude with some research directions and areas for future exploration.
Graph Summarization with Quality Guarantees - Two Sigma
Given a large graph, the authors aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation.
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix... - Two Sigma
The authors present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
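A minimal Python sketch of the flavor of such reservoir-based estimators: it follows the TRIÈST-base idea for insertion-only streams (keep at most M edges, reweight each detected triangle by the inverse probability that its other two edges are in the sample), but the function below is only an illustration under those assumptions, not the authors' full suite, which also handles deletions and local counts.

    import random

    def streaming_triangle_estimate(edge_stream, M=10000):
        # Reservoir of at most M edges; estimates the global triangle
        # count of an insertion-only edge stream (TRIEST-base-like sketch).
        sample = set()
        neighbors = {}
        t = 0
        estimate = 0.0

        def link(u, v):
            sample.add((u, v))
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)

        def unlink(u, v):
            sample.discard((u, v))
            neighbors[u].discard(v)
            neighbors[v].discard(u)

        for u, v in edge_stream:
            t += 1
            # Inverse probability that the two other edges of a triangle
            # closed by (u, v) are both in the reservoir right now.
            weight = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
            estimate += weight * len(neighbors.get(u, set())
                                     & neighbors.get(v, set()))
            if len(sample) < M:
                link(u, v)
            elif random.random() < M / t:
                unlink(*random.choice(list(sample)))
                link(u, v)
        return estimate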
Mathematics (from Greek μάθημα máthēma, “knowledge, study, learning”) is the study of topics such as quantity (numbers), structure, space, and change. There is a range of views among mathematicians and philosophers as to the exact scope and definition of mathematics.
Enumeration methods are very important in a variety of settings, both mathematical and applied. For many problems there is no realistic hope of completing the enumeration in reasonable time, because the number of solutions is so large. This talk is about how to compute at the limit.
The talk is organized into:
(a) Regular enumeration procedures using computerized case distinction.
(b) Use of symmetry groups for isomorphism checks.
(c) The augmentation scheme, which allows one to enumerate objects up to isomorphism without keeping the full list in memory.
(d) The homomorphism principle, which allows one to map a complex problem to a simpler one.
Towards a stable definition of Algorithmic Randomness - Hector Zenil
Although information content is invariant up to an additive constant, the range of possible additive constants applicable to programming languages is so large that in practice it plays a major role in the actual evaluation of K(s), the Kolmogorov complexity of a string s. We present a summary of the approach we've developed to overcome the problem by calculating the algorithmic probability of a string and evaluating its algorithmic complexity via the coding theorem, thereby providing a stable framework for Kolmogorov complexity even for short strings. We also show that reasonable formalisms produce reasonable complexity classifications.
Fractal dimension versus Computational Complexity - Hector Zenil
We investigate connections and tradeoffs between two important complexity measures: fractal dimension and computational (time) complexity. We report exciting results applied to space-time diagrams of small Turing machines with precise mathematical relations and formal conjectures connecting these measures. The preprint of the paper is available at: http://arxiv.org/abs/1309.1779
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ... - Hector Zenil
Complexity measures are designed to capture complex behaviour and to quantify how complex that particular behaviour is. If a certain phenomenon is genuinely complex, it does not suddenly become simple merely by translating it to a different setting or framework with a different complexity value. It is in this sense that we expect different complexity measures, possibly from entirely different fields, to be related to each other. This talk presents our work on a beautiful connection between the fractal dimension of space-time diagrams of Turing machines and their time complexity. Presented at Machines, Computations and Universality (MCU) 2013, Zurich, Switzerland.
Core–periphery detection in networks with nonlinear Perron eigenvectors - Francesco Tudisco
Core–periphery detection is a highly relevant task in exploratory network analysis. Given a network of nodes and edges, one is interested in revealing the presence and measuring the consistency of a core–periphery structure using only the network topology. This mesoscale network structure consists of two sets: the core, a set of nodes that is highly connected across the whole network, and the periphery, a set of nodes that is well connected only to the nodes that are in the core. Networks with such a core–periphery structure have been observed in several applications, including economic, social, communication and citation networks.
In this talk we discuss a new core–periphery detection model based on the optimization of a class of core–periphery quality functions. While the quality measures are highly nonconvex in general and thus hardly treatable, we show that the global solution coincides with the nonlinear Perron eigenvector of a suitably defined parameter-dependent matrix M(x), i.e. the positive solution to the nonlinear eigenvector problem M(x)x = λx. Using recent advances in nonlinear Perron–Frobenius theory, we discuss uniqueness of the global solution and propose a nonlinear power-method-type scheme that (a) allows us to solve the optimization problem with global convergence guarantees and (b) effectively scales to very large and sparse networks. Finally, we present several numerical experiments showing that the new method largely outperforms state-of-the-art techniques for core–periphery detection.
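The computational core of such a method can be sketched in a few lines. Below is a generic nonlinear power iteration x ← M(x)x / ‖M(x)x‖ in Python; the callable M is a placeholder to be supplied by the caller, since the specific parameter-dependent matrix used for core–periphery scoring is defined in the underlying paper.

    import numpy as np

    def nonlinear_power_method(M, x0, tol=1e-9, max_iter=1000):
        # Fixed-point iteration for the nonlinear eigenvector problem
        # M(x) x = lambda x, with M(x) a nonnegative matrix depending on
        # the current positive iterate x (generic sketch; M is supplied
        # by the caller, not the paper's exact construction).
        x = np.abs(x0) / np.linalg.norm(x0)
        for _ in range(max_iter):
            y = M(x) @ x
            y = y / np.linalg.norm(y)
            if np.linalg.norm(y - x) < tol:
                return y
            x = y
        return x

When M(x) = A is constant, this reduces to the classical power method for the Perron eigenvector of a nonnegative matrix A.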
Information Content of Complex Networks - Hector Zenil
This short talk, given in Stockholm, Sweden, explains how algorithmic complexity measures, notably Kolmogorov complexity approximated both by lossless compression algorithms and the Block Decomposition Method (BDM), are capable of characterizing graphs and networks by some of their group-theoretic and topological properties, notably graph automorphism group size and clustering coefficients of complex networks. The method distinguished between models of networks such as regular, random, small-world, and scale-free networks.
We are interested in finding a permutation of the entries of a given square matrix so that the maximum number of its nonzero entries are moved to one of the corners in an L-shaped fashion.
If we interpret the nonzero entries of the matrix as the edges of a graph, this problem boils down to detecting the so-called core–periphery structure, consisting of two sets: the core, a set of nodes that is highly connected across the whole graph, and the periphery, a set of nodes that is well connected only to the nodes that are in the core.
Matrix reordering problems have applications in sparse factorizations and preconditioning, while revealing core–periphery structures in networks has applications in economic, social and communication networks.
Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...” - diannepatricia
Martin Takac, Assistant Professor, Lehigh University, gave a great presentation today on “Solving Large-Scale Machine Learning Problems in a Distributed Way” as part of our Cognitive Systems Institute Speaker Series.
Slides from our PacificVis 2015 presentation.
The paper tackles the problem of “giant hairballs”, the dense and tangled structures often resulting from visualization of large social graphs. Proposed is a high-dimensional rotation technique called AGI3D, combined with an ability to filter elements based on social centrality values. AGI3D is targeted at a high-dimensional embedding of a social graph and its projection onto 3D space. It allows the user to rotate the social graph layout in the high-dimensional space by mouse-dragging a vertex. Its high-dimensional rotation effects give the user the illusion of destructively reshaping the social graph layout, but in reality it assists the user in finding a preferred positioning and direction in the high-dimensional space from which to view the internal structure of the layout, keeping it unmodified. A prototype implementation of the proposal, called Social Viewpoint Finder, is tested with about 70 social graphs, and this paper reports four of the analysis results.
To describe the dynamics taking place in networks that structurally change over time, we propose an approach to search for attributes whose value changes impact the topology of the graph. In several applications, variations in a group of attributes are often followed by structural changes in the graph that those variations may plausibly have triggered. We formalize the triggering pattern discovery problem and address it with a method jointly rooted in sequence mining and graph analysis. We apply our approach to three real-world dynamic graphs of different natures - a co-authoring network, an airline network, and a social bookmarking system - assessing the relevance of the triggering pattern mining approach.
Slides reviewing the paper:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 6000-6010. 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
The State of Open Data on School Bullying - Two Sigma
How much of a problem is school bullying in NYC? The answer depends on who you ask. Data Clinic volunteers compared local surveys (where many students say bullying is happening) with federal data (where a majority of schools report zero incidents), to analyze these disparities for the 2013-14 school year. To present this work, the Data Clinic hosted an event as part of NYC’s Open Data Week, featuring a presentation of the analysis and a panel discussion with researchers, advocates, and journalists to better understand this important student safety issue.
Halite is an open source artificial intelligence programming competition, created by Two Sigma, where players build bots using the coding language of their choice to battle on a two-dimensional virtual board. Halite II, running on GCP, supported about 6,000 active game players from about 100 countries and 1,000 institutions over a three month period. The presentation surveys the principles needed for a successful AI programming competition and describes the architecture of the game environment, particularly the support GCP provided for 12 million game executions written in over 20 programming languages. Among other topics, this talk illustrates the approaches taken to security, scalability, and the considerations needed to allow machine learning bots to place in the top 50 results.
BeakerX is a collection of kernels and extensions for the Jupyter interactive computing platform. Its major features are: 1) JVM kernel support including Java, Scala, Groovy, Clojure, Kotlin, and SQL, with the kernels built from a shared base kernel that includes magics and classpath support; 2) a collection of interactive widgets for time-series plots, tables, and forms, with APIs for our JVM languages plus Python and JavaScript; 3) prototype autotranslation for polyglot programming; 4) one-click publication including interactive widgets; and 5) a data browser with drag-and-drop into the notebook. The presentation will include a demo of BeakerX and discussion of its history and relationship to its predecessor, the Beaker Notebook.
Engineering with Open Source - Hyonjee Joo - Two Sigma
Engineering systems using open source solutions can be a powerful way to leverage existing technology. However, not all open source solutions are made or supported equally, and it’s important to choose what you use carefully. In this talk, we’ll walk through building a metrics system for a high performance data platform, taking a look at some of the important factors to consider when choosing what open source offerings to use.
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson - Two Sigma
The NERF and Heads projects bring Linux back to cloud servers' boot ROMs by replacing nearly all of the vendor firmware with a reproducibly built Linux runtime that acts as a fast, flexible, and measured boot loader. It has been years since any modern servers have supported Free Firmware options like LinuxBIOS or coreboot, and as a result server and cloud security has been dependent on unreviewable, closed source, proprietary vendor firmware of questionable quality. With Heads on NERF, we are making it possible to take back control of our systems with Open Source Software from very early in the boot process, helping build a more trustworthy and secure cloud.
Waiter: An Open-Source Distributed Auto-Scaler - Two Sigma
One of the key challenges in developing a service-oriented architecture (SOA) is anticipating traffic patterns and scaling the number of running instances of services to meet demand. In many situations, it’s hard to know how much traffic a service will receive and when that traffic will come. A service may see no requests for several days in a row and then suddenly see thousands of requests per second. If developers underestimate peak traffic, their service can quickly become overwhelmed and unresponsive, and may even crash, resulting in constant human intervention and poor developer productivity. On the other hand, if they provision sufficient capacity upfront, the resources they allocate will be completely wasted when there’s no traffic. In order to allow for better resource utilization, many cluster management platforms provide auto-scaling features. These features tend to auto-scale at the machine/resource level (as opposed to the request level) or by deferring to logic in the application layer. A better approach is to run services when–and only when–there is traffic. Waiter is a distributed auto-scaler that delivers this type of request-level auto-scaling. It requires no input or handling from applications and is agnostic to underlying cluster managers; it currently uses Mesos, but can easily run on top of Kubernetes or other solutions. Another challenge with SOAs is enabling the evolution of service implementations without breaking downstream customers. On this front, Waiter supports service-versioning for downstream consumers by running multiple, individually-addressable versions of services. It automatically manages service lifecycles and reaps older versions after a period of inactivity. With a variety of unique features, Waiter is a compelling platform for applications across a broad range of industries. Existing web services can run on Waiter without modification as long as they communicate over HTTP and support the transmission of client requests to arbitrary backends. Two Sigma has employed the platform in a variety of critical production contexts for over two years, with use cases rising to hundreds of millions of requests per day.
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye - Two Sigma
Designing a system that can extract immediate insights from large amounts of data in real-time requires a special way of thinking. This talk presents a “reactive” approach to designing real-time, responsive, and scalable data applications that can continuously compute analytics on-the-fly. It also highlights a case study as an example of reactive design in action.
Archival Storage at Two Sigma - Josh Leners - Two Sigma
This talk is about archival storage at Two Sigma. We begin by presenting CelFS, Two Sigma’s geo-distributed file system, which has been in deployment for over ten years. Although CelFS has scaled to serve tens of petabytes of data, it uses physical partitioning to provide quality-of-service guarantees, has a high replication overhead, and cannot take advantage of outsourced cold storage (e.g., Amazon’s Glacier or Google’s Coldline). In the rest of the talk, we describe our response to these limitations in Jaks, a new storage system to reduce the TCO of CelFS and serve as the backend for other systems at Two Sigma. We also discuss how we hedge risk in changing such a foundational system.
Smooth Storage - A distributed storage system for managing structured time se... - Two Sigma
Smooth is a distributed storage system for managing structured time series data at Two Sigma. Smooth’s design emphasizes scale (both in terms of size and aggregate request bandwidth), reliability, and storage efficiency. It is optimized for large parallel streaming read/write accesses over provided time ranges. Smooth has a clear separation between the metadata and data layers, and supports multiple pluggable object stores for storing data files. Data can be replicated or moved between different stores and data centers to support availability, performance, and storage tiering objectives. Smooth is widely used at Two Sigma by various applications including modeling research workflows, data pipelines, and various data analysis jobs. Smooth has been in development for about 5 years, currently stores multiple PBs of compressed data, and serves peak aggregate throughput in excess of 100 GB/s. In this talk I will discuss the design and implementation of Smooth, our experience running it over the past two years, ongoing challenges, and future directions.
Whether your data's in MySQL, a NoSQL, or somewhere in the cloud, you're likely paying decent money for storage and IOPS. With ever-growing data volumes, and the need for SSDs to cut latency and replication to provide insurance, your storage footprint is an important place to look for savings. It makes sense, then, why so many storage vendors tout compression as a key metric and differentiator.
The language vendors and users employ to reason about storage footprint and compression is embarrassingly vague if not meaningless or downright deceptive, but we can do better, and we must do better.
This presentation discusses each part of the durable storage stack, from the hardware on up, and how usage numbers can take on different meanings at each layer. It covers what's important to know at each layer, and how to think about and talk about concepts like compression, fragmentation, write amplification, and wear leveling. Finally, it examines different ways benchmarketers can present data deceptively, and provides some techniques for identifying and cutting through those kinds of misrepresentations.
Identifying Emergent Behaviors in Complex Systems - Jane Adams - Two Sigma
Forager ants in the Arizona desert have a problem: after leaving the nest, they don’t return until they’ve found food. On the hottest and driest days, this means many ants will die before finding food, let alone before bringing it back to the nest. Honeybees also have a problem: even small deviations from 35°C in the brood nest can lead to brood death, malformed wings, susceptibility to pesticides, and suboptimal divisions of labor within the hive. All ants in the colony coordinate to minimize the number of forager ants lost while maximizing the amount of food foraged, and all bees in the hive coordinate to keep the brood nest temperature constant in changing environmental temperatures.
The solutions realized by each system are necessarily decentralized and abstract: no single ant or bee coordinates the others, and the solutions must withstand the loss of individual ants and bees and extend to new ants and bees. They focus on simple yet essential features and capabilities of each ant and bee, and use them to great effect. In this sense, they are incredibly elegant.
In this talk, we’ll examine a handful of natural and computer systems to illustrate how to cast system-wide problems into solutions at the individual component level, yielding incredibly simple algorithms for incredibly complex collective behaviors.
Improving Python and Spark Performance and Interoperability with Apache Arrow - Two Sigma
Apache Arrow-based interconnection between various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently.
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar... - Two Sigma
The Vera Institute of Justice (Vera) partnered with Two Sigma’s Data Clinic, a volunteer-based program that leverages employees’ data science expertise, to uncover the factors contributing to continued jail growth in rural areas.
An overview of Rademacher Averages, a fundamental concept from statistical learning theory that can be used to derive uniform sample-dependent bounds on the deviation of sample averages from their expectations.
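A small Python sketch of the quantity itself: the empirical Rademacher average of a finite function family on a fixed sample, estimated by Monte-Carlo over random sign vectors. The (k, m) matrix layout and the finite family are simplifying assumptions made for illustration.

    import numpy as np

    def empirical_rademacher(values, n_mc=1000, rng=None):
        # values[j, i] = f_j(s_i): function f_j evaluated at sample
        # point s_i. Estimates E_sigma[ sup_j (1/m) sum_i sigma_i f_j(s_i) ].
        rng = rng or np.random.default_rng()
        k, m = values.shape
        total = 0.0
        for _ in range(n_mc):
            sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
            total += np.max(values @ sigma) / m      # sup over the family
        return total / n_mc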
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank ... Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is ...
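Since the note's text is truncated here, a minimal sketch of the CSR layout it refers to may help: two flat arrays, row offsets plus concatenated adjacency lists. The helper below is illustrative, not code from the report.

    def to_csr(n, edges):
        # offsets[u]..offsets[u+1] delimits the slice of targets holding
        # u's out-neighbors (Compressed Sparse Row adjacency).
        offsets = [0] * (n + 1)
        for u, _ in edges:
            offsets[u + 1] += 1
        for u in range(n):
            offsets[u + 1] += offsets[u]
        targets = [0] * len(edges)
        cursor = offsets[:-1].copy()  # next free slot in each row
        for u, v in edges:
            targets[cursor[u]] = v
            cursor[u] += 1
        return offsets, targets

    # to_csr(3, [(0, 1), (0, 2), (1, 2)]) -> ([0, 2, 3, 3], [1, 2, 2])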
Multiply with different modes (map)
1. Performance of sequential vs. OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs. bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs. OpenMP-based vector element sum.
2. Performance of memcpy vs. in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Adjusting OpenMP PageRank : SHORT REPORT / NOTES - Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives (i.e., sumAt, multiply) in sequential mode.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Techniques to optimize the pagerank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
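The first of these optimizations (skipping vertices whose rank has converged) can be sketched in Python as below; the CSR inputs, the thresholds, and the assumption of no dangling vertices are all simplifications, and the full STICD combination is considerably more involved.

    import numpy as np

    def pagerank_skip_converged(in_offsets, in_targets, outdeg,
                                alpha=0.85, tol=1e-10, skip_tol=1e-14,
                                max_iter=100):
        # in_offsets/in_targets: CSR of the reverse graph, so that
        # in_targets[in_offsets[v]:in_offsets[v+1]] lists v's in-neighbors.
        # Assumes every vertex has outdeg > 0 (no dangling nodes).
        outdeg = np.asarray(outdeg, dtype=float)
        n = len(outdeg)
        r = np.full(n, 1.0 / n)
        active = np.ones(n, dtype=bool)    # vertices still being updated
        for _ in range(max_iter):
            contrib = r / outdeg
            r_new = r.copy()
            for v in range(n):
                if not active[v]:
                    continue               # skip converged vertices
                s = sum(contrib[u] for u in
                        in_targets[in_offsets[v]:in_offsets[v + 1]])
                r_new[v] = (1 - alpha) / n + alpha * s
            diff = np.abs(r_new - r)
            active &= diff > skip_tol      # freeze converged vertices
            r = r_new
            if diff.sum() < tol:
                break
        return r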
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
1. Algorithmic Data Science = Theory + Practice
Matteo Riondato – Labs, Two Sigma Investments
@teorionda – http://matteo.rionda.to
IEEE MIT URTC – November 5, 2016
2. Matteo Riondato
Ph.D. in CS
Working at
Labs, Two Sigma Investments (Research Scientist);
CS Dept., Brown U. (Visiting Asst. Prof.);
Doing research on algorithmic data science;
Tweeting @teorionda;
Reading matteo@twosigma.com;
“Living” at http://matteo.rionda.to.
3. Conjecture
Let X be a scientific discipline. Then
21st-century X = datascience(X) + ε.
Partial evidence: “Computational X” exists for many X.
4. data science : 21st century = statistics : 20th century
5. data science for 21st-century society
[slide diagram relating questions and data]
8. data science =
1/4 data representation and management
1/4 mathematical and statistical modeling
1/4 computational thinking and algorithms
1/4 domain expertise
Shake well, and strain into a cocktail glass.
17. Scientific question: Find relevant webpages on the web, influential participants in an email chain, key proteins in a network, . . .
Data representation: represent the data as a graph G = (V, E).
[example graph on nodes a–h]
Modeling question: What are the important nodes in a graph G = (V, E)?
We need f : V → R+ to express the importance of a node.
The higher f(x) is, the more important x ∈ V is.
18. Domain Knowledge / Modeling: Assume that
1) every node wants to communicate with every node; and
2) communication progresses along Shortest Paths (SPs).
Then, the higher the no. of SPs that a node v belongs to, the more important v is.
Definition
For each node x ∈ V, the betweenness b(x) of x is
b(x) = (1 / (n(n − 1))) Σ_{u≠x≠v ∈ V} σ_uv(x) / σ_uv ∈ [0, 1]
• σ_uv: number of SPs from u to v, u, v ∈ V;
• σ_uv(x): number of SPs from u to v that go through x.
I.e., b(x) is the weighted fraction of SPs that go through x, among all SPs in G.
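The definition translates directly into (deliberately naive) code: one BFS per node to get the shortest-path counts σ, then a triple loop over (u, x, v). The sketch below assumes unweighted graphs and is a cubic-time illustration of the formula, not something to run on large graphs; that inefficiency is exactly what the later slides address.

    from collections import deque

    def betweenness_exact(adj):
        # adj: dict mapping each node to an iterable of its neighbors.
        nodes = list(adj)
        n = len(nodes)
        dist, sigma = {}, {}
        for s in nodes:  # BFS from every node: distances and SP counts
            d, sig, q = {s: 0}, {s: 1}, deque([s])
            while q:
                u = q.popleft()
                for w in adj[u]:
                    if w not in d:
                        d[w] = d[u] + 1
                        sig[w] = 0
                        q.append(w)
                    if d[w] == d[u] + 1:
                        sig[w] += sig[u]
            dist[s], sigma[s] = d, sig
        b = {x: 0.0 for x in nodes}
        for u in nodes:
            for v in nodes:
                if u == v or v not in dist[u]:
                    continue
                for x in nodes:
                    if x in (u, v) or x not in dist[u] or v not in dist[x]:
                        continue
                    if dist[u][x] + dist[x][v] == dist[u][v]:
                        # SPs from u to v through x = sigma(u,x) * sigma(x,v)
                        b[x] += sigma[u][x] * sigma[x][v] / sigma[u][v]
        return {x: b[x] / (n * (n - 1)) for x in nodes}

    # betweenness_exact({0: [1], 1: [0, 2], 2: [1]})[1] == 2/6
    # (the ordered pairs (0, 2) and (2, 0) both pass through node 1)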
19. [example graph on nodes a–h]
Node x : a     b      c      d      e      f      g      h
b(x)   : 0     0.250  0.125  0.036  0.054  0.080  0.268  0
21. Algorithmic question: How to compute all b(x)?
Brandes’ Algorithm
Intuition: For each vertex s ∈ V:
1) Build the SP DAG from s via Dijkstra/BFS;
2) Traverse the SP DAG from the most distant node towards s, in reverse order of distance. During the walk, appropriately increment b(v) of each non-leaf node v traversed.
[Slides 21–29 animate the backward walk on an example DAG rooted at source s = 1; updates to b(v) not shown.]
Time complexity: O(nm + n² log n): n Dijkstra’s, plus n backward walks, taking at most n each.
Too much even with just 10⁴ nodes.
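The same intuition in compact Python for unweighted graphs (BFS in place of Dijkstra): build the SP DAG from each source, then sweep it backwards accumulating dependencies. The normalization by n(n − 1) matches the earlier definition; this is a sketch, not a tuned implementation.

    from collections import deque

    def brandes(adj):
        nodes = list(adj)
        n = len(nodes)
        b = {v: 0.0 for v in nodes}
        for s in nodes:
            # 1) BFS building the shortest-path DAG from s
            dist, sigma = {s: 0}, {s: 1}
            preds = {v: [] for v in nodes}
            order, q = [], deque([s])
            while q:
                u = q.popleft()
                order.append(u)
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        sigma[w] = 0
                        q.append(w)
                    if dist[w] == dist[u] + 1:
                        sigma[w] += sigma[u]
                        preds[w].append(u)
            # 2) Backward walk in reverse order of distance,
            #    accumulating pair dependencies
            delta = {v: 0.0 for v in order}
            for w in reversed(order):
                for u in preds[w]:
                    delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
                if w != s:
                    b[w] += delta[w]
        return {v: b[v] / (n * (n - 1)) for v in nodes}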
30. Modeling / Domain knowledge:
High-quality approximations of all BCs are sufficient.
Let ε ∈ (0, 1) and δ ∈ (0, 1) be user-specified parameters.
An (ε, δ)-approximation is a set {b̃(x), x ∈ V} of n values s.t.
Pr(∃x ∈ V s.t. |b̃(x) − b(x)| > ε) ≤ δ,
i.e., with prob. ≥ 1 − δ, for all x ∈ V, b̃(x) is within ε of b(x):
a uniform probabilistic guarantee over all the estimations.
32. Algorithmic question:
How to obtain an (ε, δ)-approximation quickly?
Answer: Sampling
Instead of computing all the SPs from each node x ∈ V, compute them only from some randomly chosen nodes (samples).
Theory question:
How many samples do we need to obtain an (ε, δ)-approximation?
The more the better, but really, how many?
33. How many samples do we need to obtain an (ε, δ)-approximation?
Theory: Hoeffding Bound + Union Bound
Need O((1/ε²)(log |V| + log(1/δ))) samples.
Comments
Practice: Fewer samples than the above are sufficient for (ε, δ)-approx.
Theory: Dependency on |V| and not on edge structure seems wrong.
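In code, the source-sampling estimator with this sample size might look as follows. Each sampled source contributes its backward-walk dependency (as in the Brandes sketch above), scaled by 1/(n − 1) so that every term lies in [0, 1] as Hoeffding requires; uniformly sampling sources is one standard instantiation of "compute SPs only from some randomly chosen nodes", with textbook constants, not tuned ones.

    import math
    import random
    from collections import deque

    def single_source_dependencies(adj, s):
        # BFS shortest-path DAG from s, then the backward accumulation,
        # returning delta_s(v) = sum over t of sigma_st(v) / sigma_st.
        dist, sigma = {s: 0}, {s: 1}
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    sigma[w] = 0
                    q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
        return delta

    def bc_sample_sources(adj, eps, delta_conf):
        nodes = list(adj)
        n = len(nodes)
        # Hoeffding + union bound over the n per-node estimates:
        r = math.ceil(math.log(2 * n / delta_conf) / (2 * eps ** 2))
        est = {v: 0.0 for v in nodes}
        for _ in range(r):
            s = random.choice(nodes)
            dep = single_source_dependencies(adj, s)
            for v in nodes:
                if v != s:
                    est[v] += dep.get(v, 0.0) / (n - 1)  # term in [0, 1]
        return {v: est[v] / r for v in est}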
36. How many samples do we need to obtain an (ε, δ)-approximation?
Theory: Vapnik–Chervonenkis (VC) Dimension
Developed to evaluate supervised learning classifiers.
We twisted it to work in a non-supervised graph mining problem.
“The most practical theory ever” – Me, right now
Need O((1/ε²)(log diam(G) + log(1/δ))) samples.
Decreased sample size exponentially on small-world networks.
Comments
Practice: Great improvement, but still too many samples.
Theory: Graphs with the same diameter are not equally “hard”.
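For a quick numerical feel, the two sample-size formulas can be compared side by side. The union-bound formula is the standard Hoeffding one from the previous slide; in the VC-based one, the constant c and the additive terms are simplified stand-ins for the precise statement in the underlying paper.

    import math

    def samples_union_bound(n, eps, delta):
        # Hoeffding + union bound over n per-node estimates.
        return math.ceil(math.log(2 * n / delta) / (2 * eps ** 2))

    def samples_vc(diam, eps, delta, c=0.5):
        # Diameter-based VC bound; c and the floor(log2 diam) + 1 term
        # are illustrative simplifications, not the paper's constants.
        return math.ceil((c / eps ** 2) *
                         (math.floor(math.log2(diam)) + 1 +
                          math.log(1 / delta)))

On small-world graphs the diameter is tiny compared to |V|, which is where the exponential decrease in sample size comes from.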
39. How many samples do we need to obtain an (ε, δ)-approximation?
Theory: Progressive Sampling + Rademacher Averages
Let’s start sampling, and use the sample to decide when to stop.
Stop when η_i ≤ ε, where
η_i = 2 min_{t ∈ R+} (1/t) ln Σ_{(r,C)∈T} e^{t²r²/(2S_i²)} + 3 √((i + 1) ln(2/δ) / (2S_i))
Comments
Practice: Getting closer to the empirical bound.
Theory: Proving stuff is getting complicated (isn’t that good?)
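Stripped of the specific bound, the scheme is a short loop: enlarge the sample, recompute the sample-dependent η_i, and stop once η_i ≤ ε. Both draw_sample and eta_bound below are caller-supplied placeholders (the η_i above is one instantiation); only the stopping schedule is shown.

    def progressive_sampling(draw_sample, eta_bound, eps, batch=1000):
        # draw_sample(k): returns k fresh samples.
        # eta_bound(sample): sample-dependent deviation bound eta_i.
        sample = []
        while True:
            sample.extend(draw_sample(batch))
            if eta_bound(sample) <= eps:
                return sample
            batch *= 2  # geometric schedule keeps the overhead small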
42. Theory + Practice:
Get rid of “theoretical elegance” while maintaining correctness.
Let
g_S(x, y) = 2 exp(−2x²(y − 2R_F(S))²) + exp(−((1 − x)y + 2xR_F(S)) · φ(2R_F(S)/((1 − x)y + 2xR_F(S)) − 1)).
Then compute
min_{x,ξ} ξ
s.t. g_S(x, ξ) ≤ η,
ξ ∈ (2R_F(S), 1],
x ∈ (0, 1),
and check if ξ < ε.
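Operationally this is a tiny two-variable feasibility search; a naive grid-search sketch is below, with g_S passed in by the caller (its exact form, including φ, is the one on this slide) and the grid resolutions chosen arbitrarily.

    import numpy as np

    def min_feasible_xi(g_S, R_F, eta):
        # Minimize xi subject to g_S(x, xi) <= eta, with x in (0, 1) and
        # xi in (2*R_F, 1], by brute force (illustrative only).
        best = None
        for x in np.linspace(0.01, 0.99, 99):
            for xi in np.linspace(2 * R_F + 1e-6, 1.0, 200):
                if g_S(x, xi) <= eta:
                    # xi grid is increasing, so the first feasible xi
                    # is the minimum for this x.
                    if best is None or xi < best:
                        best = xi
                    break
        return best  # accept the estimates if best is not None and best < eps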
44. To be a data scientist, you need to get your hands dirty in data.
To be an algorithmic data scientist, you need to get your hands dirty in data and in theory.
46. 1) Embrace data science
2) Combine theory and practice
47. 1) Embrace data science
2) Combine theory and practice
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda
WWW: http://matteo.rionda.to
48. This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect significant assumptions and the subjective judgment of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made as to the accuracy of such information, and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.