We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data such as a sketch or a sample of the data. The lectures will cluster around three themes
Nearest Neighbor Search (similarity search): the general problem is, given a set of objects (e.g., images), to construct a data structure so that later, given a query object, one can efficiently find the most similar object from the database.
Streaming framework: we are required to solve a certain problem on a large collection of items that one streams through once (i.e., algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1Mb memory estimate the number of different IPs it sees in a multi-gigabytes long real-time traffic?
Parallel framework: we look at problems where neither the data or the output fits on a machine. For example, given a set of 2D points, how can we compute the minimum spanning tree over a cluster of machines.
The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.