The talk presents the evolution of Big-Data systems from single-purpose MapReduce frameworks to fully general computational infrastructures. In particular, I will follow the evolution of Hadoop, and show the benefits and challenges of a new architectural paradigm that decouples the resource management component (YARN) from the specifics of the application frameworks (e.g., MapReduce, Tez, REEF, Giraph, Naiad, Dryad, Spark). We argue that, besides the primary goals of increasing scalability and programming model flexibility, this transformation dramatically facilitates innovation.
In this context, I will present some of our contributions to the evolution of Hadoop (namely, work-preserving preemption and predictable resource allocation), and comment on the fascinating experience of working on open-source technologies from within Microsoft. The current Hadoop APIs (HDFS and YARN) provide the cluster equivalent of an OS API. With this as a backdrop, I will present our attempt to create the equivalent of stdlib for the cluster: the REEF project.
Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL leading the Relational Cloud project. He worked at Yahoo! Research as a Research Scientist focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the Cloud and Information Services Lab (CISL), where he is working on big-data platforms and cloud computing.