The Berkeley Data Analytics Stack (BDAS) aims to address emerging challenges in data analysis through a set of systems, including Spark, Shark and Mesos, that enable faster and more powerful analytics. In this talk, we’ll cover two recent additions to BDAS:
* Spark Streaming is an extension of Spark that enables high-speed, fault-tolerant stream processing through a high-level API. It uses a new processing model called “discretized streams” to enable fault-tolerant stateful processing with exactly-once semantics, without the costly transactions required by existing systems. This lets applications process much higher rates of data per node. It also makes programming streaming applications easier by providing a set of high-level operators on streams (e.g. maps, filters, and windows) in Java and Scala.
* Shark is a Spark-based data warehouse system compatible with Hive. It can answer Hive QL queries up to 100 times faster than Hive without modification to existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions. It employs a number of novel and traditional database optimization techniques, including column-oriented storage and mid-query replanning, to efficiently execute SQL on top of Spark. The system is in early use at companies including Yahoo! and Conviva.