SQL on Hadoop: Defining the New Generation of Analytics Databases
The analytics and data warehousing industries are in the midst of a major period of transformation. Since the publication of Google's MapReduce paper, we have witnessed the appearance of Apache Hadoop, followed by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors. This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, these new systems share important architectural details that distinguish them from the previous generation of analytic databases. In this talk, we will discuss the performance limitations of the connector-based approach employed by many established vendors and explain the long-term significance of Apache Hive's data model. Then, we will unravel the novel architectural features common to next-generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.