Variety is the spice of life, but it’s also the reality of big data. For this reason, JSON has now becoming lingua franca of data in the internet – for APIs, data exchange, data storage and data processing. In the business intelligence world, SQL is the language to analyze the data in other forms. Hence, the myriad of “SQL-on-Hadoop” projects. However, traditional SQL isn’t JSON/Parquet/etc. friendly. ETL into flattened tables is costly and not real time.
Apache Drill unifies SQL with variety of data forms on Hadoop. That enables interactive analytics using your favorite BI tool and visualization tool on you data simultaneously. In this talk, we’ll introduce Apache Drill and describe use cases.
- See more at: http://nosql2014.dataversity.net/sessionPop.cfm?confid=81&proposalid=6850#sthash.NhuLz6Dq.dpuf
With other technologies you have to do this, then this, then this, …
TODO: Add Impala and Splunk logos
Need an example or analogy to explain self-describing data.
All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON (with additional types) documents. Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.
If you consider the four data models shown in the 2x2, all models can be represented by the complex, no schema model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be available to queries.