Big Data SQL Support in Apache Apex / Hadoop

Presenter: Chinmay Kolhatkar is a Software Development Engineer at DataTorrent and Committer at Apache Apex

Abstract: This talk covers the SQL support available in Apache Apex. Apex uses Apache Calcite, another open source project, for query planning and optimization. Calcite transforms the SQL query into relational algebra, and Apex internally uses its production-ready connectors and processes to build a pipeline from the result. With SQL support in Apex, one can register tables from multiple types of sources (e.g. Kafka, file, JDBC, custom) and build complete pipelines using simple INSERT and SELECT statements. This feature is useful for those who know SQL and would like to use it in the streaming and batch world.
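
As a rough illustration of what an INSERT ... SELECT ... WHERE statement means when applied to a sequence of rows, here is a minimal plain-Java sketch. It uses no Apex or Calcite classes; the `Order` type, the `insertSelect` method, and the sample data are hypothetical names introduced only for this example, and the real pipeline is generated by Apex rather than written by hand like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only: no Apex or Calcite classes are used here.
final class InsertSelectSketch {

    // Hypothetical row type standing in for one tuple of a registered source table.
    static final class Order {
        final int id;
        final String product;

        Order(int id, String product) {
            this.id = id;
            this.product = product;
        }
    }

    // Mimics: INSERT INTO sink SELECT product FROM source
    //         WHERE id > minId AND product LIKE '<prefix>%'
    static List<String> insertSelect(List<Order> source, int minId, String prefix) {
        List<String> sink = new ArrayList<>();
        for (Order row : source) {
            // WHERE clause: keep only rows that satisfy both predicates.
            boolean matches = row.id > minId && row.product.startsWith(prefix);
            if (matches) {
                // SELECT projects one column; INSERT appends it to the sink.
                sink.add(row.product);
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order(1, "paint-red"),
            new Order(4, "paint-blue"),
            new Order(5, "brush"));
        System.out.println(insertSelect(orders, 3, "paint")); // prints [paint-blue]
    }
}
```

In a streaming engine the same filter and projection steps would run continuously on tuples as they arrive, instead of over a finite in-memory list.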

Bio: Chinmay Kolhatkar is a Software Development Engineer at DataTorrent and a Committer at Apache Apex. He has worked at DataTorrent for the past 2 years; before that, he worked on cloud technology at Intel Security (McAfee).

Learn more about Apex and DataTorrent: https://www.datatorrent.com/apex/

Published in: Technology

Big Data SQL Support in Apache Apex / Hadoop

  1. SQL on Apache Apex. Chinmay Kolhatkar (chinmay@apache.org), January 24th 2017
  2. Apache Apex - Stream Processing
       • Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications
       • Highly Scalable - Scales statically as well as dynamically
       • Highly Performant - Can reach single-digit millisecond end-to-end latency
       • Fault Tolerant - Automatically recovers from failures without manual intervention
       • Stateful - Guarantees that no state will be lost
       • Apex Malhar library
       • YARN Native - Uses the Hadoop YARN framework for resource negotiation
  3. Apex Platform Overview
  4. An Apex Application is a DAG (Directed Acyclic Graph)
       • A DAG is composed of vertices (Operators) and edges (Streams).
       • A Stream is a sequence of data tuples which connects operators at end-points called Ports.
       • An Operator takes one or more input streams, performs computations, and emits one or more output streams.
       • Each operator is the user's business logic or a built-in operator from our open source library.
       • An operator may have multiple instances that run in parallel.
  5. Typical application example
  6. Brief about SQL
       • 1969 - CODASYL (network database)
       • 1979 - First commercial SQL RDBMSs
       • 1990 - Acceptance - transaction processing on SQL
       • 1993 - Multi-dimensional databases
       • 1996 - SQL EDWs
       • 2006 - Hadoop and other “big data” technologies
       • 2008 - NoSQL
       • 2011 - SQL on Hadoop
       • 2014 - Interactive analytics on {Hadoop, NoSQL, RDBMS} using SQL
     SQL remains popular. Why?
     Source: “SQL on everything, in memory” by Julian Hyde, Strata NYC, Oct 16 2014
  7. Brief about Calcite: Traditional Architecture (“SQL on everything, in memory” by Julian Hyde, Strata NYC, Oct 16 2014)
  8. Brief about Calcite: Calcite Architecture (“SQL on everything, in memory” by Julian Hyde, Strata NYC, Oct 16 2014)
  9. Brief about Calcite: Expression Tree (“SQL on everything, in memory” by Julian Hyde, Strata NYC, Oct 16 2014)
  10. Brief about Calcite: Expression Tree (Optimized) (“SQL on everything, in memory” by Julian Hyde, Strata NYC, Oct 16 2014)
  11. Apex-Calcite API
      [Diagram: pipeline with the operators Kafka Input, CSV Parser, Filter, Project, CSV Formatter, and Line Writer]

          SQLExecEnvironment.getEnvironment()
              // Source table: CSV messages read from a Kafka topic
              .registerTable("ORDERS", new KafkaEndpoint(conf.get("broker"), conf.get("topic"),
                  new CSVMessageFormat(conf.get("schemaInDef"))))
              // Destination table: CSV lines written to a file
              .registerTable("SALES", new FileEndpoint(conf.get("destFolder"), conf.get("destFileName"),
                  new CSVMessageFormat(conf.get("schemaOutDef"))))
              // User-defined function callable from SQL as APEXCONCAT
              .registerFunction("APEXCONCAT", this.getClass(), "apex_concat_str")
              .executeSQL(dag, "INSERT INTO SALES " +
                  "SELECT STREAM ROWTIME, FLOOR(ROWTIME TO DAY), APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7)) " +
                  "FROM ORDERS WHERE ID > 3 AND PRODUCT LIKE 'paint%'");
  12. Demo
  13. Resources
       • Apache Apex - http://apex.apache.org/
       • Subscribe to forums
           ◦ Apex - http://apex.apache.org/community.html
           ◦ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
       • Download - https://datatorrent.com/download/
       • Twitter
           ◦ @ApacheApex; Follow - https://twitter.com/apacheapex
           ◦ @DataTorrent; Follow - https://twitter.com/datatorrent
       • Meetups - http://meetup.com/topics/apache-apex
       • Webinars - https://datatorrent.com/webinars/
       • Videos - https://youtube.com/user/DataTorrent
       • Slides - http://slideshare.net/DataTorrent/presentations
       • Startup Accelerator - Free full featured enterprise product
           ◦ https://datatorrent.com/product/startup-accelerator/
       • Big Data Application Templates Hub - https://datatorrent.com/apphub
  14. We Are Hiring
       • jobs@datatorrent.com
       • Developers/Architects
       • QA Automation Developers
       • Information Developers
       • Build and Release
       • Community Leaders
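
Slide 11 registers a function under the SQL name APEXCONCAT by pointing at a method called `apex_concat_str`, but the transcript does not show that method. The following is a minimal sketch of what such a method might look like, assuming it is a public static method that takes two strings and returns their concatenation. The class name `SqlFunctionsSketch` and the `sqlSubstring` helper are hypothetical and exist only to make the example self-contained; the helper approximates the 1-based SQL `SUBSTRING(str, start, length)` for `start >= 1`.

```java
// Hypothetical holder class; on the slide the method is looked up on this.getClass().
final class SqlFunctionsSketch {

    // Assumed shape of the method named in registerFunction(...).
    // The slide does not show its body, so simple concatenation is a guess.
    public static String apex_concat_str(String s1, String s2) {
        return s1 + s2;
    }

    // Simplified stand-in for SQL SUBSTRING(s, start, length).
    // SQL positions are 1-based; this sketch assumes start >= 1 and length >= 0.
    static String sqlSubstring(String s, int start, int length) {
        int from = Math.min(start - 1, s.length());
        int to = Math.min(from + length, s.length());
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String product = "paint-crimson"; // sample value, not from the talk
        // Mirrors the expression APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7)).
        System.out.println(apex_concat_str("OILPAINT", sqlSubstring(product, 6, 7)));
    }
}
```

Keeping the function static and free of side effects means it can be called on each tuple independently, which suits an operator that may run as several parallel instances.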
