Building Streaming Applications with Apache Apex @ Big Data Europe 2016


Published on

Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.

Thomas Weise is Apache Apex PMC member and architect/co-founder at DataTorrent.

Chinmay Kolhatkar is a Committer for Apache Apex and a Software Engineer at DataTorrent

If you'd like to learn more about Apache Apex and DataTorrent, feel free to schedule a 30-minute consultation with our team here:

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Building Streaming Applications with Apache Apex @ Big Data Europe 2016

  1. 1. Building Streaming Applications with Apache Apex Chinmay Kolhatkar, Committer @ApacheApex, Engineer @DataTorrent Thomas Weise, PMC Chair @ApacheApex, Architect @DataTorrent Nov 15th 2016
  2. 2. Agenda 2 • Application Development Model • Creating Apex Application - Project Structure • Apex APIs •Configuration Example •Overview of Operator Library •Frequently used Connectors •Stateful Transformation & Windowing •Scalability - Partitioning •End-to-end Exactly Once
  3. 3. Application Development Model 3 ▪Stream is a sequence of data tuples ▪Operator takes one or more input streams, performs computations & emits one or more output streams • Each Operator is YOUR custom business logic in java, or built-in operator from our open source library • Operator has many instances that run in parallel and each instance is single-threaded ▪Directed Acyclic Graph (DAG) is made up of operators and streams Directed Acyclic Graph (DAG) Output Stream Tupl e Tupl e er Operator er Operator er Operator er Operator er Operator er Operator
  4. 4. Creating Apex Application Project 4
  5. 5. Apex Application Project Structure 5 • pom.xml •Defines project structure and dependencies • •Defines the DAG • •Sample Operator •properties.xml •Contains operator and application properties and attributes • •Sample test to test application in local mode
  6. 6. Apex APIs: Compositional (Low level) 6 Input Parser Counter Output CountsWordsLines Kafka Database Filter Filtered
  7. 7. Apex APIs: Declarative (High Level) 7 File Input Parser Word Counter Console Output CountsWordsLines Folder StdOut StreamFactory.fromFolder("/tmp") .flatMap(input -> Arrays.asList(input.split(" ")), name("Words")) .window(new WindowOption.GlobalWindow(), new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1)) .countByKey(input -> new Tuple.PlainTuple<>(new KeyValPair<>(input, 1L)), name("countByKey")) .map(input -> input.getValue(), name("Counts")) .print(name("Console")) .populateDag(dag);
  8. 8. Apex APIs: SQL 8 Kafka Input CSV Parser Filter CSV Formattter FilteredWordsLines Kafka File Project Projected Line Writer Formatted SQLExecEnvironment.getEnvironment() .registerTable("ORDERS", new KafkaEndpoint(conf.get("broker"), conf.get("topic"), new CSVMessageFormat(conf.get("schemaInDef")))) .registerTable("SALES", new FileEndpoint(conf.get("destFolder"), conf.get("destFileName"), new CSVMessageFormat(conf.get("schemaOutDef")))) .registerFunction("APEXCONCAT", this.getClass(), "apex_concat_str") .executeSQL(dag, "INSERT INTO SALES " + "SELECT STREAM ROWTIME, FLOOR(ROWTIME TO DAY), APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7) " + "FROM ORDERS WHERE ID > 3 AND PRODUCT LIKE 'paint%'");
  9. 9. Apex APIs: Beam 9 • Apex Runner of Beam is available!! •Build once run-anywhere model •Beam Streaming applications can be run on apex runner: public static void main(String[] args) { Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class); // Run with Apex runner options.setRunner(ApexRunner.class); Pipeline p = Pipeline.create(options); p.apply("ReadLines", TextIO.Read.from(options.getInput())) .apply(new CountWords()) .apply(MapElements.via(new FormatAsTextFn())) .apply("WriteCounts",; .run().waitUntilFinish(); }
  10. 10. Apex APIs: SAMOA 10 •Build once run-anywhere model for online machine learning algorithms •Any machine learning algorithm present in SAMOA can be run directly on Apex. •Following example does classification of input data from HDFS using VHT algorithm on Apex: ᵒ bin/samoa apex ../SAMOA-Apex-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -l (classifiers.trees.VerticalHoeffdingTree -p 1) -s (org.apache.samoa.streams.ArffFileStream -s HDFSFileStreamSource -f /tmp/user/input/covtypeNorm.arff)"
  11. 11. Configuration Example (properties.xml) 11 Input Parser Counter Output CountsWordsLines Kafka Database Filter Filtered
  12. 12. Operator APIs 12 Input Adapters - Starting of the pipeline. Interacts with external system to generate stream Generic Operators - Processing part of pipeline Output Adapters - Last operator in pipeline. Interacts with external system to finalize the processed stream
  13. 13. Overview of Operator Library (Malhar) 13 RDBMS • JDBC • MySQL • Oracle • MemSQL NoSQL • Cassandra, HBase • Aerospike, Accumulo • Couchbase/ CouchDB • Redis, MongoDB • Geode Messaging • Kafka • JMS (ActiveMQ etc.) • Kinesis, SQS • Flume, NiFi File Systems • HDFS/ Hive • Local File • S3 Parsers • XML • JSON • CSV • Avro • Parquet Transformations • Filters, Expression, Enrich • Windowing, Aggregation • Join • Dedup Analytics • Dimensional Aggregations (with state management for historical data + query) Protocols • HTTP • FTP • WebSocket • MQTT • SMTP Other • Elastic Search • Script (JavaScript, Python, R) • Solr • Twitter
  14. 14. Frequently used Connectors Kafka Input 14 • •Based on 0.9 version of kafka-client •Emits byte[] for the message received from topic •Can read from multiple partitions and multiple brokers •Supports Idempotency •Automatic partitioning based on topic metadata • •Based on 0.8 version of kafka-client •Emits byte[] for the message received from topic •Can read from single partition but multiple brokers •Supports Idempotency •Supports partitioning on a single topic
  15. 15. Frequently used Connectors Kafka Output 15 • •From malhar-kafka library •Based on 0.9 version of kafka-client •Can read write to multiple partitions and multiple brokers •Support Exactly once -Upstream operators should be idempotent •Automatic partitioning based on topic metadata • •From malhar-contrib library •Based on 0.8 version of kafka-client •Can write to multiple topics and multiple brokers •Supports partitioning
  16. 16. Important Connectors File Input 16 •AbstractFileInputOperator •Used to read a file from source and emit the content of the file to downstream operator •Operator is idempotent •Supports Partitioning •Few Concrete Impl •FileLineInputOperator •AvroFileInputOperator •ParquetFilePOJOReader
  17. 17. Important Connectors File Output 17 •AbstractFileOutputOperator •Writes data to a file •Supports Partitions •Support Exactly Once •Upstream operators should be idempotent •Few Concrete Impl •StringFileOutputOperator
  18. 18. Windowing Support 18 • Streaming (Processing) Windows •Finite time sliced windows based on processing (event arrival) time •Used for bookkeeping of streaming application •Derived Windows are: Checkpoint Windows, Committed Windows • Event-time Windows •Computation based on event-time present in the tuple •Types of event-time windows supported: -Global : Single event-time window throughout the lifecycle of application -Timed : Tuple is assigned to single, non-overlapping, fixed width windows immediately followed by next window -Sliding Time : Tuple is can be assigned to multiple, overlapping fixed width windows. -Session : Tuple is assigned to single, variable width windows with a predefined min gap
  19. 19. Stateful Windowed Processing 19 • WindowedOperator from malhar-library •Used to process data based on Event time as contrary to ingression time •Supports windowing semantics of Apache Beam model •Supported features: •Watermarks •Allowed Lateness •Accumulation •Accumulation Modes: Accumulating, Discarding, Accumulating & Retracting •Triggers •Storage •In memory based •Managed State based
  20. 20. Stateful Windowed Processing Compositional API 20
  21. 21. Stateful Windowed Processing Declarative API 21 StreamFactory.fromFolder("/tmp") .flatMap(input -> Arrays.asList(input.split(" ")), name("ExtractWords")) .map(input -> new TimestampedTuple<>(System.currentTimeMillis(), input), name("AddTimestampFn")) .window(new TimeWindows(Duration.standardMinutes(WINDOW_SIZE)), new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1)) .countByKey(input -> new TimestampedTuple<>(input.getTimestamp(), new KeyValPair<>(input.getValue(), 1L))), name("countWords")) .map(new FormatAsTableRowFn(), name("FormatAsTableRowFn")) .print(name("console")) .populateDag(dag);
  22. 22. •Useful for low latency and high throughput •Replicates (Partitions) the logic •Configured at launch time ( or properties.xml) •StreamCodec •Used for controlling distribution of tuples to downstream partitions •Unifier •Passthrough unifier added by platform to merge results from upstream partitions •Can also be customized •Type of partitions •Static partitions - Statically partition at launch time •Dynamic partitions - Partitions changing at runtime based on latency and/or throughput •Parallel partitions - Upstream and downstream operators using same partition scheme Scalability - Partitioning 22
  23. 23. Scalability - Partitioning (contd.) 23 0 1 2 3 Logical DAG 0 1 2 U Physical DAG 1 1 2 2 3 Parallel Partitions M x N Partitions OR Shuffle <configuration> <property> <name>dt.operator.1.attr.PARTITIONER</name> <value>com.datatorrent.common.partitioner.StatelessPartitioner:3</value> </property> <property> <name>dt.operator.2.port.inputPortName.attr.PARTITION_PARALLEL</name> <value>true</value> </property> </configuration>
  24. 24. End-to-End Exactly-Once 24 Input Counter Store Aggregate CountsWords Kafka Database ● Input ○ Uses com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator ○ Emits words as a stream ○ Operator is idempotent ● Counter ○ Uses com.datatorrent.lib.algo.UniqueCounter ○ Counts across Application Window ○ Operator is idempotent ● Store ○ Uses CountStoreOperator ○ Inserts into JDBC ○ Operator is exactly-once (End-To-End Exactly Once = At least Once + Idempotency + States) ● Source:
  25. 25. End-to-End Exactly-Once (Contd.) 25 Input Counter Store Aggregate CountsWords Kafka Database public static class CountStoreOperator extends AbstractJdbcTransactionableOutputOperator<KeyValPair<String, Integer>> { public static final String SQL = "MERGE INTO words USING (VALUES ?, ?) I (word, wcount)" + " ON (words.word=I.word)" + " WHEN MATCHED THEN UPDATE SET words.wcount = words.wcount + I.wcount" + " WHEN NOT MATCHED THEN INSERT (word, wcount) VALUES (I.word, I.wcount)"; @Override protected String getUpdateCommand() { return SQL; } @Override protected void setStatementParameters(PreparedStatement statement, KeyValPair<String, Integer> tuple) throws SQLException { statement.setString(1, tuple.getKey()); statement.setInt(2, tuple.getValue()); } }
  26. 26. Resources for the use cases 26 • Pubmatic • • GE • • apache-apex-hadoop • SilverSpring Networks • • silver-spring-networks
  27. 27. Resources 27 • Apache Apex - • Subscribe - • Download - • Twitter ᵒ @ApacheApex; Follow - ᵒ @DataTorrent; Follow – • Meetups - • Webinars - • Videos - • Slides - • Startup Accelerator Program - Full featured enterprise product ᵒ
  28. 28. We Are Hiring 28 • • Developers/Architects • QA Automation Developers • Information Developers • Build and Release • Community Leaders
  29. 29. Q&A 29