GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
This presentation introduces the use of Apache Apex for the Time Series and Data Ingestion services of General Electric's Predix Internet of Things platform. Apache Apex is a native Hadoop data-in-motion platform used by customers for both streaming and batch processing. Common use cases include ingestion into Hadoop, streaming analytics, ETL, database off-loads, alerts and monitoring, and machine model scoring.

Abstract: Predix is a General Electric platform for the Internet of Things. It helps users develop applications that connect industrial machines with people through data and analytics for better business outcomes. Predix offers a catalog of services that provide core capabilities required by industrial internet applications. We will deep dive into the Predix Time Series and Data Ingestion services, which leverage the fast, scalable, highly performant, and fault tolerant capabilities of Apache Apex.

Speakers:

- Venkatesh Sivasubramanian, Sr Staff Software Engineer, GE Predix & Committer of Apache Apex

- Pramod Immaneni, PPMC member of Apache Apex and Architect at DataTorrent

1. Predix Time Series with Apache Apex
2. Hello!
Venkat: Predix Data Services, GE Digital; Big Data & Analytics @WalmartLabs
Pramod: Senior Architect, DataTorrent Inc.; Apex PPMC Member
3. Quick Survey
4. Outline
▪ Predix Platform Overview
▪ Predix Time Series
▪ Apache Apex
▪ Stream Processing with Apex – Journey and Learning
▪ Demo
▪ Q & A
5. Predix Platform
▪ Platform for the Industrial Internet
▪ Based on Cloud Foundry
▪ Provides a rich set of services for rapid development
▪ Managed and secured infrastructure
▪ Marketplace for services
6. Want big impact? Use big image.
7. Predix Platform Architecture
8. Who we are? Team Data Services. We love Java and Go, distributed systems, and big & fast data. We are hiring!
9. Predix Time Series Overview
▪ Streaming ingestion
▪ Efficient storage
▪ Indexing the data for quick retrieval
▪ Guaranteed data processing
▪ Highly available and scalable
▪ Millisecond data point precision
▪ Support for string and numeric values
▪ Secured access
10. Predix Time Series Architecture
11. Predix Time Series API
▪ Supports interpolation
▪ Aggregations (percent, avg, sum, count)
▪ Filter by attributes, quality and value
▪ Support for Limit and Order By
▪ Both GET and POST to retrieve data points
▪ Sub-second query performance
Sample query:
{
  "tags": [
    {
      "name": ["WIND_SPEED"],
      "filters": {
        "attributes": { "farm": ["CA"] }
      },
      "limit": 1000,
      "groups": { }
    }
  ]
}
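For illustration, here is a minimal Java sketch of issuing the sample query over HTTP. The endpoint host/path, the Predix-Zone-Id header, and the token handling are assumptions based on typical Predix service conventions rather than details from the slides; the real values come from your Time Series instance credentials.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TimeSeriesQuery {
  public static void main(String[] args) throws Exception {
    // Assumed endpoint; obtain the actual URI from your service binding.
    URL url = new URL("https://time-series.example.predix.io/v1/datapoints");
    String query = "{\"tags\":[{\"name\":[\"WIND_SPEED\"],"
        + "\"filters\":{\"attributes\":{\"farm\":[\"CA\"]}},\"limit\":1000}]}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    // OAuth2 bearer token and zone id; environment variable names are assumed.
    conn.setRequestProperty("Authorization", "Bearer " + System.getenv("UAA_TOKEN"));
    conn.setRequestProperty("Predix-Zone-Id", System.getenv("ZONE_ID"));
    conn.getOutputStream().write(query.getBytes(StandardCharsets.UTF_8));

    try (InputStream in = conn.getInputStream()) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
  }
}
```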
12. Predix Time Series: Get Started
▪ Sign up @ Predix.io
▪ Create a Time Series instance
▪ Bind it to an application
▪ Get credentials and connect your device
▪ Query the data
13. Apache Apex Overview
▪ Streaming analytics platform
▪ Event based, low latency
▪ Scalable and highly available
▪ Managed state
▪ Library of pre-built operators
14. Apex Platform
15. Stream Processing
Example: find the Top K engines with high/low oil pressure.
[Diagram: Events Reader → Filter Operator (partitioned) → "Top K" Operator (partitioned) → unified stream → Datastore Writer; streams may be local or remote.]
16. Windowing Support
▪ Application window
▪ Sliding/Tumbling window
▪ Checkpoint window
▪ No artificial latency
17. Application Specification
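The application specification image did not survive extraction, so here is a minimal sketch of what such a specification looks like with the Apex DAG API, wired to match the slide-15 pipeline. The operator classes (EventsReader, FilterOperator, TopKOperator, DatastoreWriter) are hypothetical stand-ins for the real ones.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

@ApplicationAnnotation(name = "TopKEnginesByOilPressure")
public class Application implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // Hypothetical operators standing in for the real pipeline stages.
    EventsReader reader = dag.addOperator("reader", new EventsReader());
    FilterOperator filter = dag.addOperator("filter", new FilterOperator());
    TopKOperator topK = dag.addOperator("topK", new TopKOperator());
    DatastoreWriter writer = dag.addOperator("writer", new DatastoreWriter());

    // Streams connect an output port to one or more input ports.
    dag.addStream("events", reader.output, filter.input);
    dag.addStream("filtered", filter.output, topK.input);
    dag.addStream("topk", topK.output, writer.input);
  }
}
```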
18. Why Apache Apex
Development:
▪ High performance and distributed
▪ Dynamic partitions
▪ Rich operator library
▪ Support for at-least-once, at-most-once and exactly-once processing semantics
Operations:
▪ Hadoop/YARN compatibility
▪ Fault tolerance and platform stability
▪ Ease of deployment and operability
▪ Enterprise grade security
19. Time Series DAG (skimmed version of the DAG)
20. Partitioning Strategies
21. Logical vs. Physical DAG
[Diagram: the logical DAG is Input Operator → Detection Operator → Output Operator; in the physical DAG the Detection Operator runs as multiple partitions whose outputs are combined by a Unifier Operator before reaching the Output Operator.]
22. Stream Split
▪ Utilize hashcode and mask to determine the partition
▪ The mask picks the last n bits of the hashcode of the tuple
▪ A StreamCodec can be used to specify a custom hashcode (sketched below)
▪ A custom partitioner can be used to change the default mapping
Example: tuple { Sensor, 98871231, 34, GOOD }, hashcode ...001010100010101. The mask (0x11) keeps the last two bits:
00 → partition 1
01 → partition 2
10 → partition 3
11 → partition 4
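A minimal sketch of the StreamCodec bullet, assuming a hypothetical SensorTuple type; extending Malhar's KryoSerializableStreamCodec and overriding getPartition keys the stream split by sensor, with the engine applying the mask to the returned value.

```java
import com.datatorrent.lib.codec.KryoSerializableStreamCodec;

// SensorTuple is a hypothetical tuple type with a sensor name field.
public class SensorStreamCodec extends KryoSerializableStreamCodec<SensorTuple> {
  @Override
  public int getPartition(SensorTuple tuple) {
    // The engine masks the last n bits of this value to pick the partition,
    // so tuples from the same sensor always land on the same partition.
    return tuple.getSensorName().hashCode();
  }
}
```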
23. MxN Partitioning
[Diagram: multiple Input Operator partitions feeding multiple Detection Operator partitions, which feed Output Operator partitions.]
▪ Default mechanism
▪ StatelessPartitioner
<property>
  <name>dt.application.<streamingApp>.operator.<name>.attr.PARTITIONER</name>
  <value>com.datatorrent.common.partitioner.StatelessPartitioner:4</value>
</property>
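The same partitioner can also be set from code rather than the property file; a small sketch, assuming a detection operator variable inside populateDAG():

```java
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.common.partitioner.StatelessPartitioner;

// Ask the engine to run the detection operator as four partitions.
dag.setAttribute(detection, OperatorContext.PARTITIONER,
    new StatelessPartitioner<DetectionOperator>(4));
```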
24. Parallel Partitioning
[Diagram: each Input Operator partition feeds its own Detection Operator and Output Operator, forming independent parallel chains.]
<property>
  <name>dt.application.<streamApp>.operator.<name>.port.input.attr.PARTITION_PARALLEL</name>
  <value>true</value>
</property>
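Equivalently from code, a sketch that marks the downstream operator's input port as parallel-partitioned so it inherits the upstream operator's partitioning instead of being repartitioned:

```java
import com.datatorrent.api.Context.PortContext;

// Inside populateDAG(): detection runs one partition per input partition.
dag.setInputPortAttribute(detection.input, PortContext.PARTITION_PARALLEL, true);
```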
25. Unifier
▪ Combines outputs of multiple partitions
▪ Runs as an operator
▪ Logic depends on the operator functionality
▸ Example: if the operator computes an average, the unifier computes the final average from the individual averages and counts (see the sketch below)
▪ Default unifier if none specified
▪ Helps with skew
▪ Cascading unification possible if unification needs to be done in multiple stages
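A minimal sketch of the average example, assuming a hypothetical SumCount partial-result type emitted by each partition; an operator would attach it by overriding getUnifier() on its output port.

```java
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.Operator.Unifier;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical partial result from one partition: a sum and a count.
class SumCount {
  long sum;
  long count;
}

// Unifier that combines per-partition partials into the final average.
public class AverageUnifier extends BaseOperator implements Unifier<SumCount> {
  public final transient DefaultOutputPort<Double> output = new DefaultOutputPort<>();
  private long sum;
  private long count;

  @Override
  public void process(SumCount partial) {
    sum += partial.sum;     // combine partial sums
    count += partial.count; // combine partial counts
  }

  @Override
  public void endWindow() {
    if (count > 0) {
      output.emit((double) sum / count); // final average for the window
    }
    sum = 0;
    count = 0;
  }
}
```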
26. Custom Partitioning
▪ Custom stream splitting
▪ Distribution of state during initial or dynamic partitioning
▸ Kafka operators scale according to the number of Kafka partitions
▸ Re-distribution of state during dynamic partitioning
Example: tuple { Sensor, 98871231, 34, GOOD }, hashcode ...001010100010101. With mask (0x00) every tuple gets the same key (00), so the default hash split is disabled and the custom partitioner decides how tuples map to partitions 1 through 4.
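A skeleton of a custom Partitioner for the hypothetical DetectionOperator; definePartitions is where state from the old partitions would be redistributed to the new ones during initial or dynamic partitioning.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;

import com.datatorrent.api.DefaultPartition;
import com.datatorrent.api.Partitioner;

public class DetectionPartitioner implements Partitioner<DetectionOperator> {
  private static final int PARTITION_COUNT = 4;

  @Override
  public Collection<Partition<DetectionOperator>> definePartitions(
      Collection<Partition<DetectionOperator>> partitions, PartitioningContext context) {
    Collection<Partition<DetectionOperator>> newPartitions = new ArrayList<>();
    for (int i = 0; i < PARTITION_COUNT; i++) {
      // A real implementation would carry per-key state over from the
      // existing partitions instead of creating empty operator instances.
      newPartitions.add(new DefaultPartition<>(new DetectionOperator()));
    }
    return newPartitions; // the engine assigns partition keys if none are set
  }

  @Override
  public void partitioned(Map<Integer, Partition<DetectionOperator>> partitions) {
    // Called after deployment; nothing to track in this sketch.
  }
}
```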
27. Time Series DAG
28. Problems Encountered
Checkpointing is tied to the application id, so checkpointed state does not carry over to a relaunched application. This problem becomes pertinent if you are relying on that state to do further processing.
Solution: Store states that matter externally, e.g. in HDFS, ZooKeeper, or Redis.
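One way to keep such state outside the application-id-scoped checkpoint directory is a small ZooKeeper helper; a sketch using Apache Curator, with the class name and znode paths chosen arbitrarily for illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ExternalState {
  private final CuratorFramework client;

  public ExternalState(String zkConnect) {
    client = CuratorFrameworkFactory.newClient(zkConnect,
        new ExponentialBackoffRetry(1000, 3));
    client.start();
  }

  // Persist a value under a path that does not embed the application id,
  // so a relaunched application can read it back.
  public void save(String path, String value) throws Exception {
    byte[] data = value.getBytes(StandardCharsets.UTF_8);
    if (client.checkExists().forPath(path) == null) {
      client.create().creatingParentsIfNeeded().forPath(path, data);
    } else {
      client.setData().forPath(path, data);
    }
  }

  public String load(String path) throws Exception {
    return new String(client.getData().forPath(path), StandardCharsets.UTF_8);
  }
}
```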
29. Problems Encountered
The Kafka source was marking an offset as committed as soon as it was read. This becomes a problem when the message is not completely processed by the DAG.
Solution: The Kafka source was modified to wait until messages are entirely processed in the DAG. Thanks to the community! We also implemented an offset manager and stored the offsets in ZooKeeper.
30. Problems Encountered
Gracefully stopping the DAG during an upgrade, to get exactly-once semantics, when downstream systems cannot handle duplicates or support transactions.
Solution: Added a property to mute the source operators and drain the in-flight messages before bringing the streaming pipeline down. APIs are available for automation.
31. Problems Encountered
Event-time based processing and out-of-order data arrival.
Solution: We have built some spooling data structures, working with the Apex team. We are working to open source this.
32. Key Takeaways
▪ Upgradeability and tolerance for failure
▪ Monitoring the DAG for failures
▪ Static partitioning helps only so much
▪ Continuous integration and deployment
▪ Performance testing and benchmarking
▪ Ship and store logs
33. Fault Tolerance
34. Fault Tolerance
▪ Operator state is checkpointed to a persistent store
▸ Automatically performed by the engine; no additional work needed by the operator
▸ In case of failure, operators are restarted from the checkpointed state
▸ Frequency configurable per operator (see the sketch below)
▸ Asynchronous and distributed by default
▸ Default store is HDFS
▪ Automatic detection and recovery of failed operators
▸ Heartbeat mechanism
▪ Buffering mechanism ensures replay of data from the recovery point so that there is no loss of data
▪ Application master state is checkpointed
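The per-operator checkpoint frequency is an operator attribute; a small sketch of setting it in code (the value of 30 streaming windows is arbitrary):

```java
import com.datatorrent.api.Context.OperatorContext;

// Inside populateDAG(): checkpoint this operator every 30 streaming windows.
dag.setAttribute(detection, OperatorContext.CHECKPOINT_WINDOW_COUNT, 30);
```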
35. Message Processing Semantics
At-least once [1..n]
▪ On recovery, operator state is restored to a checkpoint
▪ Data is replayed from the checkpoint, so it is effectively a rewind
▪ Messages will not be lost
▪ Default mechanism; suitable for most applications
▪ End-to-end exactly once, i.e. data is written only once to the store in case of fault recovery, achieved via:
▸ Idempotent operations
▸ Rewinding output
▸ Writing meta information to the store in a transactional fashion
▸ Feedback from the external store on the last processed message
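To make the last bullets concrete, here is a minimal sketch of an output operator that builds end-to-end exactly-once on top of at-least-once replay. TxStore is a hypothetical transactional store client, not an Apex or Predix API; the pattern is to persist the last committed window id together with the data and skip replayed windows.

```java
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.common.util.BaseOperator;

public class ExactlyOnceWriter extends BaseOperator {
  private transient TxStore store;         // hypothetical transactional store client
  private transient long committedWindow;  // last window id the store committed
  private transient long currentWindow;

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>() {
    @Override
    public void process(String tuple) {
      // After recovery, tuples from already-committed windows are replayed;
      // skip them so each record is written exactly once.
      if (currentWindow > committedWindow) {
        store.buffer(tuple);
      }
    }
  };

  @Override
  public void setup(OperatorContext context) {
    store = new TxStore();
    committedWindow = store.lastCommittedWindow(); // feedback from the store
  }

  @Override
  public void beginWindow(long windowId) {
    currentWindow = windowId;
  }

  @Override
  public void endWindow() {
    if (currentWindow > committedWindow) {
      store.commit(currentWindow); // write buffered data + window id atomically
      committedWindow = currentWindow;
    }
  }
}
```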
36. Message Processing Semantics
At-most once [0,1]
▪ On recovery the latest data is made available to the operator
▪ Useful in use cases where some data loss is acceptable and the latest data is sufficient
Windowed exactly once [0,1]
▪ Operators are checkpointed every window
▪ Can be combined with transactional mechanisms to ensure end-to-end exactly-once behavior
37. Stream Locality
▪ By default, operators are deployed in containers (processes) randomly on different nodes across the Hadoop cluster
▪ Custom locality for streams:
▸ Rack local: data does not traverse network switches
▸ Node local: data is passed via the loopback interface, which frees up network bandwidth
▸ Container local: messages are passed via in-memory queues between operators and do not require serialization
▸ Thread local: messages are passed between operators in the same thread, equivalent to calling a subsequent function on the message
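Locality is declared on the stream itself; a sketch, reusing the hypothetical stream from the application sketch above:

```java
import com.datatorrent.api.DAG;

// Inside populateDAG(): keep reader and filter in one container so tuples
// move through an in-memory queue instead of being serialized over the network.
dag.addStream("events", reader.output, filter.input)
   .setLocality(DAG.Locality.CONTAINER_LOCAL);
```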
38. What happens during launch?
▪ User launches an application using the management console or command line client
▪ The DAG gets assembled on the client
▪ The DAG and dependency jars get saved to HDFS
▪ The App Master (StrAM) gets launched on a Hadoop node
▸ Converts the logical plan to a physical plan
▸ Figures out the execution plan
▸ Requests containers from Hadoop
▸ Launches StreamingContainer in individual containers with the relevant operators
39. Kafka Operator
▪ Supports both high level and low level API implementations
▪ Finer control of offsets for exactly-once semantics
▪ Supports ONE_TO_ONE and ONE_TO_MANY partition strategies
▪ Consume by size and number of messages
▪ Fault tolerant: recovers from stored offsets
40. Debugging Issues
▪ Distributed systems are hard to debug
▪ LocalMode comes in handy for developer testing and debugging
▪ Enable YARN log aggregation
▸ yarn logs -applicationId <App_Id>
▪ The DataTorrent web console provides streaming access to AppMaster and container logs
▪ Understanding what happens where:
▸ AppMaster
▸ NodeManager
▸ Containers
41. Demo
42. Thanks!! Any questions? You can find us at @venkyz and @pramod
43. Bulk Upload - DAG; Rule Based Alerting - DAG
