
# Introduction to Real-time data processing

These slides were designed for an Apache Hadoop + Apache Apex workshop (university program).

The audience was mainly third-year engineering students from the Computer, IT, Electronics, and Telecom disciplines.

I tried to keep it simple for beginners to understand. Some of the examples use context from India, but in general this should be a good starting point for beginners.

Advanced users/experts may not find this relevant.

Published in: Technology

### Introduction to Real-time data processing

1. Introduction to Real-time data processing
   Yogi Devendra (yogidevendra@apache.org)
2. Agenda
   ● What is big data?
   ● Data at rest vs. data in motion
   ● Batch processing vs. real-time data processing (streaming)
   ● Examples
   ● When to use: batch or real-time?
   ● Current trends
3. Big data (image ref [4])
4. Definition: big data
   Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. [1]
5. Exploding sizes of datasets
   ● Google
      ○ >100 PB of data every day [3]
   ● Large Hadron Collider
      ○ 150M sensors x 40M samples per sec x 600M collisions per sec
      ○ >500 exabytes per day [2]
      ○ Only 0.0001% of the data is actually analysed
6. Questions (image ref [16])
7. Data at rest vs. data in motion
   ● At rest:
      ○ Dataset is fixed
      ○ a.k.a. bounded [15]
   ● In motion:
      ○ Continuously incoming data
      ○ a.k.a. unbounded
8. Data at rest vs. data in motion (continued)
   ● Big data generally has velocity
      ○ Continuous data
   ● The difference lies in when you analyze your data [5]
      ○ After the event occurs ⇒ at rest
      ○ As the event occurs ⇒ in motion
9. Examples
   ● Data at rest
      ○ Finding stats about a group in a closed room
      ○ Analyzing last month's sales data to make strategic decisions
   ● Data in motion
      ○ Finding stats about a group in a marathon
      ○ E-commerce order processing
10. Questions (image ref [16])
11. Batch processing
    ● Problem statement:
       ○ Process this entire dataset
       ○ Give the answer for X at the end
12. Batch processing: use-cases
    ● Sales summary for the previous month [5]
    ● Model training for spam emails
13. Batch processing: characteristics
    ● Access to the entire dataset
    ● Splits are decided at launch time
    ● Capable of complex analysis (e.g. model training) [6]
    ● Optimized for throughput (data processed per sec)
    ● Example frameworks: MapReduce, Apache Spark [6]
14. Questions (image ref [16])
15. Real-time data processing
    ● a.k.a. stream processing
    ● Problem statement:
       ○ Process an incoming stream of data
       ○ Give the answer for X at this moment
16. Stream processing: use-cases
    ● E-commerce order processing
    ● Credit card fraud detection
    ● Labeling a given email as spam vs. non-spam
17. (Image ref [7])
18. Stream processing: characteristics
    ● Results for X are based on the current data
    ● Computes a function on one record or a small window of records [6]
    ● Optimized for latency (avg. time taken per record)
19. Stream processing: characteristics (continued)
    ● Computation must complete in near real-time
    ● Computes something relatively simple, e.g. using a pre-defined model to label a record
    ● Example frameworks: Apache Apex, Apache Storm
20. Questions (image ref [16])
22. Batch vs. streaming
    ● Pani puri ⇒ streaming (image ref [9])
    ● Wada ⇒ batch (image ref [8])
23. Questions (image ref [16])
24. Micro-batch
    ● Create batches of small size
    ● Process each micro-batch separately
    ● Example frameworks: Spark Streaming
    ● Pani puri ⇒ micro-batch (image ref [10])
25. When to use: batch vs. streaming?
    ● Depends on the use-case
       ○ Some are suitable for batch
       ○ Some are suitable for streaming
       ○ Some can be solved by either one
       ○ Some might need a combination of the two
26. When to use: batch vs. real-time? (continued)
    ● Answers for the current snapshot ⇒ real-time
       ○ Answers at the end ⇒ open (either approach works)
    ● Complex calculations, multiple iterations over the entire dataset ⇒ batch
       ○ Simple computations ⇒ open
    ● Low latency requirements (< 1 s) ⇒ real-time
27. When to use: batch vs. real-time? (continued)
    ● Each record can be processed independently ⇒ open
       ○ Independent processing not possible ⇒ batch
    ● Depends on the use-case
       ○ Some use-cases can be solved by either one
       ○ Some others might need a combination of the two
28. Questions (image ref [16])
29. Can one replace the other?
    ● Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if processed in batch mode.
    ● Real-time processing is designed for ‘data in motion’, but in many cases it can be used for ‘data at rest’ as well.
30. Questions (image ref [16])
31. Quiz: is this batch or real-time?
    ● Queue for a roller coaster ride (image ref [11])
    ● Queue at the petrol pump (image ref [12])
32. Quiz: is this batch or real-time?
    ● Selecting a relevant ad to show for the requested page
    ● Courier dispatch from city A to B
    (Image refs [13], [14])
33. Questions (image ref [16])
34. Current trends
    ● Problems are difficult to split into MapReduce form: alternative paradigms for expressing user intent
    ● More and more use-cases demand faster insight into data (near real-time)
    ● ‘Data in motion’ is common
    ● ‘Real-time data processing’ is gaining traction
35. Questions (image ref [16])
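
The contrast drawn in slides 11 to 19 (batch answers once at the end, streaming answers per record over a small window) can be sketched in a few lines of Python. This is only an illustrative sketch, not how Apache Apex, Apache Storm, or MapReduce are implemented: the function names `batch_average` and `streaming_average`, the window size, and the sample order amounts are all assumptions made for the example.

```python
from collections import deque
from typing import Iterable, Iterator, List


def batch_average(records: List[float]) -> float:
    """Batch: the entire bounded dataset is available; answer once at the end."""
    return sum(records) / len(records)


def streaming_average(records: Iterable[float], window: int = 3) -> Iterator[float]:
    """Streaming: emit an answer as each record arrives, using only the
    most recent `window` records (hypothetical window size)."""
    recent = deque(maxlen=window)  # older records fall out automatically
    for value in records:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical order amounts, used only for illustration.
order_amounts = [100.0, 200.0, 300.0, 400.0]

print(batch_average(order_amounts))            # 250.0
print(list(streaming_average(order_amounts)))  # [100.0, 150.0, 200.0, 300.0]
```

The batch function needs the whole list before it can answer, while the streaming generator produces a usable answer after every record and never holds more than `window` values in memory.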
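
The micro-batch idea from slide 24 (create small batches, process each one separately) can be sketched as follows. The helper names `micro_batches` and `process` are hypothetical, and grouping by record count is an assumption made to keep the example short; Spark Streaming itself forms micro-batches by time interval rather than by count.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut an incoming stream into small batches of at most `batch_size` records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # stream exhausted
            return
        yield batch


def process(batch: List[int]) -> int:
    """Stand-in for the per-batch computation (here: a simple sum)."""
    return sum(batch)


# Seven records arriving one after another, grouped three at a time.
results = [process(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
print(results)  # [6, 15, 7]
```

Each micro-batch is handled like a tiny batch job, so latency is bounded by how long a batch takes to fill and process rather than by the size of the whole dataset.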