Introduction to Real-Time Data Processing

Yogi Devendra
(yogidevendra@apache.org)

Agenda
●What is big data?
●Data at rest Vs Data in motion
●Batch processing Vs Real-time data processing (streaming)
●Examples
●When to use: Batch? Real-time?
●Current trends

Exploding sizes of datasets
●Google
○>100 PB of data every day [3]
●Large Hadron Collider:
○150M sensors delivering data 40M times per sec; about 600M collisions per sec
○>500 exabytes per day if all sensor data were recorded [2]
○Only about 0.0001% of the data is actually analysed

Data at rest Vs Data in motion
● At rest:
○ The dataset is fixed
○ a.k.a. bounded [15]
● In motion:
○ Data keeps arriving continuously
○ a.k.a. unbounded
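A minimal Python sketch of the bounded/unbounded distinction, assuming a list stands in for data at rest and an endless generator stands in for data in motion. The names `readings_at_rest` and `readings_in_motion` are illustrative only.

```python
import itertools
from typing import Iterator, List

# Bounded (data at rest): the whole dataset exists before processing starts.
readings_at_rest: List[int] = [3, 5, 8, 13]


def readings_in_motion() -> Iterator[int]:
    """Unbounded (data in motion): yields values forever; there is no end."""
    value = 0
    while True:
        value += 1
        yield value


# A bounded dataset can be aggregated in full.
total_at_rest = sum(readings_at_rest)

# An unbounded stream can only be observed over a finite slice of it.
first_five = list(itertools.islice(readings_in_motion(), 5))
```

Calling `sum()` directly on `readings_in_motion()` would never return, which is why unbounded data needs a different processing style.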

Data at rest Vs Data in motion (continued)
●Big data generally has velocity
○data keeps arriving continuously
●The difference lies in when you analyze your data [5]
○after the event occurs ⇒ at rest
○as the event occurs ⇒ in motion

Examples
●Data at rest
○Computing statistics about a group of people in a closed room
○Analyzing last month's sales data to make strategic decisions
●Data in motion
○Computing statistics about a group of runners in a marathon
○E-commerce order processing

Batch processing
●Problem statement:
○Process this entire dataset
○Give the answer for X at the end
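The batch problem statement can be sketched in Python as follows. This is an illustrative sketch, not any framework's API: the function name `batch_sales_summary` and the sample orders are assumptions made up for the example.

```python
from typing import Dict, List, Tuple


def batch_sales_summary(orders: List[Tuple[str, float]]) -> Dict[str, float]:
    """Scan the complete, fixed dataset and return one answer at the end."""
    totals: Dict[str, float] = {}
    for product, amount in orders:
        totals[product] = totals.get(product, 0.0) + amount
    return totals


# The full input is available before the job starts (hypothetical data).
last_month_orders = [("book", 12.0), ("pen", 1.5), ("book", 8.0)]
summary = batch_sales_summary(last_month_orders)
```

No partial result is visible while the job runs; the caller sees `summary` only after every record has been read.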

Batch processing : Use-cases
● Sales summary for the previous month [5]
● Training a model to detect spam emails

Batch processing : Characteristics
●Access to the entire dataset
●Splits are decided at launch time
●Capable of complex analysis (e.g. model training) [6]
●Optimized for throughput (data processed per sec)
●Example frameworks: MapReduce, Apache Spark [6]

Real-time data processing
● a.k.a. stream processing
● Problem statement:
○ Process an incoming stream of data
○ Give the answer for X at this moment
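By contrast, a streaming computation can be sketched as one that emits an up-to-date answer after every record. The sketch below assumes a running average as the "X" being asked for; `running_average` is an illustrative name.

```python
from typing import Iterable, Iterator


def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Emit the current answer after every incoming record."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        # The answer "at this moment" is available without waiting for the end.
        yield total / count


# A short finite list stands in for the stream so the example terminates.
answers = list(running_average([10.0, 20.0, 30.0]))
```

Only `count` and `total` are kept in memory, so the same function works whether the stream ends or not.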

Stream processing : Use-cases
● E-commerce order processing
● Credit card fraud detection
● Labeling a given email as spam vs. non-spam

Stream processing : Characteristics
● Results for X are based on the current data
● Computes a function on one record or a small window of records [6]
● Optimized for latency (avg. time taken to process a record)
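Computing over a small window can be sketched with a fixed-size buffer. The example assumes a count-based sliding window and a maximum as the function; real engines also offer time-based windows, which are not shown here.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def windowed_max(stream: Iterable[int], window_size: int) -> Iterator[int]:
    """Yield the maximum over the most recent `window_size` records."""
    # deque(maxlen=...) silently drops the oldest record when it is full.
    window: Deque[int] = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield max(window)


result = list(windowed_max([4, 9, 2, 1, 7], window_size=3))
```

Each output depends only on the last three records, so memory use stays constant however long the stream runs.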

Stream processing : Characteristics (continued)
●Computation must complete in near real-time
●Computes something relatively simple, e.g. using a pre-defined model to label a record
●Example frameworks: Apache Apex, Apache Storm
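Labeling records with a pre-defined model might look like the sketch below. The "model" here is a hypothetical keyword set standing in for whatever an earlier batch training job produced; it is not a real spam classifier.

```python
from typing import FrozenSet, Iterable, Iterator, Tuple

# Stand-in for a model trained earlier by a batch job (hypothetical keywords).
SPAM_WORDS: FrozenSet[str] = frozenset({"lottery", "winner", "prize"})


def label_email(subject: str) -> str:
    """Cheap per-record decision using the pre-defined model."""
    words = set(subject.lower().split())
    return "spam" if words & SPAM_WORDS else "non-spam"


def label_stream(subjects: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Label each record independently as it arrives."""
    for subject in subjects:
        yield subject, label_email(subject)


incoming = ["Lottery winner announced", "Meeting at noon"]
labels = [label for _, label in label_stream(incoming)]
```

The expensive work (training) happens in batch; the streaming side only applies the result, which keeps per-record latency low.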

Batch Vs Streaming
●Pani puri (served one piece at a time) ⇒ streaming (image ref [9])
●Wada (fried a whole batch at a time) ⇒ batch (image ref [8])

Micro-batch
●Create batches of small size
●Process each micro-batch separately
●Example frameworks: Spark Streaming
●Pani puri ⇒ micro-batch (image ref [10])
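A rough Python sketch of micro-batching follows. It cuts batches by record count to keep the example deterministic; this is a simplification, since engines such as Spark Streaming cut batches by time interval. The helper name `micro_batches` is illustrative.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into small batches of up to `batch_size` records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the (finite) example stream is exhausted
        yield batch


# Each micro-batch is processed separately, here with a simple sum.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

Results arrive once per micro-batch rather than once per record, so latency sits between pure batch and pure streaming.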

When to use : Batch Vs Streaming?
● Depends on the use-case
○Some are suitable for batch
○Some are suitable for streaming
○Some can be solved by either one
○Some might need a combination of the two

When to use : Batch Vs Real time? (continued)
●Answers for the current snapshot ⇒ Real-time
○Answers at the end ⇒ Open (either can work)
●Complex calculations, multiple iterations over the entire data ⇒ Batch
○Simple computations ⇒ Open
●Low latency requirements (< 1s) ⇒ Real-time

When to use : Batch Vs Real time? (continued)
●Each record can be processed independently ⇒ Open
○Independent processing not possible ⇒ Batch
● Depends on the use-case
○Some use-cases can be solved by either one
○Some others might need a combination of the two

Can one replace the other?
●Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if processed in batch mode.
●Real-time processing is designed for ‘data in motion’, but it can be used for ‘data at rest’ as well (in many cases).
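One way to picture streaming logic being reused on data at rest is to replay a stored dataset through a per-record handler. The class `OrderCounter` and the function `replay` are hypothetical names for this sketch, not part of any framework.

```python
from typing import Dict, Iterable


class OrderCounter:
    """Per-record (streaming-style) handler keeping a running count per product."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def on_record(self, product: str) -> None:
        self.counts[product] = self.counts.get(product, 0) + 1


def replay(records: Iterable[str], handler: OrderCounter) -> None:
    """Feed stored (at-rest) records to the handler one at a time."""
    for record in records:
        handler.on_record(record)


archived_orders = ["book", "pen", "book"]  # a fixed, bounded dataset
counter = OrderCounter()
replay(archived_orders, counter)
```

After the replay, `counter.counts` holds the same totals a batch job would produce. The reverse does not hold: a batch job cannot start until its input is complete.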

Quiz : is this Batch or Real-time?
●Queue for a roller coaster ride (image ref [11])
●Queue at the petrol pump (image ref [12])

Quiz : is this Batch or Real-time?
●Selecting a relevant ad to show for a requested page (image ref [13])
●Courier dispatch from city A to B (image ref [14])

Current trends
●Difficulty in expressing problems as MapReduce ⇒ alternative paradigms for expressing user intent
●More and more use-cases demand faster insight into data (near real-time)
●‘Data in motion’ is common
●‘Real-time data processing’ is gaining traction

References
1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
13. Publishers | Propellerads https://propellerads.com/publishers/
14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
17. Thank You http://www.planwallpaper.com/thank-you
