2. Introduction – About Me
• Principal (big) Data Architect
• Think Big Analytics – 7 years
• Data Lakes, Streaming Analytics, ETL, Strategy
• Before Big Data
– Analytic Data Warehousing
– OLTP
– Electricity
– High End Graphics
– Supercomputers
– Numerical Analysis
@douglas_ma
10. Director @ a major US airline:
“It’s not about analyzing 7 years of history to
make the future better,
it’s about looking at what happened this morning
and to make this afternoon better”
11. What is a Streaming Data Lake?
1. Data in Motion
2. Layers of Curation: Source-Facing, Canonical Model, Consumer-Facing
13. Do’s & Don’ts
Don’t slow the data down
Example: don’t turn CDC into batches
[Diagram: a Streaming & CDC source landing in the Raw zone, then hopping batch by batch through the Processed and Conformed & Integrated zones]
14. Do’s & Don’ts
Do keep your data moving: curate in a stream, sync as needed
[Diagram: a Streaming & CDC source flowing continuously through the Raw, Processed, and Conformed & Integrated zones, with syncs to durable NoSQL storage along the way]
15. Do’s & Don’ts
Do know your data, know your requirements, and how they relate to time
[Diagram: events a–d flowing from the real world into the IT system and on to a consumed projection, annotated with event time, operational time, system latency, response time, and the watermark]
16. Do’s & Don’ts
Do think of batches as degenerate* streams
*degenerate as in mathematics
[Diagram: events a–d from the real world lumped into batches as they enter the IT system, shown against event time and operational time]
17. Do’s & Don’ts
Do checkpoint your streams
Important:
– Audit Balance Controls
– Recoverability
18. Do’s & Don’ts
Don’t spread related events across topics
[Diagram: events a and b, produced together, split across a Profile Topic and a Sales Topic]
a) Profile update event
b) Sales transaction
19. Do’s & Don’ts
Do put related topics together
[Diagram: instead of splitting events a and b across a Profile Topic and a Sales Topic, both land on a single Customer Topic]
a) Profile update event
b) Sales transaction
21. Thank You!
Rate This Session #1254 with the Teradata Analytics Universe Mobile App
Follow me on Twitter: @douglas_ma
Questions/Comments – Email: Douglas.Moore@Teradata.com
Editor's Notes
I’ve been a big data architect for the last 7 years.
I’ve delivered a lot of data lakes, streaming systems, ETL, and strategy to customers here and in Europe.
For the last 40 years, it’s been about integrated data.
Then 10 years ago it became about more and bigger data: the more data, the more value you can extract.
Machine learning came along, and simple algorithms given more data performed better than complicated rulesets and distilled expert opinions.
Then deep learning came along, and those algorithms are even more data hungry, very hungry: the curse of dimensionality.
These days, it’s not just more data; it’s more data in less time that drives value.
You still need curated data: when sensor data comes in, you’ll find lots of noise, dropouts, etc.
You still need to integrate your data, link it. Your sensor, claim, and reservation data become so much more valuable when linked to your customers, devices, properties, …
Now you have to do all this not in 30 days, not in a week or a day, but within seconds.
The value of data is perishable
The half-life of a tweet is just 2.8 hours, as found by Hilary Mason, then Bit.ly’s lead data scientist.
Hilary Mason found that links have different lifespans depending on whether they are posted on Facebook or Twitter or sent through e-mail or chat clients. After analyzing 1,000 popular links shared on bit.ly, Ms. Mason discovered that the average half-life of a link on Twitter is 2.8 hours. On Facebook it’s 3.2 hours, and for e-mail and messenger services it’s 3.4 hours. This means a link gets an extra 24 minutes of life on Facebook compared to Twitter.
Relate this to an engagement story: AWS Storm-based streaming analytics… binning event counts, fitting to a curve, R-based models.
A hip, established, customer-centered company is potentially harming Joe’s credit record because they can’t integrate their systems in a reasonable amount of time.
This is an example of a utility company with a failing digital strategy: they can’t, within a reasonable amount of time, integrate their mobile/internet channels with the rest of their legacy systems.
In this case, a high-tech digital company is just annoying Sheila.
Ed here is perplexed as to why there is just some random delay in updating his account.
What she’s saying here is that big data is nice, but the real value comes in producing insights and re-routing planes and resources in a timely manner that has meaningful impact on operations.
Someone suggested to me, perhaps we should call this a “Data River”
Discuss Enriching vs. standardizing
(appending quality factors, corrections, keeping original values)
Discuss Validation vs. Routing
You will need to join with ‘slow streams’. Keep them close in dataframes, caches.
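A minimal sketch of what that stream-to-“slow stream” join can look like in PySpark Structured Streaming, assuming a Kafka source, a customer dimension stored as Parquet with a customer_id column, and illustrative paths and topic names throughout:

```python
# A minimal sketch of a stream-to-"slow stream" join in PySpark Structured
# Streaming. The paths, topic, and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# The "slow stream": a small, slowly changing reference table,
# kept close by caching it in memory.
customers = spark.read.parquet("/lake/conformed/customers").cache()

# The fast stream: raw events arriving on Kafka.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sales")
          .load()
          .selectExpr("CAST(key AS STRING) AS customer_id",
                      "CAST(value AS STRING) AS payload"))

# Stream-static join: every micro-batch is enriched from the cached table.
enriched = events.join(customers, on="customer_id", how="left")

(enriched.writeStream
 .format("console")
 .option("checkpointLocation", "/chk/enrich")
 .start())
```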
The first don’t: don’t slow the data down.
Anti-practice: “This one client… would source data via CDC… then land it in HDFS, and that was it. No standardization, common keys, common summarizations…
they would talk about real time… yet they terminated the data flows at HDFS.”
They’re incurring a large cost by first doing it as a batch and then later as a stream.
Best practice: build levels of curation within streams; sync to durable storage as needed for other access patterns.
For stateful stream processing with a large watermark on the data projection, you’ll need low-latency NoSQL storage, sized according to your working set (volume rate × watermark).
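As a hedged sketch of that best practice, here is one in-stream curation hop, assuming illustrative topic names and paths: the curated events keep flowing to the next topic while foreachBatch syncs the same micro-batch to durable storage.

```python
# A sketch of one in-stream curation hop: standardize raw CDC events, keep
# them flowing to the next topic, and sync the same micro-batch to durable
# storage for other access patterns. Topic names and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("curate-in-stream").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "raw-orders")
       .load()
       .selectExpr("CAST(key AS STRING) AS key",
                   "CAST(value AS STRING) AS value"))

# Curation stays in the stream (a stand-in for real standardization logic).
processed = raw.withColumn("value", upper(col("value")))

def forward_and_sync(batch_df, batch_id):
    # Keep the data moving: publish the curated events to the next topic...
    (batch_df.write
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092")
     .option("topic", "processed-orders")
     .save())
    # ...and sync to durable storage as needed, without stopping the flow.
    batch_df.write.mode("append").parquet("/lake/processed/orders")

(processed.writeStream
 .foreachBatch(forward_and_sync)
 .option("checkpointLocation", "/chk/curate")
 .start())
```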
Let’s say you have a real-time analytics system and you want to see worldwide reservations, or claims, or orders, or equipment status summarized to a rolling five-minute window:
Response Time – the time between initiating a request and when the start of the response is first received.
System Latency – the time between the event time and when the event is available for analysis.
Operational Time – when the event arrived into your data management system.
Watermark – the maximum lateness of a late-arriving event before it’s considered too late. You can extend your watermark, but you’ll need more memory to maintain state.
Event Time – when the business recognized the event, e.g. when the order was signed, when the payment was processed, when the item was shipped.
There are even more aspects of time: processing windows, tumbling windows, sliding windows, recovery point objectives, return to operations.
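To make those terms concrete, here is a minimal PySpark Structured Streaming sketch of the rolling five-minute summary with an explicit watermark. The topic name is an assumption, and Kafka’s ingest timestamp stands in for true business event time, which in practice you would parse out of the payload.

```python
# A minimal sketch of the rolling five-minute summary with an explicit
# watermark in PySpark Structured Streaming. The topic name is assumed, and
# Kafka's ingest timestamp stands in for true business event time here.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("rolling-summary").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "reservations")
          .load()
          .select(col("timestamp").alias("event_time")))

# The watermark bounds how late an event may arrive and still be counted;
# extending it tolerates later events but costs more memory for state.
summary = (events
           .withWatermark("event_time", "10 minutes")
           .groupBy(window(col("event_time"), "5 minutes"))
           .agg(count("*").alias("event_count")))

(summary.writeStream
 .outputMode("update")
 .format("console")
 .option("checkpointLocation", "/chk/summary")
 .start())
```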
Think of batches as degenerate streams: events are lumped together into thin slices of operational time.
If you need another justification for doing streams, just remember it takes more resources, at lower system utilization, to process batches.
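One way to act on that, sketched here with PySpark’s run-once trigger (topic and paths are illustrative): the same streaming job doubles as the batch job by draining whatever has accumulated and stopping.

```python
# A sketch of a batch as a degenerate stream: the same streaming job, run
# with a run-once trigger, drains whatever has accumulated and stops. One
# code path serves both cases. Topic and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-as-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

(events.writeStream
 .format("parquet")
 .option("path", "/lake/raw/orders")
 .option("checkpointLocation", "/chk/orders")
 .trigger(once=True)  # process available data as one "batch", then stop
 .start())
```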
Do checkpoint and perform audit balance controls on your streams
Anti-practice: “This major travel site, handling 100 billion XML events per day…
They pay commissions based on their weblogs, so accuracy is important.
They have a beautifully designed streaming data lake… to checkpoint, they quiesce the producers once a day at midnight, synchronize, then restart the producers.”
Now, this works for them; they can recover to the previous day’s values.
Instead, look at dropping a marker into each stream partition every hour, or every 5 minutes; this gives them an opportunity to reduce their recovery point objective.
Best practice: “Drop a Coke can” – metrics, metrics, metrics.
Every 5 minutes, generate a count of your events and emit it on your metrics stream.
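A hedged sketch of that metrics pattern, with assumed topic names: count events in five-minute windows and emit the counts on a separate metrics topic, so audit balance controls can reconcile counts across stages.

```python
# A sketch of the metrics pattern: count events in five-minute windows and
# emit the counts on a separate metrics topic, so audit balance controls can
# reconcile counts across stages. Topic names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, struct, to_json, window

spark = SparkSession.builder.appName("abc-metrics").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sales")
          .load())

metrics = (events
           .withWatermark("timestamp", "10 minutes")
           .groupBy(window("timestamp", "5 minutes"))
           .agg(count("*").alias("event_count"))
           .select(to_json(struct("window", "event_count")).alias("value")))

(metrics.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")
 .option("topic", "metrics")
 .option("checkpointLocation", "/chk/metrics")
 .start())
```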
Let’s say a customer comes in and updates their credit card, and then they go to order a widget from your website.
Let’s say your transactional system writes a and b in the correct order.
Your CDC captures these two events.
Your streaming system takes the two events and spreads them out over two subject-oriented topics.
In this example, there’s a chance that the sales transaction event arrives before the profile update reaches your system.
Pain ensues.
Topics and partitions only guarantee order of delivery within a partition, so don’t put your related events into separate topics.
You’ve just exacerbated the one problem you were trying to avoid with late-arriving data.
What if the customer profile changed and then they performed a transaction? It’s the same kind of thing: the two are related, and you want to make sure they arrive in order as much as possible.
Do send fully annotated/enriched events, unless you have a ridiculously large blob, like a movie or something.
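For illustration only (every field name here is made up), a fully annotated event versus a bare one:

```python
# Illustrative only: a fully annotated/enriched event carries the context a
# consumer needs inline, versus a bare event that forces lookups downstream.
bare_event = {"type": "sale", "customer_id": "cust-42", "item_id": "sku-7"}

enriched_event = {
    "type": "sale",
    "customer_id": "cust-42",
    "customer_segment": "gold",    # appended from the profile
    "item_id": "sku-7",
    "item_category": "widgets",    # appended from the catalog
    "event_time": "2018-10-14T09:30:00Z",
    "quality": {"source": "pos-3", "corrected": False},  # quality factors
}
```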
Instead, put related topics together
Topics & partitions guarantee order of delivery, do put your related records into the same topic & partition to help ensure the correct order of delivery and analysis.
There’s much more to know, but alas, our time is short.
There’s tremendous value in now.
With now, you can better satisfy your customers and capture value your competitors are missing.
Keep your data moving; it will require learning a couple of new things, but overall it will be more efficient and will better serve your business.
Know how your data relates to time; make sure event, operational, latency, and response times are clearly tracked and understood by all involved.