2. Introduction – About Me
• Principal (big) Data Architect
• Think Big Analytics – 7 years
• Data Lakes, Streaming Analytics, ETL, Strategy
• Before Big Data
– Analytic Data Warehousing
– OLTP
– Electricity
– High End Graphics
– Supercomputers
– Numerical Analysis
@douglas_ma
10. Director @ a major US airline:
“It’s not about analyzing 7 years of history to
make the future better,
it’s about looking at what happened this morning
and to make this afternoon better”
11. What is a Streaming Data Lake?
1. Data in Motion
2. Layers of Curation: Source-Facing, Canonical Model, Consumer-Facing
13. Do’s & Don’ts
Don’t slow the data down
Example: don’t turn CDC into batches
[Diagram: a Streaming & CDC source landing in the Raw zone, then hopping batch by batch through the Processed and Conformed & Integrated zones]
14. Do’s & Don’ts
Do keep your data moving: curate in a stream, sync as needed
[Diagram: a Streaming & CDC source flowing continuously through the Raw, Processed, and Conformed & Integrated zones, with syncs to durable NoSQL storage along the way]
15. Do’s & Don’ts
Do know your data, know your requirements, and how they relate to time
[Diagram: events a–d flowing from the real world into the IT system and on to a consumed projection, annotated with event time, operational time, system latency, response time, and the watermark]
16. Do’s & Don’ts
Do think of batches as degenerate* streams
*degenerate as in mathematics
[Diagram: events a–d from the real world lumped into batches as they enter the IT system, shown against event time and operational time]
17. Do’s & Don’ts
Do checkpoint your streams
Important:
– Audit Balance Controls
– Recoverability
18. Do’s & Don’ts
Don’t spread related events across topics
[Diagram: events a and b, produced together, split across a Profile Topic and a Sales Topic]
a) Profile update event
b) Sales transaction
19. Do’s & Don’ts
Do put related topics together
[Diagram: instead of splitting events a and b across a Profile Topic and a Sales Topic, both land on a single Customer Topic]
a) Profile update event
b) Sales transaction
21. Thank You!
Rate This Session #1254 with the Teradata Analytics Universe Mobile App
Follow me on Twitter: @douglas_ma
Questions/Comments – Email: Douglas.Moore@Teradata.com
Editor's Notes
I’ve been a big data architect for the last 7 years.
I’ve delivered a lot of data lakes, streaming systems, ETL, and strategy to customers here and in Europe.
For the last 40 years, it’s been about integrated data.
Then 10 years ago it became about more and bigger data: the more data, the more value you can extract.
Machine learning came along, and simple algorithms given more data performed better than complicated rulesets and distilled expert opinions.
Then deep learning came along, and those algorithms are even more data hungry, very hungry: the curse of dimensionality.
These days, it’s not just more data; it’s more data in less time that drives value.
You still need curated data: when sensor data comes in, you’ll find lots of noise, dropouts, etc.
You still need to integrate your data, link it. Your sensor, claim, and reservation data become so much more valuable when linked to your customers, devices, properties, …
Now you have to do all this not in 30 days, not in a week or a day, but within seconds.
The value of data is perishable
The half-life of a tweet is just 2.8 hours, as found by Hilary Mason, then Bit.ly’s lead data scientist.
Hilary Mason found that links have different lifespans depending on whether they are posted on Facebook or Twitter or sent through e-mail or chat clients. After analyzing 1,000 popular links shared on bit.ly, Ms. Mason discovered that the average half-life of a link on Twitter is 2.8 hours. On Facebook it’s 3.2 hours, and for e-mail and messenger services it’s 3.4 hours. This means a link gets an extra 24 minutes of life on Facebook compared to Twitter.
Relate this to an engagement story: AWS Storm-based streaming analytics… binning event counts, fitting to a curve, R-based models.
A hip, established, customer-centered company is potentially harming Joe’s credit record because they can’t integrate their systems in a reasonable amount of time.
This is an example of a utility company with a failing digital strategy: they can’t, within a reasonable amount of time, integrate their mobile/internet channels with the rest of their legacy systems.
In this case, a high-tech digital company is just annoying Sheila.
Ed here is perplexed as to why there is just some random delay in updating his account.
What she’s saying here is that big data is nice, but the real value comes in producing insights and re-routing planes and resources in a timely manner that has meaningful impact on operations.
Someone suggested to me, perhaps we should call this a “Data River”
Discuss Enriching vs. standardizing
(appending quality factors, corrections, keeping original values)
Discuss Validation vs. Routing
You will need to join with ‘slow streams’. Keep them close in dataframes, caches.
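A minimal sketch of what that stream-to-“slow stream” join can look like in PySpark Structured Streaming, assuming a Kafka source, a customer dimension stored as Parquet with a customer_id column, and illustrative paths and topic names throughout:

```python
# A minimal sketch of a stream-to-"slow stream" join in PySpark Structured
# Streaming. The paths, topic, and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# The "slow stream": a small, slowly changing reference table,
# kept close by caching it in memory.
customers = spark.read.parquet("/lake/conformed/customers").cache()

# The fast stream: raw events arriving on Kafka.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sales")
          .load()
          .selectExpr("CAST(key AS STRING) AS customer_id",
                      "CAST(value AS STRING) AS payload"))

# Stream-static join: every micro-batch is enriched from the cached table.
enriched = events.join(customers, on="customer_id", how="left")

(enriched.writeStream
 .format("console")
 .option("checkpointLocation", "/chk/enrich")
 .start())
```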
The first don’t: don’t slow the data down.
Anti-practice: “This one client… would source data via CDC… then land it in HDFS, and that was it. No standardization, common keys, common summarizations…
they would talk about real time… yet they terminated the data flows at HDFS.”
They’re incurring a large cost by first doing it as a batch and then later as a stream.
Best practice: build levels of curation within streams; sync to durable storage as needed for other access patterns.
For stateful stream processing with a large watermark on the data projection, you’ll need low-latency NoSQL storage, sized according to your working set (volume rate × watermark).
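As a hedged sketch of that best practice, here is one in-stream curation hop, assuming illustrative topic names and paths: the curated events keep flowing to the next topic while foreachBatch syncs the same micro-batch to durable storage.

```python
# A sketch of one in-stream curation hop: standardize raw CDC events, keep
# them flowing to the next topic, and sync the same micro-batch to durable
# storage for other access patterns. Topic names and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("curate-in-stream").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "raw-orders")
       .load()
       .selectExpr("CAST(key AS STRING) AS key",
                   "CAST(value AS STRING) AS value"))

# Curation stays in the stream (a stand-in for real standardization logic).
processed = raw.withColumn("value", upper(col("value")))

def forward_and_sync(batch_df, batch_id):
    # Keep the data moving: publish the curated events to the next topic...
    (batch_df.write
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092")
     .option("topic", "processed-orders")
     .save())
    # ...and sync to durable storage as needed, without stopping the flow.
    batch_df.write.mode("append").parquet("/lake/processed/orders")

(processed.writeStream
 .foreachBatch(forward_and_sync)
 .option("checkpointLocation", "/chk/curate")
 .start())
```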
Let’s say you have a real-time analytics system and you want to see worldwide reservations, or claims, or orders, or equipment status summarized to a rolling five-minute window:
Response Time – the time between initiating a request and when the start of the response is first received.
System Latency – the time between the event time and when the event is available for analysis.
Operational Time – when the event arrived into your data management system.
Watermark – the maximum lateness of a late-arriving event before it’s considered too late. You can extend your watermark, but you’ll need more memory to maintain state.
Event Time – when the business recognized the event, e.g. when the order was signed, when the payment was processed, when the item was shipped.
There are even more aspects of time: processing windows, tumbling windows, sliding windows, recovery point objectives, return to operations.
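To make those terms concrete, here is a minimal PySpark Structured Streaming sketch of the rolling five-minute summary with an explicit watermark. The topic name is an assumption, and Kafka’s ingest timestamp stands in for true business event time, which in practice you would parse out of the payload.

```python
# A minimal sketch of the rolling five-minute summary with an explicit
# watermark in PySpark Structured Streaming. The topic name is assumed, and
# Kafka's ingest timestamp stands in for true business event time here.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("rolling-summary").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "reservations")
          .load()
          .select(col("timestamp").alias("event_time")))

# The watermark bounds how late an event may arrive and still be counted;
# extending it tolerates later events but costs more memory for state.
summary = (events
           .withWatermark("event_time", "10 minutes")
           .groupBy(window(col("event_time"), "5 minutes"))
           .agg(count("*").alias("event_count")))

(summary.writeStream
 .outputMode("update")
 .format("console")
 .option("checkpointLocation", "/chk/summary")
 .start())
```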
Think of batches as degenerate streams: events are lumped together into thin slices of operational time.
If you need another justification for doing streams, just remember it takes more resources, at lower system utilization, to process batches.
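One way to act on that, sketched here with PySpark’s run-once trigger (topic and paths are illustrative): the same streaming job doubles as the batch job by draining whatever has accumulated and stopping.

```python
# A sketch of a batch as a degenerate stream: the same streaming job, run
# with a run-once trigger, drains whatever has accumulated and stops. One
# code path serves both cases. Topic and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-as-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

(events.writeStream
 .format("parquet")
 .option("path", "/lake/raw/orders")
 .option("checkpointLocation", "/chk/orders")
 .trigger(once=True)  # process available data as one "batch", then stop
 .start())
```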
Do checkpoint and perform audit balance controls on your streams
Anti-practice: “This major travel site, handling 100 billion XML events per day…
They pay commissions based on their weblogs, so accuracy is important.
They have a beautifully designed streaming data lake… to checkpoint, they quiesce the producers once a day at midnight, synchronize, then restart the producers.”
Now, this works for them; they can recover to the previous day’s values.
Instead, look at dropping a marker into each stream partition every hour, or every 5 minutes; this gives them an opportunity to reduce their recovery point objective.
Best practice: “Drop a Coke can” – metrics, metrics, metrics.
Every 5 minutes, generate a count of your events and emit it on your metrics stream.
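A hedged sketch of that metrics pattern, with assumed topic names: count events in five-minute windows and emit the counts on a separate metrics topic, so audit balance controls can reconcile counts across stages.

```python
# A sketch of the metrics pattern: count events in five-minute windows and
# emit the counts on a separate metrics topic, so audit balance controls can
# reconcile counts across stages. Topic names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, struct, to_json, window

spark = SparkSession.builder.appName("abc-metrics").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sales")
          .load())

metrics = (events
           .withWatermark("timestamp", "10 minutes")
           .groupBy(window("timestamp", "5 minutes"))
           .agg(count("*").alias("event_count"))
           .select(to_json(struct("window", "event_count")).alias("value")))

(metrics.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")
 .option("topic", "metrics")
 .option("checkpointLocation", "/chk/metrics")
 .start())
```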
Let’s say a customer comes in and updates their credit card, and then they go to order a widget from your website.
Let’s say your transactional system writes a and b in the correct order.
Your CDC captures these two events.
Your streaming system takes the two events and spreads them out over two subject-oriented topics.
In this example, there’s a chance that the sales transaction event arrives before the profile update reaches your system.
Pain ensues.
Topics and partitions only guarantee order of delivery within a partition, so don’t put your related events into separate topics.
You’ve just exacerbated the one problem you were trying to avoid with late-arriving data.
What if the customer profile changed and then they performed a transaction? It’s the same kind of thing: the two are related, and you want to make sure they arrive in order as much as possible.
Do send fully annotated/enriched events, unless you have a ridiculously large blob, like a movie or something.
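For illustration only (every field name here is made up), a fully annotated event versus a bare one:

```python
# Illustrative only: a fully annotated/enriched event carries the context a
# consumer needs inline, versus a bare event that forces lookups downstream.
bare_event = {"type": "sale", "customer_id": "cust-42", "item_id": "sku-7"}

enriched_event = {
    "type": "sale",
    "customer_id": "cust-42",
    "customer_segment": "gold",    # appended from the profile
    "item_id": "sku-7",
    "item_category": "widgets",    # appended from the catalog
    "event_time": "2018-10-14T09:30:00Z",
    "quality": {"source": "pos-3", "corrected": False},  # quality factors
}
```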
Instead, put related topics together
Topics & partitions guarantee order of delivery, do put your related records into the same topic & partition to help ensure the correct order of delivery and analysis.
There’s much more to know, but alas, our time is short.
There’s tremendous value in now.
With now, you can better satisfy your customers and capture value your competitors are missing.
Keep your data moving; it will require learning a couple of new things, but overall it will be more efficient and will better serve your business.
Know how your data relates to time; make sure event, operational, latency, and response times are clearly tracked and understood by all involved.