Building a Data Lake - An App Dev's Perspective

Building a Data Lake
An App Dev’s Perspective
GeekNight Hyderabad - March 8th 2017
Geetha Balasundaram
geethab@thoughtworks.com
© 2017 ThoughtWorks Technologies Pvt. Limited

ABOUT ME
Developer @ ThoughtWorks
Building a data lake in the enterprise ecosystem
Helping a retail business make sense of it (data-guided org)
Been part of the web development space (enterprise rewrite)
As startled as everyone else by the data engineering space
Sharing know-how and do-how from our team’s experience
snithish@thoughtworks.com

AGENDA
What is data in the true sense…
Data Warehouse in an enterprise ecosystem...
What is a data lake...
Data lake implementation in an enterprise ecosystem…
How to make effective use of a data lake: technology+process+people
Cluster Administration tool - Cloudera Manager
Pitfalls to avoid

Question ???
How did R. Ashwin perform in the last Test match?
HIGH LEVEL
PROBLEM STATEMENT

COMPLEX HISTORICAL DATA
Why?
Exploit and derive as many new insights as possible
Match Made
Enterprise systems produce this kind of complexity

DATA WAREHOUSE
https://martinfowler.com/articles/microservices.html
ETL

DID MICROSERVICES CAUSE THIS PROBLEM ?
Decentralised Data
https://martinfowler.com/articles/microservices.html

MICROSERVICES HELPED
Break down business units
Break down complexity
Understand the nature of data

Question ???
R. Ashwin performed well (6/41) in yesterday’s match!
Complex historical data can quantify how well he has performed
Can we say why he did well in this particular match?
What factors contributed to his improved performance?

FACT is a FACT
… even when we don’t know how it can be used

KEY DIFFERENCE
https://martinfowler.com/bliki/DataLake.html

What is a data lake?

LAKE is...
.. a large body of water in a more natural state.
The contents of the lake stream in from a source to fill the lake,
and various users of the lake can come to examine, dive in, or
take samples
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/

DATA LAKE is...
.. a large body of data facts (in place of water) in a more natural state.
The contents of the lake stream in from a source to fill the lake,
and various users of the lake can come to analyse (examine), build
models (dive in), or use a subset for specific use cases (take samples)

Implementation

OUR IMPLEMENTATION - TECH STACK
DATA SOURCE
DATA INGESTION
DATA LAKE
DATA MARTS
DATA ANALYSIS
Staging / Queue

How to make effective use of a data lake:
technology+process+people

Functionality Vs Reality
I need a feature so that I can do this action…..
to
I need this insight so that I can take this action….
e.g.: I need functionality to order items anytime before or during a promotion…
to
I need to know in time whether I have to order items before or during a promotion…
so that I can improve promotion sales
People

Start Simple
There is no data lake yet…
Carve out portions of data that are easy wins yet critical to
arriving at the insight stated earlier
Set up the infrastructure and pipeline
Get your hands dirty
e.g.: Sales is an important factor when analysing or predicting anything in the retail space
Technology

How much should I know about the data ?
As a consumer of data (read ‘not a consumer of service’)
How much should I know about it?
Schema ⇔ Contracts
Nature of the data:
versioned vs latest
transactional vs reference
facts vs aggregate
frequency of change
…..
Technology
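One way to make the "Schema ⇔ Contracts" idea concrete is to write down, per dataset, both the schema and the nature of the data. The sketch below is only an illustration: the class name DatasetContract, its fields, and the sample "sales" columns are assumptions made for this example, not part of the team's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Versioning(Enum):
    VERSIONED = "versioned"
    LATEST = "latest"


class Kind(Enum):
    TRANSACTIONAL = "transactional"
    REFERENCE = "reference"


class Grain(Enum):
    FACT = "fact"
    AGGREGATE = "aggregate"


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical descriptor of what a consumer should know about a dataset."""
    name: str
    schema: dict            # column name -> expected Python type
    versioning: Versioning
    kind: Kind
    grain: Grain
    change_frequency: str   # free text, e.g. "daily"

    def violations(self, record: dict) -> list:
        """Return the ways in which a record breaks the schema contract."""
        problems = []
        for column, expected_type in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif not isinstance(record[column], expected_type):
                problems.append(f"wrong type for {column}")
        return problems


# Illustrative contract; the column names are made up for this sketch.
sales_contract = DatasetContract(
    name="sales",
    schema={"store_id": str, "item_id": str, "quantity": int},
    versioning=Versioning.LATEST,
    kind=Kind.TRANSACTIONAL,
    grain=Grain.FACT,
    change_frequency="daily",
)

good_record = {"store_id": "S1", "item_id": "I1", "quantity": 3}
bad_record = {"store_id": "S1", "quantity": "3"}
```

Calling sales_contract.violations(bad_record) reports the missing item_id column and the wrongly typed quantity, which is the kind of check a consumer can run before trusting a new feed.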

DATA INSIGHT - Part 1
Incrementally add new data to the lake
Serve data for analysis
e.g.: What promotion-related data do I need to bring into the data lake?
Sales → improve promotion sales
Technology
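A minimal sketch of "incrementally add new data to the lake", assuming the lake is a directory tree with one folder per dataset and one partition per batch date. The layout, the function names ingest and read_dataset, and the sample records are all assumptions for illustration; the deck does not describe the actual storage layout.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake_root: Path, dataset: str, batch_date: str, records: list) -> Path:
    """Write one batch as a new file in a date partition; nothing existing is rewritten."""
    partition = lake_root / dataset / f"date={batch_date}"
    partition.mkdir(parents=True, exist_ok=True)
    existing_parts = len(list(partition.glob("part-*.jsonl")))
    part_file = partition / f"part-{existing_parts:05d}.jsonl"
    with part_file.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return part_file


def read_dataset(lake_root: Path, dataset: str) -> list:
    """Serve a dataset for analysis by reading every partition in path order."""
    rows = []
    for part_file in sorted((lake_root / dataset).glob("date=*/part-*.jsonl")):
        with part_file.open(encoding="utf-8") as handle:
            rows.extend(json.loads(line) for line in handle)
    return rows


# Start with sales only, then bring promotions in later as a separate dataset.
lake = Path(tempfile.mkdtemp())
ingest(lake, "sales", "2017-03-01", [{"item_id": "I1", "quantity": 4}])
ingest(lake, "sales", "2017-03-02", [{"item_id": "I1", "quantity": 6}])
ingest(lake, "promotions", "2017-03-02", [{"promotion_id": "P1", "item_id": "I1"}])

sales_rows = read_dataset(lake, "sales")
promotion_rows = read_dataset(lake, "promotions")
```

Adding the promotions dataset does not touch the sales partitions, so the lake can grow one source at a time.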

DATA INSIGHT - Part 2
Sales + Promotions → improve promotion sales
How does adding more data to the lake help in arriving at new insights?
history of past promotions sales = how much to order for this promotion
history of past promotion sales + ‘X’ = how much to order for this promotion
history of past promotion sales + ‘X’ + ‘Y’ …… = how much to order for this promotion
eg: seasonality has a strong correlation with sales
history of past promotion sales + ‘X’ + ‘Y’ …… + ‘A’ = how much to order for this promotion after the start
People
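The progression above (past promotion sales, then + ‘X’, then + ‘Y’) can be sketched as a baseline estimate that is refined as each new factor becomes available in the lake. The slide does not say how the factors are combined; treating each one as a multiplier on the average of past promotion sales is an assumption made purely for illustration, as are the numbers and the factor name store_size.

```python
from statistics import mean


def estimate_order_quantity(past_promotion_sales, factors=None):
    """Average past promotion sales, then scale by each known factor.

    `factors` maps a factor name to a multiplier. The multiplicative
    model is an assumption of this sketch, not the team's method.
    """
    estimate = float(mean(past_promotion_sales))
    for multiplier in (factors or {}).values():
        estimate *= multiplier
    return estimate


# Hypothetical sales from three past promotions.
history = [100, 120, 80]

baseline_only = estimate_order_quantity(history)
with_seasonality = estimate_order_quantity(history, {"seasonality": 1.5})
with_two_factors = estimate_order_quantity(
    history, {"seasonality": 1.5, "store_size": 0.5}
)
```

Each new dataset in the lake adds one more entry to factors, so the estimate can improve without changing the code that consumes it.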

Think Agile
Sales + Promotions + X factor → improve promotion sales
Near-perfect list of parameters
Progressive set of parameters
Sales + Promotions → is the quantity arrived at from these factors (known to the business) ordered on time?
Process

DataMarts
... as a store of bottled water – cleansed and packaged and
structured for easy consumption

DataMarts
... as a store of data subsets, curated from meaningful facts and
bundled into logical groups for arriving at useful insights
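As a rough illustration of curating a data mart from raw facts, the sketch below keeps only promotion sales and bundles them by promotion and item. The record shape, the field names, and the function build_promotion_sales_mart are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical raw sales facts as they might sit in the lake.
raw_sales = [
    {"item_id": "I1", "promotion_id": "P1", "quantity": 10},
    {"item_id": "I1", "promotion_id": "P1", "quantity": 5},
    {"item_id": "I2", "promotion_id": None, "quantity": 7},
    {"item_id": "I2", "promotion_id": "P1", "quantity": 2},
]


def build_promotion_sales_mart(facts):
    """Total quantity sold per (promotion, item); non-promotion sales are left out."""
    totals = defaultdict(int)
    for fact in facts:
        if fact["promotion_id"] is None:
            continue  # not part of this mart; the fact itself stays in the lake
        totals[(fact["promotion_id"], fact["item_id"])] += fact["quantity"]
    return dict(totals)


promotion_sales_mart = build_promotion_sales_mart(raw_sales)
```

The raw facts are left untouched in the lake; the mart is a derived, purpose-built view that can be rebuilt whenever the question changes.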

Easy Insight
Sales + Promotions →
is the quantity arrived at from these factors (known to the business) ordered on time?
System: tells me the quantity that is supposed to be ordered
for this promotion
System: tells me in real time the quantity that has been ordered
Technology
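The two system answers above can be compared directly to show where orders are still short. This is a minimal sketch: the function name order_gaps, the dictionary shapes, and the quantities are assumptions for illustration only.

```python
def order_gaps(expected: dict, ordered: dict) -> dict:
    """Return, per item, how many units are still to be ordered (shortfalls only)."""
    gaps = {}
    for item_id, expected_qty in expected.items():
        shortfall = expected_qty - ordered.get(item_id, 0)
        if shortfall > 0:
            gaps[item_id] = shortfall
    return gaps


# Hypothetical quantities for one promotion.
expected_quantities = {"I1": 100, "I2": 40, "I3": 25}  # what should be ordered
ordered_so_far = {"I1": 100, "I2": 15}                 # what has been ordered

gaps = order_gaps(expected_quantities, ordered_so_far)
```

Here gaps lists I2 and I3 as items that still need ordering, which is the insight the business can act on before the promotion starts.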

Cluster Administration Tool
Cloudera Manager

QUICK RECAP
What is data in the true sense…
Data Warehouse in an enterprise ecosystem...
What is a data lake...
Data lake implementation in an enterprise ecosystem...
How to make effective use of a data lake…
Cluster Administration tool - Cloudera Manager

PITFALLS TO AVOID
Data envy - Ref: https://martinfowler.com/bliki/Datensparsamkeit.html
Tool envy
Reliable data is a luxury
Understanding the nature of data is a must
Dialogue with the data scientist
Treating the data lake like an RDBMS
Keeping the business involved
Data flow state visibility
