Get the big picture of data platform architecture by understanding its purpose and the problems it solves.
These slides take a top-down approach, starting with the basic purpose of a data platform, i.e. to serve analytics use cases. They categorise the use cases and analyse what each one expects from a data platform.
2. Purpose of any data platform (big / not big)
is to enable analytics on data
dataeaze
Why?
3. Different analytics use cases expect different sets of
features from a data platform
The components of the big data ecosystem
are built
to serve the features that analytics use cases need
4. So, to understand a data platform
it is necessary to know its purpose
To understand data platform components
it is necessary to know the needs of the analytics use cases
which the data platform serves
5. Here
We take a look at all categories of analytics use
cases on a data platform
What?
7. We analyse each use case in terms of
Nature of the data
processing needed to
serve the use case
Expectations from the data
platform to enable the
required data processing
8. Static Reports
are summary reports prepared for the purpose of
giving status to decision makers
Example
A report for top management at the end of the day, specifying
daily sales, transactions, revenue and total traffic
9. Nature of data processing (Static Reports)
Static reports
are scheduled to execute at a fixed time interval,
generate analysis reports for a given time period,
can execute on raw data directly or on an intermediate store
10. Expectations from data platform (Static Reports)
Scheduled data processing: static reports are executed repeatedly on a predefined schedule
Timely arrival of data: generated reports should represent the complete picture of the given
timeframe and should be generated before the deadline
Process raw data to get result: capability to generate a report from raw data if it cannot be
extracted from an intermediate data form
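A minimal sketch of the static report idea above: a function that summarises the raw records of one time period. The record layout, the sample values and the name `daily_report` are assumptions made for illustration; in practice a scheduler would call such a job at the end of each period.

```python
from datetime import date

# Hypothetical raw transaction records: (day, amount). Not from the slides.
RAW_TRANSACTIONS = [
    (date(2024, 1, 1), 120.0),
    (date(2024, 1, 1), 80.0),
    (date(2024, 1, 2), 50.0),
]

def daily_report(records, day):
    """Summarise the raw records of a single day into a static report."""
    amounts = [amount for d, amount in records if d == day]
    return {
        "day": day.isoformat(),
        "transactions": len(amounts),
        "revenue": sum(amounts),
    }

# A scheduler would trigger this once per day, after the day's data has arrived.
report = daily_report(RAW_TRANSACTIONS, date(2024, 1, 1))
print(report)
```

The report is only complete if all records for the day have arrived before the job runs, which is why timely arrival of data is listed as an expectation.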
11. Dashboard Reports
A dashboard is a reporting user interface where users can interactively
choose their own view of the data using a limited set of filters.
Example
An e-commerce company provides a dashboard for sellers, where
sellers can see how much inventory was sold across demographics,
product categories and time ranges.
12. Nature of data processing (Dashboard)
Periodically process raw data to
bring it into the form required by dashboards
Populate the transformed data into the interactive
store that backs the dashboards
13. Expectations from data platform (Dashboard)
ETL: to convert raw data into the format required by the dashboard
Scheduled data processing: timely, repeated executions of ETL jobs to populate
dashboards with the latest updates
Interactive data store: dashboard reports are interactive in nature, so the backend store
is expected to return results in near real time
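The dashboard flow above can be sketched as an ETL step that pre-aggregates raw rows into a small store, plus a filter query that the dashboard would issue. The row layout and the names `run_etl` and `query` are hypothetical, and an in-memory `Counter` stands in for a real interactive data store.

```python
from collections import Counter

# Hypothetical raw sales rows; a real platform would read these from raw storage.
RAW_SALES = [
    {"seller": "s1", "category": "books", "region": "north", "units": 3},
    {"seller": "s1", "category": "books", "region": "south", "units": 2},
    {"seller": "s1", "category": "toys", "region": "north", "units": 5},
    {"seller": "s2", "category": "books", "region": "north", "units": 7},
]

def run_etl(raw_rows):
    """Aggregate raw rows by the dimensions the dashboard can filter on."""
    store = Counter()
    for row in raw_rows:
        store[(row["seller"], row["category"], row["region"])] += row["units"]
    return store

def query(store, seller, category=None, region=None):
    """Answer a dashboard request; a filter left as None matches everything."""
    total = 0
    for (s, c, r), units in store.items():
        if s == seller and category in (None, c) and region in (None, r):
            total += units
    return total

store = run_etl(RAW_SALES)  # would be re-run on a schedule
print(query(store, "s1"))
print(query(store, "s1", category="books"))
```

Because the aggregation is done ahead of time, each dashboard request only scans the small pre-aggregated store rather than the raw data.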
14. Ad Hoc data analysis
This is for business queries which are raised as the need arises.
It is not scheduled and is executed once, whenever necessary.
Example
A product manager wants a detailed analysis of
customer behavior on a navigation panel, in order to define optimised
ad placements.
15. Nature of data processing (Ad Hoc)
Steps to serve an ad hoc report:
Identify the data sources which will satisfy the given
request
Execute data processing (preferably an SQL-like
query) on the identified sources
Load the results into a data representation tool
16. Expectations from data platform (Ad Hoc)
SQL data processing engine: an SQL query engine makes it easy to express the required analysis
as an SQL query, which saves the analyst’s time
Complex data processing: a platform which supports writing custom, complex data
analysis that is not possible through SQL
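As an illustration of the ad hoc steps above, the sketch below runs a one-off SQL query against an identified source. Python's built-in `sqlite3` module stands in for a big data SQL engine, and the `clicks` table and its contents are invented for the example.

```python
import sqlite3

# In-memory database standing in for the identified data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, panel_item TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "deals"),
     ("u3", "home"), ("u2", "search")],
)

# The one-time business question: which navigation panel items get the most clicks?
rows = conn.execute(
    "SELECT panel_item, COUNT(*) AS n_clicks "
    "FROM clicks GROUP BY panel_item "
    "ORDER BY n_clicks DESC, panel_item"
).fetchall()
print(rows)
```

The resulting rows are what would then be loaded into a data representation tool.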
17. BI Reporting
Business Intelligence tools provide advanced general-purpose
dashboards which host a wide array of dimensions in the backend data
store. Users can define and save transformations and analysis queries
through the BI tool and get back reports in tabular or graphical form.
Example
A BI report representing weekly sales stats across multiple regions
for the previous 6 months. This report is created and saved once. Users
execute the saved report whenever they want.
18. Nature of data processing (BI Reporting)
Scheduled ETL jobs convert raw data into the
required intermediate data form
Data is loaded into interactive SQL data stores
BI tools are connected to the SQL data store as their
backend
19. Expectations from data platform (BI Reporting)
ETL: raw data should be transformed into the required format and
loaded into an SQL data warehouse
Scheduling of ETL: the defined ETL jobs should be scheduled to execute at a fixed time
interval
SQL data processing engine: an SQL query engine makes it easy to extract data and saves
time. BI tools can connect to this SQL data store.
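A rough sketch of the BI reporting pattern above: data already loaded by ETL into an SQL store, and a report definition that is saved once and executed on demand. `sqlite3` stands in for the SQL data warehouse; the table, the sample figures and the names `SAVED_REPORTS` and `run_saved_report` are assumptions for illustration.

```python
import sqlite3

# In-memory database standing in for the SQL data warehouse filled by ETL jobs.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE weekly_sales (week TEXT, region TEXT, sales REAL)")
warehouse.executemany(
    "INSERT INTO weekly_sales VALUES (?, ?, ?)",
    [
        ("2024-W01", "north", 100.0),
        ("2024-W01", "south", 60.0),
        ("2024-W02", "north", 120.0),
        ("2024-W02", "south", 90.0),
    ],
)

# Report definitions are created and saved once, then reused.
SAVED_REPORTS = {
    "sales_by_region": (
        "SELECT region, SUM(sales) FROM weekly_sales "
        "GROUP BY region ORDER BY region"
    ),
}

def run_saved_report(name):
    """Execute a previously saved report against the warehouse."""
    return warehouse.execute(SAVED_REPORTS[name]).fetchall()

print(run_saved_report("sales_by_region"))
```

Each execution reflects whatever the latest scheduled ETL run has loaded, without the user rewriting the query.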
20. Data Processing for Applications
This is data processing done to provide feedback as input to business
applications. Business applications take better decisions based on
the latest data feedback.
Example
Ad servers are periodically updated with the latest minimum
eCPM to expect for an ad placement that is filled dynamically.
21. Nature of data processing (App data processing)
Complex data processing (machine learning) on raw
data
Scheduled data processing
Update the results into an interactive key-value store, from which they are
fetched directly by applications
22. Expectations from data platform (App data processing)
Capability to implement custom complex data processing: users should be able to easily define custom complex data processing
algorithms (like machine learning)
Scheduled data processing: required for periodic execution of data processing jobs
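The feedback loop above can be sketched as a periodic job that derives a value per ad placement from raw data and publishes it to a key-value store that the application reads. A simple scaled average is used here as a stand-in for the machine learning step; the bid data, the `factor` parameter and the function names are all hypothetical, and a plain dict stands in for the key-value store.

```python
from statistics import mean

# Hypothetical raw bid log: (placement, observed eCPM).
RAW_BIDS = [
    ("banner_top", 2.0), ("banner_top", 4.0),
    ("sidebar", 1.0), ("sidebar", 3.0), ("sidebar", 2.0),
]

kv_store = {}  # stand-in for an interactive key-value store

def refresh_min_ecpm(bids, store, factor=0.5):
    """Scheduled job: recompute the minimum eCPM to expect per placement."""
    by_placement = {}
    for placement, ecpm in bids:
        by_placement.setdefault(placement, []).append(ecpm)
    for placement, values in by_placement.items():
        # Placeholder rule, not a real model: a fraction of the average eCPM.
        store[placement] = round(mean(values) * factor, 2)

def ad_server_floor(store, placement, default=0.0):
    """Application side: fetch the latest value directly from the store."""
    return store.get(placement, default)

refresh_min_ecpm(RAW_BIDS, kv_store)
print(ad_server_floor(kv_store, "banner_top"))
```

The application never runs the heavy processing itself; it only performs a cheap key lookup against results that the scheduled job keeps up to date.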
23. Real time stream data processing
This is the analysis of an event as soon as it happens. The sooner the analysis,
the greater the value obtained from it.
Example
A stock ticker displayed on Yahoo Finance
24. Nature of data processing (Real time stream)
As soon as an event happens, its log entry is
collected
All log entries are buffered and made available
to the processing layer
Records are pulled from the message buffer and
processed
25. Expectations from data platform (Real time stream)
Scalable message buffer: a message buffer that holds received messages until they are pulled
for processing
Real time stream processing engine: pulls and processes records in real time, and gives users the ability to
define custom data processing
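A small single-process sketch of the collect, buffer and pull pattern above. Python's `queue.Queue` stands in for a scalable message buffer, and the event fields, the symbol `XYZ` and the handler `update_ticker` are made up for the example.

```python
import queue

message_buffer = queue.Queue()  # stand-in for a scalable message buffer

def collect(event):
    """Collection side: buffer the log entry as soon as the event happens."""
    message_buffer.put(event)

def process_available(handler):
    """Processing side: pull every buffered record and apply custom logic."""
    results = []
    while True:
        try:
            event = message_buffer.get_nowait()
        except queue.Empty:
            break
        results.append(handler(event))
    return results

latest_price = {}

def update_ticker(event):
    """User-defined processing: keep the most recent price per symbol."""
    latest_price[event["symbol"]] = event["price"]
    return f'{event["symbol"]}={event["price"]}'

for price in (101.5, 102.0, 101.0):
    collect({"symbol": "XYZ", "price": price})

print(process_available(update_ticker))
print(latest_price)
```

The buffer decouples the rate at which events arrive from the rate at which they are processed, which is why it appears as its own expectation.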
26. Let us take a look at the superset of expectations across
all use cases
28. Superset of expectations
Expectation / Capability
Complex data analysis using query language
Scheduled ETL data processing
Data store for interactive data analysis
Data ingestion with timely arrival of data
Scalable message buffer to be consumed by stream data processing
Streaming data processing platform
Needed by (use case)
Static reports
Ad hoc data analysis
BI reporting
Dashboard reports
App specific data processing
Real time stream data processing
Summarise all
30. We have identified the common set of features expected
from a data platform
by most analytics use cases
Let us map these to data platform components
Conclude
31. Capabilities provided by data platform components
Expectation / Capability
Complex data analysis using query language
Scheduled ETL data processing
Data store for interactive data analysis
Data ingestion with timely arrival of data
Scalable message buffer to be consumed by stream data processing
Streaming data processing platform
Supported by data platform component (tools)
Data Ingestion: Flume, Kafka, Scribe
Batch data processing: Hive, Mapred
Workflow scheduler: Oozie
Interactive data stores: Hbase, Spark, ..
Message buffers: Kafka
Real time stream engine: Storm, Spark
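The component and tool lists on this slide can also be held as a simple lookup structure. The pairing below assumes that the components and tools correspond row by row in the order they are listed; the helper `components_using` is a hypothetical name added for illustration.

```python
# Component -> tools, paired by the order in which the slide lists them
# (an assumption about the slide layout, not a statement by the authors).
COMPONENT_TOOLS = {
    "Data Ingestion": ["Flume", "Kafka", "Scribe"],
    "Batch data processing": ["Hive", "Mapred"],
    "Workflow scheduler": ["Oozie"],
    "Interactive data stores": ["Hbase", "Spark"],
    "Message buffers": ["Kafka"],
    "Real time stream engine": ["Storm", "Spark"],
}

def components_using(tool):
    """Return every component that lists the given tool."""
    return [name for name, tools in COMPONENT_TOOLS.items() if tool in tools]

print(components_using("Kafka"))
```

A lookup like this makes it easy to see that one tool can serve more than one component, as Kafka and Spark do here.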
33. Going backwards
Now you know about
Data platform components
the capabilities supported by those components
and how they satisfy the features needed by analytics use cases
34. Thank You
Get in touch with us at
contactus@dataeaze.io