2. GOAL: DATA-DRIVEN BUSINESS DECISIONS and ACTIONS
BASE: SMART COLLECTION and STORING OF DATA
Buzzwords: Hadoop, document databases, columnar stores, data lake
PATH: ACTIONABLE INTERACTIVE INFOGRAPHICS
Buzzwords: dashboards, predictive/prescriptive analytics, self-service BI, machine learning
3. NOT ELIMINATING – AUGMENTING!
We leave the current DWH operational and intact
Not Revolution – EVOLUTION!
• of storage
• of visualisation
• of decision making
In the direction of a business with a truly cross-functional team
A shift from traditional reporting to BI and Data Science
Only raw data persists; computations and visualisations are produced “as the need arises”
New architecture and software:
• modern analytical tools
• machine learning
• graph databases
OVERALL STRATEGY DECISIONS
4. Challenges to overcome
• in BASE: Volume, Variety, Complexity, Security
• in PATH: Resources, Ownership, Question Repository, Design
Can be overcome by:
• Right tech platforms
• Right competence
• Cross-functional team
5. How one should view data
STORAGE / TRANSFORMATION / VISUALISATION
concept:
• Storage: essentially, a file in a folder on disk
• Transformation: essentially, making a new file on disk or in memory
• Visualisation: essentially, “playing” the file in an appropriate player
DWH:
• Storage: mdf and ldf files; only relational or dimensional data
• Transformation: SQL, SSIS, C#
datalake:
• Storage: anything: files, databases, KV-stores etc.
• Transformation: rich programmatic interface
Tools to design and publish reports
may be moved to the cloud
is provided in the cloud
ODBC
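The three concepts on this slide can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the file name `sales_raw.csv`, its columns and the text bar chart are hypothetical and do not come from the deck.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def store_raw(folder: Path) -> Path:
    """STORAGE: the raw data is just a file in a folder on disk."""
    raw_path = folder / "sales_raw.csv"  # hypothetical file name
    raw_path.write_text(
        "region,amount\nnorth,10\nsouth,5\nnorth,7\n", encoding="utf-8"
    )
    return raw_path


def transform(raw_path: Path) -> dict[str, int]:
    """TRANSFORMATION: derive a new dataset in memory from the raw file."""
    totals: dict[str, int] = defaultdict(int)
    with raw_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            totals[row["region"]] += int(row["amount"])
    return dict(totals)


def visualise(totals: dict[str, int]) -> str:
    """VISUALISATION: "play" the derived data, here as a text bar chart."""
    return "\n".join(
        f"{region:<6}{'#' * amount} {amount}"
        for region, amount in sorted(totals.items())
    )


with tempfile.TemporaryDirectory() as tmp:
    totals = transform(store_raw(Path(tmp)))
    chart = visualise(totals)
    print(chart)
```

Only the raw file is written to disk; the totals and the chart are recomputed whenever they are needed.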
6. Hadoop: redundant cluster file system + MapReduce
Hadoop: A yellow stuffed elephant
In Cutting's own words:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
Why an unrelated, meaningless name
In Cutting's own words:
“The rules of names for software is they're meaningless because sometimes the use of a particular piece of software drifts, and if your name is too closely associated with that, it could end up being wrong over time.”
Doug Cutting with the famous elephant
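The MapReduce half of the slide title can be illustrated with a word count. The sketch below runs in a single Python process and does not use Hadoop's API; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here for illustration. On a real cluster the map and reduce tasks would run on many nodes, reading input from the redundant file system.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


lines = ["yellow elephant", "stuffed yellow elephant", "elephant"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines without changing the result.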
7. Modern Cloud Architecture
STORAGE TRANSFORMATION VISUALISATION
Sources: files, pictures, databases
push
Azure cloud: here we store all possible data formats within the organization using the Azure tech stack.
Exists on top of HDInsight
Can consume data from diverse sources
Python/Java. With just a few lines of code:
• create and persist resilient distributed datasets
• transform them into dataframes
• aggregate as needed
Interactive notebooks (internal marketplace of ideas)
Modern visualisation tools: Tableau or Power BI
ODBC interface/native connector
Exists on top of Spark in Azure
Conductor of cluster resources and distributed calculations
HDInsight (Hadoop) for files
DocumentDB for rich docs
BLOB storage for media
SQL Azure for tabular data
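The storage split listed above could be expressed as a small routing rule. This is a sketch only: the store names are taken from the slide, while the function `choose_store` and the mapping from file extensions to stores are assumptions made for illustration.

```python
from pathlib import PurePath

# Store names come from the slide; the extension lists are assumptions.
STORE_BY_EXTENSION = {
    ".json": "DocumentDB",    # rich documents
    ".xml": "DocumentDB",
    ".jpg": "BLOB storage",   # media
    ".png": "BLOB storage",
    ".mp4": "BLOB storage",
    ".csv": "SQL Azure",      # tabular data
}
DEFAULT_STORE = "HDInsight (Hadoop)"  # any other file


def choose_store(filename: str) -> str:
    """Pick a target store for an incoming source file by its extension."""
    suffix = PurePath(filename).suffix.lower()
    return STORE_BY_EXTENSION.get(suffix, DEFAULT_STORE)


for name in ("invoice.json", "photo.JPG", "customers.csv", "server.log"):
    print(name, "->", choose_store(name))
```

In practice the decision would more likely depend on the content and intended use of the data than on the file extension alone.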
8. Source files Data lake
Change in Spark jobs
New reports
Change in source system: column types, encoding, extra fields
Supplier rebuilds the file export: column types, encoding, extra fields
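The three kinds of drift named on this slide (column types, encoding, extra fields) can be absorbed when the raw file is read, so that downstream jobs and reports keep working. The following is a minimal sketch, assuming a CSV export; the schema in `EXPECTED`, the function `read_export` and the list of candidate encodings are hypothetical.

```python
import csv
import io

# Hypothetical schema: the columns the downstream job relies on.
EXPECTED = {"customer_id": int, "amount": float}


def read_export(raw: bytes, encodings=("utf-8", "cp1252")):
    """Parse a supplier CSV export while tolerating common kinds of drift."""
    # Encoding drift: try each candidate encoding in turn.
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the file")

    rows, extra_fields = [], set()
    for record in csv.DictReader(io.StringIO(text, newline="")):
        # Extra fields: report unknown columns instead of failing on them.
        extra_fields.update(
            name for name in record if name is not None and name not in EXPECTED
        )
        row = {}
        for name, cast in EXPECTED.items():
            # Type drift: a missing or unparsable value becomes None.
            try:
                row[name] = cast(record[name])
            except (KeyError, TypeError, ValueError):
                row[name] = None
        rows.append(row)
    return rows, sorted(extra_fields)


raw_export = "customer_id,amount,n\u00e4me\n1,9.5,\u00c5sa\nx,3,Bo\n".encode("cp1252")
rows, extras = read_export(raw_export)
print(rows, extras)
```

Since the raw file stays untouched in the data lake, a stricter or looser reader can be rerun over the same data later.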
10. Blog and video resources for Azure cloud services
https://www.youtube.com/playlist?list=PLeIihrNL8cl4BiKiD-VSTah_XZqmaJR3p
http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
https://blogs.msdn.microsoft.com/azuredatalake/
11. STREAMING DATA TO COLLECT:
• Social networks
• Transactions “as they come”
• Face recognition
• Geolocation
SNAPSHOT DATA TO COLLECT:
• Customer profile
• Aggregated snapshots
• Transactional history
• Geodata
Possible use cases:
• Segmentation + targeting
• Fraudulent transactions
• Churn
• Personalized support
• Immediate risk recalculation
• Click interactions + log analysis
• Customer lifetime score
BENEFITS-SOLUTIONS-ACTIONS